supervised learning
in supervised learning we deal with pairs of records $u,v$
goal is to predict $v$ from $u$ using a prediction model
the output records $v^i$ 'supervise' the learning of the model
unsupervised learning
in unsupervised learning, we deal with only records $u$
goal is to build a data model of $u$, in order to
- reveal structure in $u$
- impute missing entries (fields) in $u$
- detect anomalies
when does a vector $x \in \R^d$ 'look like' the vectors in a data set?
a data model is specified by an implausibility function or loss function $\loss:\R^d \to \R$
- $\loss(x)$ small means $x$ 'looks like' our data, or is 'typical'
- $\loss(x)$ large means $x$ does not look like our data
if we have a probability density model $p(x)$ of the data, we can take $\loss (x)= -\log p(x)$, the negative log density
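For example (a minimal sketch, assuming a standard Gaussian density model, which is a hypothetical choice not made above), the negative log density grows as $x$ moves away from the bulk of the data:
import numpy as np

def neg_log_density(x):
    # -log p(x) for a standard Gaussian model p on R^d (hypothetical model)
    d = len(x)
    return 0.5*d*np.log(2*np.pi) + 0.5*np.sum(x**2)

print(neg_log_density(np.zeros(3)))   # small: 'looks like' data concentrated near the origin
print(neg_log_density(5*np.ones(3)))  # large: implausible under this model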
constant data model: $x$ is near a fixed vector $\theta \in \R^d$
$\theta\in \R^d$ parametrizes the model
some implausibility functions:
- $\loss_\theta(x)=\|x-\theta\|_2^2=\sum_{i=1}^d (x_i-\theta_i)^2$ (square loss)
- $\loss_\theta(x)=\|x-\theta\|_1=\sum_{i=1}^d |x_i-\theta_i|$ (absolute loss)
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(370)
# generate a synthetic 2-D data set, one data point per column
x = np.random.randn(2,100)
x[0,:] += 0.2*x[1,:]
x += 2*np.random.rand(2,1)
# fitted parameter of the constant model: the mean of the data vectors
x_mean = np.mean(x,axis=1)
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.plot(x_mean[0], x_mean[1], 'o', label=r'$\theta$', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
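As a quick check (a sketch using the data and fitted $\theta$ above; the two test points are arbitrary), we can evaluate the two implausibility functions at a data point and at a point far from the data:
def sq_loss(z, theta):
    return np.sum((z - theta)**2)      # square loss ||z - theta||_2^2

def abs_loss(z, theta):
    return np.sum(np.abs(z - theta))   # absolute loss ||z - theta||_1

z_typical = x[:, 0]                    # one of the data points: small loss
z_far = np.array([6.0, -6.0])          # far from the data: large loss
print(sq_loss(z_typical, x_mean), abs_loss(z_typical, x_mean))
print(sq_loss(z_far, x_mean), abs_loss(z_far, x_mean))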
$k$-means data model: $x$ is near one of $k$ representatives $\theta_1, \ldots, \theta_k \in \R^d$
implausibility function: $\loss_\theta(x)=\min_{j=1,\ldots,k} \|x-\theta_j\|^2$,
i.e., $x$ is plausible when the minimum squared distance to the representatives is small
np.random.seed(370)
x1 = np.random.randn(2,100)
x1[0,:] += 0.2*x1[1,:]
x1 += 2*np.random.rand(2,1)
x2 = np.random.randn(2,50)
x2 *= 0.6
x2 += np.array([[-1],[3]])
x3 = np.random.randn(2,200)
x3[0,:] -= 1*x3[1,:]
x3 += -3*np.random.rand(2,1)
x = np.hstack((x1,x2,x3))
x1_mean = np.mean(x1,axis=1)
x2_mean = np.mean(x2,axis=1)
x3_mean = np.mean(x3,axis=1)
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.plot(x1_mean[0], x1_mean[1], 'o', label=r'$\theta_1$', markersize=10)
plt.plot(x2_mean[0], x2_mean[1], 'o', label=r'$\theta_2$', markersize=10)
plt.plot(x3_mean[0], x3_mean[1], 'o', label=r'$\theta_3$', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
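A sketch of this implausibility function, using the three group means computed above as the representatives (the two test points are arbitrary):
thetas = [x1_mean, x2_mean, x3_mean]

def kmeans_loss(z, thetas):
    # squared distance to the nearest representative
    return min(np.sum((z - t)**2) for t in thetas)

print(kmeans_loss(np.array([-1.0, 3.0]), thetas))  # near the second group: small
print(kmeans_loss(np.array([5.0, -5.0]), thetas))  # far from all groups: large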
in many applications, some entries (fields) of $x$ are unknown or missing, often denoted NA or NaN
$\mathcal K \subseteq \{1,\ldots,d\}$ is the set of known entries
we use our data model to guess or impute the
missing entries
we'll denote the imputed vector as $\hat x$
$\hat x_i = x_i$ for $i \in \mathcal K$
imputation example: with $\mathcal K=\{1,3\}$, entries $x_1$ and $x_3$ are known and the remaining entries must be imputed
given $x$ with some entries unknown
constant data model with implausibility function $\loss_\theta(x)=\|x-\theta\|^2$
so $\hat x_i = x_i$ for $i\in \mathcal K$
for $i \not\in \mathcal K$, we take $\hat x_i = \theta_i$
i.e., for the unknown entries, guess the model parameter entries
np.random.seed(370)
x = np.random.randn(2,100)
x[0,:] += 0.2*x[1,:]
x += 2*np.random.rand(2,1)
x_mean = np.mean(x,axis=1)
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.plot(x_mean[0], x_mean[1], 'o', label=r'$\theta$', markersize=10)
# a vector with unknown first entry and known second entry x_2 = 3;
# the unknown entry is imputed with the corresponding entry of theta
x_2 = 3
plt.axhline(x_2, color='r')
plt.text(6.1, x_2, fr'$x=( ?, {x_2} )$', fontsize=12)
plt.plot(x_mean[0], x_2, 'o', label=fr'$\hat x=({x_mean[0]:4.2f},{x_2})$', markersize=10)
x_2 = -1
plt.axhline(x_2, color='r')
plt.text(6.1, x_2, fr'$x=( ?, {x_2} )$', fontsize=12)
plt.plot(x_mean[0], x_2, 'o', label=fr'$\hat x=({x_mean[0]:4.2f},{x_2})$', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
print (f'theta:{x_mean}')
theta:[1.74041577 0.82759836]
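The same imputation written out directly (a minimal sketch; the partially observed vector below is hypothetical):
x_partial = np.array([np.nan, 3.0])         # first entry unknown, second entry known
known = ~np.isnan(x_partial)                # the set K of known entries
x_hat = np.where(known, x_partial, x_mean)  # keep known entries, impute the rest with theta
print(x_hat)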
given $x$ with some entries unknown
$k$-means data model with implausibility function
$$\loss_\theta(x)=\min_{j=1,\ldots,k} \|x-\theta_j\|^2$$
find the nearest representative $\theta_j$ to $x$, using only the known entries,
i.e., find $j$ that minimizes $\sum_{i \in \mathcal K} (x_i-(\theta_j)_i)^2$
guess $\hat x_i = (\theta_j)_i$ for $i \not\in \mathcal K$
i.e., for the unknown entries,
guess the entries of the closest representative
np.random.seed(370)
x1 = np.random.randn(2,100)
x1[0,:] += 0.2*x1[1,:]
x1 += 2*np.random.rand(2,1)
x2 = np.random.randn(2,50)
x2 *= 0.6
x2 += np.array([[-1],[3]])
x3 = np.random.randn(2,200)
x3[0,:] -= 1*x3[1,:]
x3 += -3*np.random.rand(2,1)
x = np.hstack((x1,x2,x3))
x1_mean = np.mean(x1,axis=1)
x2_mean = np.mean(x2,axis=1)
x3_mean = np.mean(x3,axis=1)
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.plot(x1_mean[0], x1_mean[1], 'o', label=r'$\theta_1$', markersize=10)
plt.plot(x2_mean[0], x2_mean[1], 'o', label=r'$\theta_2$', markersize=10)
plt.plot(x3_mean[0], x3_mean[1], 'o', label=r'$\theta_3$', markersize=10)
# a vector with unknown first entry and known second entry x_2 = -2;
# the unknown entry is imputed from the nearest representative (here theta_3)
x_2 = -2
plt.axhline(x_2, color='r')
plt.text(6.1, x_2, fr'$x=( ?, {x_2} )$', fontsize=12)
plt.plot(x3_mean[0], x_2, 'o', label=fr'$\hat x=({x3_mean[0]:4.2f},{x_2})$', markersize=10)
x_2 = 0
plt.axhline(x_2, color='r')
plt.text(6.1, x_2, fr'$x=( ?, {x_2} )$', fontsize=12)
plt.plot(x1_mean[0], x_2, 'o', label=fr'$\hat x=({x1_mean[0]:4.2f},{x_2})$', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
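The same $k$-means imputation written out directly (a minimal sketch; the partially observed vector is hypothetical):
thetas = np.column_stack((x1_mean, x2_mean, x3_mean))  # 2 x 3 matrix of representatives
x_partial = np.array([np.nan, -2.0])                   # first entry unknown
known = ~np.isnan(x_partial)
# squared distance to each representative, using only the known entries
dists = np.sum((thetas[known, :] - x_partial[known][:, None])**2, axis=0)
j = np.argmin(dists)                                   # nearest representative
x_hat = np.where(known, x_partial, thetas[:, j])       # impute unknown entries from theta_j
print(j, x_hat)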
suppose we wish to predict $y\in \R$ based on $x\in \R^d$
we have some training data $x^1, \ldots, x^n$, $y^1, \ldots, y^n$
define $(d+1)$-vector $\tilde x=(x,y)$
build data model for $\tilde x$ using training data
$\tilde x^1, \ldots, \tilde x^n$
to predict $y$ given a new $x$, impute the missing last entry of $\tilde x=(x,{\it ?})$
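A minimal sketch of this idea with a $k$-means data model on the stacked vectors (the 1-D training data below is synthetic and hypothetical):
import numpy as np
import sklearn.cluster as skc

rng = np.random.default_rng(0)
x_tr = np.concatenate((rng.normal(-2, 0.5, 50), rng.normal(2, 0.5, 50)))
y_tr = np.where(x_tr < 0, -1.0, 1.0) + rng.normal(0, 0.1, 100)
xt = np.vstack((x_tr, y_tr))                 # stacked (d+1)-vectors, one per column

model = skc.KMeans(n_clusters=2, n_init=10).fit(xt.T)
reps = model.cluster_centers_.T              # columns are the representatives

x_new = 1.7                                  # predict y for this new x
j = np.argmin((reps[0, :] - x_new)**2)       # nearest representative, using only the x entries
y_hat = reps[1, j]                           # imputed (predicted) y
print(y_hat)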
we can validate a proposed data model (and imputation method):
divide data into a training and a test set
build data model on the training set
mask some entries of the vectors in the test set (i.e., replace them with ?)
impute the masked entries, and compare the imputed values with the true held-out values
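A sketch of this procedure for the constant data model, using the data set $x$ above (the 80/20 split and the choice to mask the first entry are illustrative):
n = x.shape[1]
perm = np.random.permutation(n)
train, test = perm[:int(0.8*n)], perm[int(0.8*n):]
theta = np.mean(x[:, train], axis=1)          # build constant data model on the training set
held_out = x[0, test]                         # mask the first entry of each test vector ...
imputed = np.full(held_out.shape, theta[0])   # ... and impute it with the model parameter
print(np.mean((imputed - held_out)**2))       # mean square imputation error on the test set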
suppose we are given data $x^1, \ldots, x^n$ and a parametrized implausibility function $\loss_\theta(x)$
how do we choose the parameter $\theta$?
the average implausibility or empirical loss is
$$\eloss(\theta)=\frac{1}{n}\sum_{i=1}^n \loss_\theta(x^i)$$
we choose $\theta$ to minimize $\eloss(\theta)$,
(possibly) subject to $\theta \in \Theta$, the set of acceptable parameters
constant data model with sum of squares implausibility function $\loss_\theta(x)=\|x-\theta\|^2$
empirical loss is $\eloss(\theta)=\frac{1}{n}\sum_{i=1}^n \|x^i-\theta\|^2$
which is minimized by $\theta=\frac{1}{n}\sum_{i=1}^n x^i$, the mean of the data vectors
constant data model with sum of absolute values implausibility function $\loss_\theta(x)=\|x-\theta\|_1$
empirical loss is $\eloss(\theta)=\frac{1}{n}\sum_{i=1}^n \|x^i-\theta\|_1$
which is minimized by the elementwise median of the data vectors
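A quick numerical check of these two facts, using the data set $x$ above (the perturbation is arbitrary):
sq_eloss  = lambda th: np.mean(np.sum((x - th[:, None])**2, axis=0))
abs_eloss = lambda th: np.mean(np.sum(np.abs(x - th[:, None]), axis=0))
mean_, median_ = np.mean(x, axis=1), np.median(x, axis=1)
delta = np.array([0.3, -0.2])                          # an arbitrary perturbation
print(sq_eloss(mean_), sq_eloss(mean_ + delta))        # moving away from the mean increases the loss
print(abs_eloss(median_), abs_eloss(median_ + delta))  # moving away from the median does too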
$k$-means data model with implausibility function $\loss_\theta(x)=\min_{j=1,\ldots,k} \|x-\theta_j\|^2$
parameter is $d \times k$ matrix with columns $\theta_1, \ldots, \theta_k$
empirical loss is $\eloss(\theta)=\frac{1}{n}\sum_{i=1}^n \min_{j=1,\ldots,k} \|x^i-\theta_j\|^2$
this is the $k$-means objective function!
we can use the $k$-means algorithm to (approximately) minimize $\eloss(\theta)$, i.e., fit a $k$-means model
the $k$-means algorithm introduces a vector of group assignments $c\in \Z^n$
(so $c_i \in \{1,\ldots, k\}$; $c_i$ is the index of the representative assigned to $x^i$)
we minimize $\frac{1}{n} \sum_{i=1}^n \|x^i-\theta_{c_i}\|^2$ over both $c$ and $\theta_1, \ldots, \theta_k$
- we can minimize over $c$ using $c_i = \underset{j}{\text{argmin}} \|x^i-\theta_j\|^2$
- we can minimize over $\theta_1, \ldots, \theta_k$
using $\theta_i$ as the average of $\{ x^j \mid c_j = i\}$
$k$-means algorithm alternates between these two steps
it is a heuristic for (approximately) minimizing $\eloss(\theta)$
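Neither step increases the empirical loss, so it is useful to track $\eloss(\theta)$ across iterations; a minimal sketch of evaluating it for a given set of representatives:
def kmeans_eloss(x, theta):
    # x is d x n (one data point per column), theta is d x k (one representative per column)
    dist2 = np.sum((x[:, :, None] - theta[:, None, :])**2, axis=0)  # n x k squared distances
    return np.mean(np.min(dist2, axis=1))                           # average over the data points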
Given the data points below (which look like they came from three different processes),
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
we fit a $k$-means model with $k=3$.
np.random.seed(2)
k = 3
d, n = x.shape
c = np.random.randint(k, size=(n,))   # random initial assignments (overwritten in the first assignment step)
# initialization: pick k random data points as the initial representatives
theta = x[:, np.random.permutation(n)[:k]]
plt.figure(figsize=(18,12))
maxiter = 6
for l in range(maxiter):
    # assignment step: assign each point to its nearest representative
    dist = np.zeros((k,n))
    for i in range(k):
        dist[i,:] = np.linalg.norm(x - theta[:,i].reshape(d,1), axis=0)
    c = np.argmin(dist, axis=0)
    plt.subplot(2,3,l+1)
    for i in range(k):
        plt.plot(x[0,c==i], x[1,c==i], 'o', alpha=0.5)
        plt.plot(theta[0,i], theta[1,i], 'o', markersize=10)
    plt.xlim([-6, 6]), plt.ylim([-6, 6])
    plt.title(f'iteration number: {l}'), plt.grid()
    # update step: move each representative to the mean of its assigned points
    for i in range(k):
        theta[:,i] = np.mean(x[:,c==i], axis=1)
plt.show()
The $k$-means model can be fit more efficiently via the scikit-learn package.
import sklearn.cluster as skc
# fit k-means with scikit-learn (rows of x.T are the data points)
model = skc.KMeans(n_clusters=k, max_iter=6).fit(x.T)
theta = model.cluster_centers_.T
c = model.labels_
plt.figure(figsize=(6,6))
for i in range(k):
    plt.plot(x[0,c==i], x[1,c==i], 'o', alpha=0.5)
    plt.plot(theta[0,i], theta[1,i], 'o', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.title(r'$k$-means via scikit-learn package'), plt.grid()
plt.show()
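The fitted model's `inertia_` attribute is the sum of squared distances from each point to its nearest center, so dividing by $n$ recovers the empirical loss $\eloss(\theta)$ (and should match the sketch defined earlier):
print(model.inertia_ / n)        # average squared distance to the nearest center
print(kmeans_eloss(x, theta))    # same value, computed with the sketch above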