supervised learning
in supervised learning we deal with pairs of records $u,v$
goal is to predict $v$ from $u$ using a prediction model
the output records $v^i$ 'supervise' the learning of the model
unsupervised learning
in unsupervised learning, we deal with only records $u$
goal is to build a data model of $u$, in order to
- reveal structure in $u$
- impute missing entries (fields) in $u$
- detect anomalies
when does a vector $x \in \R^d$ 'look like' the vectors in a data set?
a data model is specified by an implausibility function or loss function $\loss:\R^d \to \R$
- $\loss(x)$ small means $x$ 'looks like' our data, or is 'typical'
- $\loss(x)$ large means $x$ does not look like our data
if we have a probability density model $p(x)$ of the data, we can take $\loss (x)= -\log p(x)$, the negative log density
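For example (a minimal sketch, assuming a standard Gaussian density model, which is a hypothetical choice not made above), the negative log density grows as $x$ moves away from the bulk of the data:
import numpy as np

def neg_log_density(x):
    # -log p(x) for a standard Gaussian model p on R^d (hypothetical model)
    d = len(x)
    return 0.5*d*np.log(2*np.pi) + 0.5*np.sum(x**2)

print(neg_log_density(np.zeros(3)))   # small: 'looks like' data concentrated near the origin
print(neg_log_density(5*np.ones(3)))  # large: implausible under this model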
constant data model: $x$ is near a fixed vector $\theta \in \R^d$
$\theta\in \R^d$ parametrizes the model
some implausibility functions:
- $\loss_\theta(x)=\|x-\theta\|_2^2=\sum_{i=1}^d (x_i-\theta_i)^2$ (square loss)
- $\loss_\theta(x)=\|x-\theta\|_1=\sum_{i=1}^d |x_i-\theta_i|$ (absolute loss)
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(370)
# generate a synthetic 2-D data set, one data point per column
x = np.random.randn(2,100)
x[0,:] += 0.2*x[1,:]
x += 2*np.random.rand(2,1)
# fitted parameter of the constant model: the mean of the data vectors
x_mean = np.mean(x,axis=1)
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.plot(x_mean[0], x_mean[1], 'o', label=r'$\theta$', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
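As a quick check (a sketch using the data and fitted $\theta$ above; the two test points are arbitrary), we can evaluate the two implausibility functions at a data point and at a point far from the data:
def sq_loss(z, theta):
    return np.sum((z - theta)**2)      # square loss ||z - theta||_2^2

def abs_loss(z, theta):
    return np.sum(np.abs(z - theta))   # absolute loss ||z - theta||_1

z_typical = x[:, 0]                    # one of the data points: small loss
z_far = np.array([6.0, -6.0])          # far from the data: large loss
print(sq_loss(z_typical, x_mean), abs_loss(z_typical, x_mean))
print(sq_loss(z_far, x_mean), abs_loss(z_far, x_mean))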
$k$-means data model: $x$ is near one of $k$ representatives $\theta_1, \ldots, \theta_k \in \R^d$
implausibility function: $\loss_\theta(x)=\min_{j=1,\ldots,k} \|x-\theta_j\|^2$,
i.e., $x$ is plausible when the minimum squared distance to the representatives is small
np.random.seed(370)
x1 = np.random.randn(2,100)
x1[0,:] += 0.2*x1[1,:]
x1 += 2*np.random.rand(2,1)
x2 = np.random.randn(2,50)
x2 *= 0.6
x2 += np.array([[-1],[3]])
x3 = np.random.randn(2,200)
x3[0,:] -= 1*x3[1,:]
x3 += -3*np.random.rand(2,1)
x = np.hstack((x1,x2,x3))
x1_mean = np.mean(x1,axis=1)
x2_mean = np.mean(x2,axis=1)
x3_mean = np.mean(x3,axis=1)
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.plot(x1_mean[0], x1_mean[1], 'o', label=r'$\theta_1$', markersize=10)
plt.plot(x2_mean[0], x2_mean[1], 'o', label=r'$\theta_2$', markersize=10)
plt.plot(x3_mean[0], x3_mean[1], 'o', label=r'$\theta_3$', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
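A sketch of this implausibility function, using the three group means computed above as the representatives (the two test points are arbitrary):
thetas = [x1_mean, x2_mean, x3_mean]

def kmeans_loss(z, thetas):
    # squared distance to the nearest representative
    return min(np.sum((z - t)**2) for t in thetas)

print(kmeans_loss(np.array([-1.0, 3.0]), thetas))  # near the second group: small
print(kmeans_loss(np.array([5.0, -5.0]), thetas))  # far from all groups: large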
in many applications, some entries (fields) of $x$ are unknown or missing, often denoted NA or NaN
$\mathcal K \subseteq \{1,\ldots,d\}$ is the set of known entries
we use our data model to guess or impute the
missing entries
we'll denote the imputed vector as $\hat x$
$\hat x_i = x_i$ for $i \in \mathcal K$
imputation example: with $\mathcal K=\{1,3\}$, entries $x_1$ and $x_3$ are known and the remaining entries must be imputed
given $x$ with some entries unknown
constant data model with implausibility function $\loss_\theta(x)=\|x-\theta\|^2$
so $\hat x_i = x_i$ for $i\in \mathcal K$
for $i \not\in \mathcal K$, we take $\hat x_i = \theta_i$
i.e., for the unknown entries, guess the model parameter entries
np.random.seed(370)
x = np.random.randn(2,100)
x[0,:] += 0.2*x[1,:]
x += 2*np.random.rand(2,1)
x_mean = np.mean(x,axis=1)
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.plot(x_mean[0], x_mean[1], 'o', label=r'$\theta$', markersize=10)
# a vector with unknown first entry and known second entry x_2 = 3;
# the unknown entry is imputed with the corresponding entry of theta
x_2 = 3
plt.axhline(x_2, color='r')
plt.text(6.1, x_2, fr'$x=( ?, {x_2} )$', fontsize=12)
plt.plot(x_mean[0], x_2, 'o', label=fr'$\hat x=({x_mean[0]:4.2f},{x_2})$', markersize=10)
x_2 = -1
plt.axhline(x_2, color='r')
plt.text(6.1, x_2, fr'$x=( ?, {x_2} )$', fontsize=12)
plt.plot(x_mean[0], x_2, 'o', label=fr'$\hat x=({x_mean[0]:4.2f},{x_2})$', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
print (f'theta:{x_mean}')
theta:[1.74041577 0.82759836]
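The same imputation written out directly (a minimal sketch; the partially observed vector below is hypothetical):
x_partial = np.array([np.nan, 3.0])         # first entry unknown, second entry known
known = ~np.isnan(x_partial)                # the set K of known entries
x_hat = np.where(known, x_partial, x_mean)  # keep known entries, impute the rest with theta
print(x_hat)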
given $x$ with some entries unknown
$k$-means data model with implausibility function
$$\loss_\theta(x)=\min_{j=1,\ldots,k} \|x-\theta_j\|^2$$
find the nearest representative $\theta_j$ to $x$, using only the known entries,
i.e., find $j$ that minimizes $\sum_{i \in \mathcal K} (x_i-(\theta_j)_i)^2$
guess $\hat x_i = (\theta_j)_i$ for $i \not\in \mathcal K$
i.e., for the unknown entries,
guess the entries of the closest representative
np.random.seed(370)
x1 = np.random.randn(2,100)
x1[0,:] += 0.2*x1[1,:]
x1 += 2*np.random.rand(2,1)
x2 = np.random.randn(2,50)
x2 *= 0.6
x2 += np.array([[-1],[3]])
x3 = np.random.randn(2,200)
x3[0,:] -= 1*x3[1,:]
x3 += -3*np.random.rand(2,1)
x = np.hstack((x1,x2,x3))
x1_mean = np.mean(x1,axis=1)
x2_mean = np.mean(x2,axis=1)
x3_mean = np.mean(x3,axis=1)
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.plot(x1_mean[0], x1_mean[1], 'o', label=r'$\theta_1$', markersize=10)
plt.plot(x2_mean[0], x2_mean[1], 'o', label=r'$\theta_2$', markersize=10)
plt.plot(x3_mean[0], x3_mean[1], 'o', label=r'$\theta_3$', markersize=10)
# a vector with unknown first entry and known second entry x_2 = -2;
# the unknown entry is imputed from the nearest representative (here theta_3)
x_2 = -2
plt.axhline(x_2, color='r')
plt.text(6.1, x_2, fr'$x=( ?, {x_2} )$', fontsize=12)
plt.plot(x3_mean[0], x_2, 'o', label=fr'$\hat x=({x3_mean[0]:4.2f},{x_2})$', markersize=10)
x_2 = 0
plt.axhline(x_2, color='r')
plt.text(6.1, x_2, fr'$x=( ?, {x_2} )$', fontsize=12)
plt.plot(x1_mean[0], x_2, 'o', label=fr'$\hat x=({x1_mean[0]:4.2f},{x_2})$', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
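The same $k$-means imputation written out directly (a minimal sketch; the partially observed vector is hypothetical):
thetas = np.column_stack((x1_mean, x2_mean, x3_mean))  # 2 x 3 matrix of representatives
x_partial = np.array([np.nan, -2.0])                   # first entry unknown
known = ~np.isnan(x_partial)
# squared distance to each representative, using only the known entries
dists = np.sum((thetas[known, :] - x_partial[known][:, None])**2, axis=0)
j = np.argmin(dists)                                   # nearest representative
x_hat = np.where(known, x_partial, thetas[:, j])       # impute unknown entries from theta_j
print(j, x_hat)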
suppose we wish to predict $y\in \R$ based on $x\in \R^d$
we have some training data $x^1, \ldots, x^n$, $y^1, \ldots, y^n$
define $(d+1)$-vector $\tilde x=(x,y)$
build data model for $\tilde x$ using training data
$\tilde x^1, \ldots, \tilde x^n$
to predict $y$ given a new $x$, impute the missing last entry of $\tilde x=(x,{\it ?})$
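A minimal sketch of this idea with a $k$-means data model on the stacked vectors (the 1-D training data below is synthetic and hypothetical):
import numpy as np
import sklearn.cluster as skc

rng = np.random.default_rng(0)
x_tr = np.concatenate((rng.normal(-2, 0.5, 50), rng.normal(2, 0.5, 50)))
y_tr = np.where(x_tr < 0, -1.0, 1.0) + rng.normal(0, 0.1, 100)
xt = np.vstack((x_tr, y_tr))                 # stacked (d+1)-vectors, one per column

model = skc.KMeans(n_clusters=2, n_init=10).fit(xt.T)
reps = model.cluster_centers_.T              # columns are the representatives

x_new = 1.7                                  # predict y for this new x
j = np.argmin((reps[0, :] - x_new)**2)       # nearest representative, using only the x entries
y_hat = reps[1, j]                           # imputed (predicted) y
print(y_hat)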
we can validate a proposed data model (and imputation method):
divide data into a training and a test set
build data model on the training set
mask some entries of the vectors in the test set (i.e., replace them with ?)
impute the masked entries, and compare the imputed values with the true held-out values
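A sketch of this procedure for the constant data model, using the data set $x$ above (the 80/20 split and the choice to mask the first entry are illustrative):
n = x.shape[1]
perm = np.random.permutation(n)
train, test = perm[:int(0.8*n)], perm[int(0.8*n):]
theta = np.mean(x[:, train], axis=1)          # build constant data model on the training set
held_out = x[0, test]                         # mask the first entry of each test vector ...
imputed = np.full(held_out.shape, theta[0])   # ... and impute it with the model parameter
print(np.mean((imputed - held_out)**2))       # mean square imputation error on the test set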
suppose we are given data $x^1, \ldots, x^n$ and a parametrized implausibility function $\loss_\theta(x)$
how do we choose the parameter $\theta$?
the average implausibility or empirical loss is
$$\eloss(\theta)=\frac{1}{n}\sum_{i=1}^n \loss_\theta(x^i)$$
we choose $\theta$ to minimize $\eloss(\theta)$,
(possibly) subject to $\theta \in \Theta$, the set of acceptable parameters
constant data model with sum of squares implausibility function $\loss_\theta(x)=\|x-\theta\|^2$
empirical loss is $\eloss(\theta)=\frac{1}{n}\sum_{i=1}^n \|x^i-\theta\|^2$
which is minimized by $\theta=\frac{1}{n}\sum_{i=1}^n x^i$, the mean of the data vectors
constant data model with sum of absolute values implausibility function $\loss_\theta(x)=\|x-\theta\|_1$
empirical loss is $\eloss(\theta)=\frac{1}{n}\sum_{i=1}^n \|x^i-\theta\|_1$
which is minimized by the elementwise median of the data vectors
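A quick numerical check of these two facts, using the data set $x$ above (the perturbation is arbitrary):
sq_eloss  = lambda th: np.mean(np.sum((x - th[:, None])**2, axis=0))
abs_eloss = lambda th: np.mean(np.sum(np.abs(x - th[:, None]), axis=0))
mean_, median_ = np.mean(x, axis=1), np.median(x, axis=1)
delta = np.array([0.3, -0.2])                          # an arbitrary perturbation
print(sq_eloss(mean_), sq_eloss(mean_ + delta))        # moving away from the mean increases the loss
print(abs_eloss(median_), abs_eloss(median_ + delta))  # moving away from the median does too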
$k$-means data model with implausibility function $\loss_\theta(x)=\min_{j=1,\ldots,k} \|x-\theta_j\|^2$
parameter is $d \times k$ matrix with columns $\theta_1, \ldots, \theta_k$
empirical loss is $\eloss(\theta)=\frac{1}{n}\sum_{i=1}^n \min_{j=1,\ldots,k} \|x^i-\theta_j\|^2$
this is the $k$-means objective function!
we can use the $k$-means algorithm to (approximately) minimize $\eloss(\theta)$, i.e., fit a $k$-means model
the $k$-means algorithm introduces a vector of group assignments $c\in \Z^n$
(so $c_i \in \{1,\ldots, k\}$; $c_i$ is the index of the representative assigned to $x^i$)
we minimize $\frac{1}{n} \sum_{i=1}^n \|x^i-\theta_{c_i}\|^2$ over both $c$ and $\theta_1, \ldots, \theta_k$
- we can minimize over $c$ using $c_i = \underset{j}{\text{argmin}} \|x^i-\theta_j\|^2$
- we can minimize over $\theta_1, \ldots, \theta_k$
using $\theta_i$ as the average of $\{ x^j \mid c_j = i\}$
$k$-means algorithm alternates between these two steps
it is a heuristic for (approximately) minimizing $\eloss(\theta)$
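Neither step increases the empirical loss, so it is useful to track $\eloss(\theta)$ across iterations; a minimal sketch of evaluating it for a given set of representatives:
def kmeans_eloss(x, theta):
    # x is d x n (one data point per column), theta is d x k (one representative per column)
    dist2 = np.sum((x[:, :, None] - theta[:, None, :])**2, axis=0)  # n x k squared distances
    return np.mean(np.min(dist2, axis=1))                           # average over the data points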
Given the data points below (which look like they came from three different processes),
plt.figure(figsize=(6,6))
plt.plot(x[0,:], x[1,:], 'o', alpha=0.5, label='data')
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.legend(), plt.grid()
plt.show()
we fit a $k$-means model with $k=3$.
np.random.seed(2)
k = 3
d, n = x.shape
c = np.random.randint(k, size=(n,))   # random initial assignments (overwritten in the first assignment step)
# initialization: pick k random data points as the initial representatives
theta = x[:, np.random.permutation(n)[:k]]
plt.figure(figsize=(18,12))
maxiter = 6
for l in range(maxiter):
    # assignment step: assign each point to its nearest representative
    dist = np.zeros((k,n))
    for i in range(k):
        dist[i,:] = np.linalg.norm(x - theta[:,i].reshape(d,1), axis=0)
    c = np.argmin(dist, axis=0)
    plt.subplot(2,3,l+1)
    for i in range(k):
        plt.plot(x[0,c==i], x[1,c==i], 'o', alpha=0.5)
        plt.plot(theta[0,i], theta[1,i], 'o', markersize=10)
    plt.xlim([-6, 6]), plt.ylim([-6, 6])
    plt.title(f'iteration number: {l}'), plt.grid()
    # update step: move each representative to the mean of its assigned points
    for i in range(k):
        theta[:,i] = np.mean(x[:,c==i], axis=1)
plt.show()
The $k$-means model can be fit more efficiently via the scikit-learn package.
import sklearn.cluster as skc
# fit k-means with scikit-learn (rows of x.T are the data points)
model = skc.KMeans(n_clusters=k, max_iter=6).fit(x.T)
theta = model.cluster_centers_.T
c = model.labels_
plt.figure(figsize=(6,6))
for i in range(k):
    plt.plot(x[0,c==i], x[1,c==i], 'o', alpha=0.5)
    plt.plot(theta[0,i], theta[1,i], 'o', markersize=10)
plt.xlim([-6, 6]), plt.ylim([-6, 6])
plt.title(r'$k$-means via scikit-learn package'), plt.grid()
plt.show()
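The fitted model's `inertia_` attribute is the sum of squared distances from each point to its nearest center, so dividing by $n$ recovers the empirical loss $\eloss(\theta)$ (and should match the sketch defined earlier):
print(model.inertia_ / n)        # average squared distance to the nearest center
print(kmeans_eloss(x, theta))    # same value, computed with the sketch above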