Important: Please read the installation page for details about how to install the toolboxes. $\newcommand{\dotp}[2]{\langle #1, #2 \rangle}$ $\newcommand{\enscond}[2]{\lbrace #1, #2 \rbrace}$ $\newcommand{\pd}[2]{ \frac{ \partial #1}{\partial #2} }$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\umax}[1]{\underset{#1}{\max}\;}$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\uargmin}[1]{\underset{#1}{argmin}\;}$ $\newcommand{\norm}[1]{\|#1\|}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\choice}[1]{ \left\{ \begin{array}{l} #1 \end{array} \right. }$ $\newcommand{\pa}[1]{\left(#1\right)}$ $\newcommand{\diag}[1]{{diag}\left( #1 \right)}$ $\newcommand{\qandq}{\quad\text{and}\quad}$ $\newcommand{\qwhereq}{\quad\text{where}\quad}$ $\newcommand{\qifq}{ \quad \text{if} \quad }$ $\newcommand{\qarrq}{ \quad \Longrightarrow \quad }$ $\newcommand{\ZZ}{\mathbb{Z}}$ $\newcommand{\CC}{\mathbb{C}}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\EE}{\mathbb{E}}$ $\newcommand{\Zz}{\mathcal{Z}}$ $\newcommand{\Ww}{\mathcal{W}}$ $\newcommand{\Vv}{\mathcal{V}}$ $\newcommand{\Nn}{\mathcal{N}}$ $\newcommand{\NN}{\mathcal{N}}$ $\newcommand{\Hh}{\mathcal{H}}$ $\newcommand{\Bb}{\mathcal{B}}$ $\newcommand{\Ee}{\mathcal{E}}$ $\newcommand{\Cc}{\mathcal{C}}$ $\newcommand{\Gg}{\mathcal{G}}$ $\newcommand{\Ss}{\mathcal{S}}$ $\newcommand{\Pp}{\mathcal{P}}$ $\newcommand{\Ff}{\mathcal{F}}$ $\newcommand{\Xx}{\mathcal{X}}$ $\newcommand{\Mm}{\mathcal{M}}$ $\newcommand{\Ii}{\mathcal{I}}$ $\newcommand{\Dd}{\mathcal{D}}$ $\newcommand{\Ll}{\mathcal{L}}$ $\newcommand{\Tt}{\mathcal{T}}$ $\newcommand{\si}{\sigma}$ $\newcommand{\al}{\alpha}$ $\newcommand{\la}{\lambda}$ $\newcommand{\ga}{\gamma}$ $\newcommand{\Ga}{\Gamma}$ $\newcommand{\La}{\Lambda}$ $\newcommand{\si}{\sigma}$ $\newcommand{\Si}{\Sigma}$ $\newcommand{\be}{\beta}$ $\newcommand{\de}{\delta}$ $\newcommand{\De}{\Delta}$ $\newcommand{\phi}{\varphi}$ $\newcommand{\th}{\theta}$ $\newcommand{\om}{\omega}$ $\newcommand{\Om}{\Omega}$ $\newcommand{\eqdef}{\equiv}$
This tour details the logistic classification method (for two classes and multiple classes).
Warning: logistic classification is actually called "logistic regression" in the literature, but it is in fact a classification method.
We recommend that, after doing this Numerical Tour, you apply it to your own data, for instance using a dataset from LibSVM.
Disclaimer: these machine learning tours are intended to be overly-simplistic implementations and applications of baseline machine learning methods. For more advanced uses and implementations, we recommend using a state-of-the-art library, the most well-known being Scikit-Learn.
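For reference only, here is a minimal sketch (not part of the original tour) of how the same kind of binary classification could be done with Scikit-Learn; the tiny synthetic dataset below is purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(0)
Xs = np.vstack((rng.standard_normal((50,2)), rng.standard_normal((50,2)) + 2.5))  # two shifted Gaussian clouds
ys = np.hstack((np.ones(50), -np.ones(50)))                                        # labels in {-1,1}
clf = LogisticRegression().fit(Xs, ys)   # fits both the weights and the bias term
print(clf.coef_, clf.intercept_)         # learned separating hyperplane
print(clf.predict_proba(Xs[:3]))         # class probabilities for the first samples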
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
We define a few helpers.
def find(x): return np.nonzero(x)[0]
Logistic classification is, together with support vector machines (SVM), the baseline method to perform classification. Its main advantage over SVM is that it is a smooth minimization problem, and that it also outputs class probabilities, offering a probabilistic interpretation of the classification.
To understand the behavior of the method, we generate synthetic data distributed according to a mixture of Gaussians, with an overlap governed by an offset $\omega$. Here the class labels are set to $y_i \in \{-1,1\}$ to simplify the equations.
n = 1000 # number of samples
p = 2 # dimensionality
omega = np.array([1,.5])*2.5 # offset
n1 = int(n/2)
X = np.vstack(( np.random.randn(n1,2), np.random.randn(n1,2)+np.ones([n1,1])*omega ))
y = np.vstack(( np.ones([n1,1]), -np.ones([n1,1]) ))
Plot the classes.
I = find(y==-1)
J = find(y==1)
plt.clf()
plt.plot(X[I,0], X[I,1], '.')
plt.plot(X[J,0], X[J,1], '.')
plt.axis('equal');
Logistic classification minimizes a logistic loss in place of the usual $\ell^2$ loss for regression $$ \umin{w} E(w) \eqdef \frac{1}{n} \sum_{i=1}^n L(\dotp{x_i}{w},y_i) $$ where the logistic loss reads $$ L( s,y ) \eqdef \log( 1+\exp(-sy) ). $$ This corresponds to a smooth convex minimization. If $X$ is injective, this is also strictly convex, hence it has a single global minimum.
Compare the binary (ideal) 0-1 loss, the logistic loss and the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss) (the one used for SVM).
t = np.linspace(-3,3,255).transpose()
plt.clf()
plt.plot(t, t>0)                   # 0-1 loss, plotted as a function of t = -s*y
plt.plot(t, np.log(1+np.exp(t)))   # logistic loss log(1+exp(-s*y))
plt.plot(t, np.maximum(1+t,0) )    # hinge loss max(0, 1-s*y)
plt.axis('tight');
plt.legend(['Binary', 'Logistic', 'Hinge']);
This can be interpreted as a [maximum likelihood estimator](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) when one models the probability of belonging to the two classes for a sample $x_i$ as $$ h(x_i) \eqdef (\th(\dotp{x_i}{w}),1-\th(\dotp{x_i}{w})) \qwhereq \th(s) \eqdef \frac{e^{s}}{1+e^s} = (1+e^{-s})^{-1}. $$
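As a quick, purely illustrative sanity check of this interpretation (the score and label values below are arbitrary), the logistic loss coincides with the negative log-likelihood $-\log \th(sy)$:
s0, y0 = 0.7, -1.0                       # arbitrary score and label
theta0 = lambda v: 1/(1+np.exp(-v))      # local copy of the sigmoid defined below
print(np.log(1+np.exp(-s0*y0)), -np.log(theta0(s0*y0)))  # the two values coincide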
Re-writing the energy to minimize as $$ E(w) = \Ll(X w,y) \qwhereq \Ll(s,y)= \frac{1}{n} \sum_i L(s_i,y_i), $$ its gradient reads $$ \nabla E(w) = X^\top \nabla \Ll(X w,y) \qwhereq \nabla \Ll(s,y) = -\frac{y}{n} \odot \th(-y \odot s), $$ where $\odot$ is the pointwise multiplication operator, i.e. * in Python.
Define the energies.
def L(s,y): return 1/n * sum( np.log( 1 + np.exp(-s*y) ) )
def E(w,X,y): return L(X.dot(w),y);
Define their gradients.
def theta(v): return 1 / (1+np.exp(-v))
def nablaL(s,y): return - 1/n * y * theta(-s * y)
def nablaE(w,X,y): return X.transpose().dot( nablaL(X.dot(w),y) )
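As an illustrative consistency check (the random test point and the finite-difference step below are arbitrary choices), one can compare nablaE with a centered finite-difference approximation of the gradient of E:
w0 = np.random.randn(p,1)               # random test point
g = nablaE(w0,X,y)
eps = 1e-6
g_fd = np.zeros((p,1))
for k in range(p):
    dw = np.zeros((p,1)); dw[k] = eps
    g_fd[k] = (E(w0+dw,X,y) - E(w0-dw,X,y)) / (2*eps)
print(np.max(np.abs(g-g_fd)))           # should be close to 0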
Important: in order to improve performance, it is important (especially in low dimension $p$) to add a constant bias term $w_{p+1} \in \RR$, and replace $\dotp{x_i}{w}$ by $ \dotp{x_i}{w} + w_{p+1} $. This is equivalently achieved by adding an extra $(p+1)^{\text{th}}$ dimension equal to 1 to each $x_i$, which we do using a convenient macro.
def AddBias(X): return np.hstack(( X, np.ones((np.size(X,0),1)) ))
With this added bias term, we first initialize $w_{\ell=0} \in \RR^{p+1}$ (for instance at $0_{p+1}$).
w = np.zeros((p+1,1))
One step of gradient descent then reads $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell). $$
tau = 1; # here we are using a fixed tau
w = w - tau * nablaE(w,AddBias(X),y)
If one chooses $$\tau < \tau_{\max} \eqdef \frac{2}{\frac{1}{4}\norm{X}^2},$$ then one is sure that the gradient descent converges.
tau_max = 2/(1/4 * np.linalg.norm(AddBias(X), 2)**2 )
print(tau_max)
0.0014965184333816531
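For illustration, here is a minimal sketch of the full descent loop with a fixed step below $\tau_{\max}$ (the iteration count and the factor 0.9 are arbitrary choices; Exercise 1 below asks you to implement and study this yourself):
Xb = AddBias(X)
w_sketch = np.zeros((p+1,1))
Elist = []
for it in range(2000):                  # arbitrary number of iterations
    Elist.append(E(w_sketch,Xb,y).item())
    w_sketch = w_sketch - .9*tau_max * nablaE(w_sketch,Xb,y)
plt.plot(Elist)
plt.xlabel('iteration'); plt.ylabel('E(w_l)');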
Exercise 1
Implement a gradient descent $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell). $$ Monitor the energy decay. Test different step sizes, and compare with the theory (in particular, plot in log domain to illustrate the linear rate).
run -i nt_solutions/ml_3_classification/exo1
print(w)
[[-2.41929334]
 [-1.0750284 ]
 [ 3.65553274]]
## Insert your code here.
Generate a 2D grid of points.
q = 201
tx = np.linspace( X[:,0].min(), X[:,0].max(),num=q)
ty = np.linspace( X[:,1].min(), X[:,1].max(),num=q)
[B,A] = np.meshgrid( ty,tx )
G = np.vstack([A.flatten(), B.flatten()]).transpose()
Evaluate the class probability associated with the weight vector $w$ on this grid.
Theta = theta(AddBias(G).dot(w))
Theta = Theta.reshape((q,q))
Display the data overlaid on top of the classification probability; this highlights the separating hyperplane $ \enscond{x}{\dotp{w}{x}=0} $.
plt.clf()
plt.imshow(Theta.transpose(), origin="lower", extent=[tx.min(),tx.max(),ty.min(),ty.max()])
plt.axis('equal')
plt.plot(X[I,0], X[I,1], '.')
plt.plot(X[J,0], X[J,1], '.')
plt.axis('off');
Exercise 2
Test the influence of the separation offset $\omega$ on the result.
run -i nt_solutions/ml_3_classification/exo2
## Insert your code here.
Exercise 3
Test logistic classification on a real-life dataset. You can look at the Numerical Tour on stochastic gradient descent for an example. Split the data into training and testing sets to evaluate the classification performance, and check the impact of regularization.
run -i nt_solutions/ml_3_classification/exo3
## Insert your code here.
Logistic classification tries to separate the classes using a linear separating hyperplane $ \enscond{x}{\dotp{w}{x}=0}. $
In order to generate a non-linear decision boundary, one can replace the parametric linear model by a non-linear non-parametric model, thanks to kernelization. It is non-parametric in the sense that the number of parameters grows with the number $n$ of samples (while for the basic method, the number of parameters is $p$). This in particular allows one to generate decision boundaries of arbitrary complexity.
The downside is that the numerical complexity of the method grows (at least) quadratically with $n$.
The good news however is that thanks to the theory of reproducing kernel Hilbert spaces (RKHS), one can still compute this non-linear decision function using (almost) the same numerical algorithm.
Given a kernel $ \kappa(x,z) \in \RR $ defined for $(x,z) \in \RR^p \times \RR^p$, the kernelized method replaces the linear decision functional $f(x) = \dotp{x}{w}$ by a sum of kernels centered on the samples $$ f_h(x) = \sum_{i=1}^n h_i \kappa(x_i,x) $$ where $h \in \RR^n$ is the unknown vector of weights to find.
When using the linear kernel $\kappa(x,y)=\dotp{x}{y}$, one retrieves the previously studied linear method.
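To illustrate this claim (a purely illustrative check; the random weight vector h0 below is arbitrary), with the linear kernel the Gram matrix is $K = XX^\top$, so the kernelized decision values $Kh$ on the samples are exactly those of the linear model with weights $w = X^\top h$:
h0 = np.random.randn(n,1)               # arbitrary kernel weights
K_lin = X.dot(X.transpose())            # linear-kernel Gram matrix
w_lin = X.transpose().dot(h0)           # equivalent linear weights
print(np.max(np.abs(K_lin.dot(h0) - X.dot(w_lin))))  # ~0 up to rounding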
Macro to compute pairwise squared Euclidean distance matrix.
# slow
def distmat1(X,Z):
    D = np.zeros((X.shape[0],Z.shape[0]))
    for i in np.arange(0,X.shape[0]):
        for j in np.arange(0,Z.shape[0]):
            D[i,j] = np.linalg.norm( X[i,:]-Z[j,:] )**2  # squared Euclidean distance
    return D
# fast
from scipy import spatial
def distmat(X,Z): return spatial.distance.cdist(X,Z)**2
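A quick illustrative check, on small random inputs, that the slow and fast implementations agree:
A0 = np.random.randn(5,2); B0 = np.random.randn(4,2)
print(np.max(np.abs(distmat1(A0,B0) - distmat(A0,B0))))  # ~0 up to rounding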
The Gaussian kernel is the most well-known and most widely used kernel $$ \kappa(x,y) \eqdef e^{-\frac{\norm{x-y}^2}{2\sigma^2}} . $$ The bandwidth parameter $\si>0$ is crucial and controls the locality of the model. It is typically tuned through cross-validation.
def kappa(X,Z,sigma): return np.exp( -distmat(X,Z)/(2*sigma**2) )
We generate synthetic data in 2-D which are not separable by a hyperplane.
n = 1000
p = 2
t = 2*np.pi*np.random.randn(n1,1)  # angles of the points on the ring
R = 2.5
r = R*(1.5 + .2*np.random.randn(n1,1)) # radius
X1 = np.hstack((np.cos(t)*r, np.sin(t)*r))
X = np.vstack((np.random.randn(n1,2), X1))
y = np.vstack(( np.ones([n1,1]), -np.ones([n1,1]) ))
Display the classes.
I = find(y==-1)
J = find(y==1)
plt.plot(X[I,0], X[I,1], '.')
plt.plot(X[J,0], X[J,1], '.')
plt.axis('equal')
plt.axis('off');
Once evaluated on the sample points, the kernel defines the matrix $$ K = (\kappa(x_i,x_j))_{i,j=1}^n \in \RR^{n \times n}. $$
sigma = 1;
K = kappa(X,X,sigma)
plt.imshow(K);
Valid kernels are those that give rise to symmetric positive semi-definite matrices $K$. The linear and Gaussian kernels are valid kernel functions. Other popular kernels include the polynomial kernel $ \dotp{x}{y}^a $ for an integer $a \geq 1$ and the Laplacian kernel $ \exp( -\norm{x-y}/\si ) $.
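For completeness, here are illustrative definitions of these two kernels (they are not used in the rest of this tour); note that the Laplacian kernel uses the non-squared distance, hence the square root applied to the squared-distance matrix:
def kappa_poly(X,Z,a): return X.dot(Z.transpose())**a                       # polynomial kernel <x,z>^a
def kappa_laplace(X,Z,sigma): return np.exp(-np.sqrt(distmat(X,Z))/sigma)   # Laplacian kernel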
The kernelized Logistic minimization reads $$ \umin{h} F(h) \eqdef \Ll(K h,y). $$
def F(h,K,y): return L(K.dot(h),y)
def nablaF(h,K,y): return K.transpose().dot( nablaL(K.dot(h),y) )
This minimization can be related to an infinite-dimensional optimization problem where one minimizes directly over the function $f$. This is shown to be equivalent to the above finite-dimensional optimization problem thanks to the theory of RKHS.
Exercise 4
Implement a gradient descent to minimize $F(h)$. Monitor the energy decay. Test different step sizes, and compare with the theory.
run -i nt_solutions/ml_3_classification/exo4
## Insert your code here.
Once this optimal $h$ has been found, the class probabilities at a point $x$ are obtained as $$ (\th(f_h(x)), 1-\th(f_h(x))) $$ where $f_h$ is the kernelized decision function defined above.
We evaluate this classification probability on a grid.
q = 201
tmax = 5
t = np.linspace(-tmax,tmax,num=q)
[B,A] = np.meshgrid( t,t )
G = np.vstack([A.flatten(), B.flatten()]).transpose()
K1 = kappa(G,X,sigma)
Theta = theta( K1.dot(h) )
Theta = Theta.reshape((q,q))
Display the classification probability.
plt.clf()
plt.imshow(Theta.transpose(), origin="lower", extent=[-tmax, tmax, -tmax, tmax])
plt.plot(X[I,0], X[I,1], '.')
plt.plot(X[J,0], X[J,1], '.')
plt.axis('equal')
plt.axis('off');