Important: Please read the installation page for details about how to install the toolboxes. $\newcommand{\dotp}[2]{\langle #1, #2 \rangle}$ $\newcommand{\enscond}[2]{\lbrace #1, #2 \rbrace}$ $\newcommand{\pd}[2]{ \frac{ \partial #1}{\partial #2} }$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\umax}[1]{\underset{#1}{\max}\;}$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\uargmin}[1]{\underset{#1}{argmin}\;}$ $\newcommand{\norm}[1]{\|#1\|}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\choice}[1]{ \left\{ \begin{array}{l} #1 \end{array} \right. }$ $\newcommand{\pa}[1]{\left(#1\right)}$ $\newcommand{\diag}[1]{{diag}\left( #1 \right)}$ $\newcommand{\qandq}{\quad\text{and}\quad}$ $\newcommand{\qwhereq}{\quad\text{where}\quad}$ $\newcommand{\qifq}{ \quad \text{if} \quad }$ $\newcommand{\qarrq}{ \quad \Longrightarrow \quad }$ $\newcommand{\ZZ}{\mathbb{Z}}$ $\newcommand{\CC}{\mathbb{C}}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\EE}{\mathbb{E}}$ $\newcommand{\Zz}{\mathcal{Z}}$ $\newcommand{\Ww}{\mathcal{W}}$ $\newcommand{\Vv}{\mathcal{V}}$ $\newcommand{\Nn}{\mathcal{N}}$ $\newcommand{\NN}{\mathcal{N}}$ $\newcommand{\Hh}{\mathcal{H}}$ $\newcommand{\Bb}{\mathcal{B}}$ $\newcommand{\Ee}{\mathcal{E}}$ $\newcommand{\Cc}{\mathcal{C}}$ $\newcommand{\Gg}{\mathcal{G}}$ $\newcommand{\Ss}{\mathcal{S}}$ $\newcommand{\Pp}{\mathcal{P}}$ $\newcommand{\Ff}{\mathcal{F}}$ $\newcommand{\Xx}{\mathcal{X}}$ $\newcommand{\Mm}{\mathcal{M}}$ $\newcommand{\Ii}{\mathcal{I}}$ $\newcommand{\Dd}{\mathcal{D}}$ $\newcommand{\Ll}{\mathcal{L}}$ $\newcommand{\Tt}{\mathcal{T}}$ $\newcommand{\si}{\sigma}$ $\newcommand{\al}{\alpha}$ $\newcommand{\la}{\lambda}$ $\newcommand{\ga}{\gamma}$ $\newcommand{\Ga}{\Gamma}$ $\newcommand{\La}{\Lambda}$ $\newcommand{\si}{\sigma}$ $\newcommand{\Si}{\Sigma}$ $\newcommand{\be}{\beta}$ $\newcommand{\de}{\delta}$ $\newcommand{\De}{\Delta}$ $\newcommand{\phi}{\varphi}$ $\newcommand{\th}{\theta}$ $\newcommand{\om}{\omega}$ $\newcommand{\Om}{\Omega}$ $\newcommand{\eqdef}{\equiv}$
This tour details the logistic classification method (for both the 2-class and multi-class settings).
Warning: Logistic classification is actually called "logistic regression" in the literature, but it is in fact a classification method.
We recommend that, after completing this Numerical Tour, you apply it to your own data, for instance using a dataset from LibSVM.
Disclaimer: these machine learning tours are intended to be overly-simplistic implementations and applications of baseline machine learning methods. For more advanced uses and implementations, we recommend using a state-of-the-art library, the most well-known being Scikit-Learn.
options(warn=-1) # turns off warnings, to turn on: "options(warn=0)"
library(plot3D)
library(pracma)
library(grid)
# Import the Numerical Tours toolboxes
for (f in list.files(path="nt_toolbox/toolbox_general/", pattern="*.R")) {
source(paste("nt_toolbox/toolbox_general/", f, sep=""))
}
for (f in list.files(path="nt_toolbox/toolbox_signal/", pattern="*.R")) {
source(paste("nt_toolbox/toolbox_signal/", f, sep=""))
}
We define a few helpers.
# Center the columns of X (subtract the column means).
Xm = function(X){as.matrix(X - rep(colMeans(X), rep.int(nrow(X), ncol(X))))}
# Empirical covariance matrix (uses the global sample size n).
Cov = function(X){data.matrix(1. / (n - 1) * t(Xm(X)) %*% Xm(X))}
Logistic classification is, together with the support vector machine (SVM), the baseline method to perform classification. Its main advantage over SVM is that it is a smooth minimization problem, and that it also outputs class probabilities, offering a probabilistic interpretation of the classification.
To understand the behavior of the method, we generate synthetic data distributed according to a mixture of Gaussians with an overlap governed by an offset $\omega$. Here the class indices are set to $y_i \in \{-1,1\}$ to simplify the equations.
n = 1000 # number of samples
p = 2 # dimensionality
omega = 1.5 * 2.5 # offset between the class means
n1 = n/2
X = rbind(randn(n1,2), randn(n1,2) + rep(1, n1) * omega)
y = c(rep(1, n1), rep(-1, n1))
Plot the classes.
options(repr.plot.width=5, repr.plot.height=5)
for (i in c(-1, 1))
{
I = (y==i)
plot(X[I,1], X[I,2], col=(i + 3), xlim=c(min(X[,1]), max(X[,1])),
ylim=c(min(X[,2]), max(X[,2])), xlab="", ylab="", pch=16)
par(new=TRUE)
}
cols = c(2,4)
legend("topright", legend=c(-1, 1), col=cols, pch=16)
Logistic classification minimizes a logistic loss in place of the usual $\ell^2$ loss used for regression $$ \umin{w} E(w) \eqdef \frac{1}{n} \sum_{i=1}^n L(\dotp{x_i}{w},y_i) $$ where the logistic loss reads $$ L( s,y ) \eqdef \log( 1+\exp(-sy) ). $$ This corresponds to a smooth convex minimization. If $X$ is injective, it is also strictly convex, hence it has a single global minimum.
Compare the binary (ideal) 0-1 loss, the logistic loss and the hinge loss (https://en.wikipedia.org/wiki/Hinge_loss), the one used for SVM.
options(repr.plot.width=7, repr.plot.height=6)
t = seq(-3, 3, length=255)
plot(t, log(1 + exp(t)), type="l", col=2)
#plot(t, t > 0)
lines(t, t > 0, col=3)
lines(t, pmax(t, 0), col=4)
legend("topleft", legend=c('Binary', 'Logistic', 'Hinge'), col=c(3,2,4), pch="-")
This can be interpreted as a maximum likelihood estimator (https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) when one models the probability of belonging to the two classes for sample $x_i$ as $$ h(x_i) \eqdef (\th(x_i),1-\th(x_i)) \qwhereq \th(s) \eqdef \frac{e^{s}}{1+e^s} = (1+e^{-s})^{-1} $$
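Indeed, $-\log \th(sy) = \log(1+e^{-sy}) = L(s,y)$, so minimizing $E$ maximizes the log-likelihood. Below is a quick numerical check of this identity (an optional, illustrative snippet; s_test and y_test are arbitrary values we introduce here, and theta_check simply duplicates the definition of $\th$ above).
theta_check = function(v){exp(v) / (1 + exp(v))}   # theta(s) = e^s / (1 + e^s), as defined above
s_test = 0.7; y_test = -1                          # arbitrary score and label (illustrative)
print(log(1 + exp(-s_test * y_test)))              # logistic loss L(s, y)
print(-log(theta_check(s_test * y_test)))          # negative log-likelihood, same value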
Re-writing the energy to minimize as $$ E(w) = \Ll(X w,y) \qwhereq \Ll(s,y)= \frac{1}{n} \sum_i L(s_i,y_i), $$ its gradient reads $$ \nabla E(w) = X^\top \nabla \Ll(X w,y) \qwhereq \nabla \Ll(s,y) = -\frac{y}{n} \odot \th(-y \odot s), $$ where $\odot$ is the pointwise multiplication operator, i.e. * in R.
Define the energies.
L = function(s,y){1/n * sum( log(1 + exp(-s * y)))}
E = function(w,X,y){L(X %*% w, y)}
Define their gradients.
theta = function(v){1 / (1 + exp(-v))}
nablaL = function(s, y){ - 1/n * y * theta(-s * y)}
nablaE = function(w,X,y){t(X) %*% nablaL(X %*% w,y)}
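As an optional sanity check (an illustrative snippet; the random test point w_test and the finite-difference step eps are our choices), one can compare nablaE with a centered finite-difference approximation of the gradient of E:
w_test = randn(p, 1)                 # random test point (no bias term here, illustrative)
g = nablaE(w_test, X, y)
eps = 1e-6
g_fd = rep(0, p)
for (k in 1:p)
{
    e = rep(0, p); e[k] = eps
    g_fd[k] = (E(w_test + e, X, y) - E(w_test - e, X, y)) / (2 * eps)
}
print(max(abs(g - g_fd)))            # should be very small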
Important: in order to improve performance, it is important (especially in low dimension $p$) to add a constant bias term $w_{p+1} \in \RR$, and replace $\dotp{x_i}{w}$ by $ \dotp{x_i}{w} + w_{p+1} $. This is equivalently achieved by adding an extra $(p+1)^{\text{th}}$ dimension equal to 1 to each $x_i$, which we do using a convenient macro.
AddBias = function(X){cbind(X, rep(1, dim(X)[1]))}
With this added bias term, once $w_{\ell=0} \in \RR^{p+1}$ is initialized (for instance at $0_{p+1}$),
w = rep(0, p + 1)
dim(w) = c(p+1, 1)
one step of gradient descent reads $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell). $$
tau = .8 # here we are using a fixed tau
w = w - tau * nablaE(w, AddBias(X), y)
If one chooses $$\tau < \tau_{\max} \eqdef \frac{2}{\frac{1}{4}\norm{X}^2},$$ then one is sure that the gradient descent converges.
tau_max = 2/(1/4 * base::norm(AddBias(X), "2")**2)
print(tau_max)
[1] 0.0005154158
Exercise 1
Implement a gradient descent $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell). $$ Monitor the energy decay. Test different step sizes, and compare with the theory (in particular plot in log domain to illustrate the linear rate).
source("nt_solutions/ml_3_classification/exo1.R")
## Insert your code here.
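For reference, here is a minimal sketch of such a descent loop (illustrative only, not the provided solution in exo1.R; the number of iterations and the step size, set to half the bound tau_max, are arbitrary choices):
Xb = AddBias(X)
w = rep(0, p + 1); dim(w) = c(p + 1, 1)
tau = .5 * tau_max                   # any step size below tau_max should converge
niter = 2000
Elist = rep(0, niter)
for (it in 1:niter)
{
    Elist[it] = E(w, Xb, y)
    w = w - tau * nablaE(w, Xb, y)
}
# Energy decay in log scale: a straight line indicates a linear convergence rate.
plot(log10(Elist - min(Elist) + 1e-16), type="l", xlab="iteration", ylab="log10(E - min(E))")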
Generate a 2D grid of points.
q = 201
tx = seq(min(X[,1]), max(X[,1]), length=q)
ty = seq(min(X[,2]),max(X[,2]), length=q)
B = as.vector(meshgrid(ty, tx)$X)
A = as.vector(meshgrid(ty, tx)$Y)
G = matrix(c(A, B), nrow=length(A), ncol=2)
Evaluate the class probabilities associated with the weight vector $w$ on this grid.
Theta = theta(AddBias(G) %*% w)
dim(Theta) = c(q, q)
Display the data overlaid on top of the classification probability; this highlights the separating hyperplane $ \enscond{x}{\dotp{w}{x}=0} $.
color = function(x){rev(cm.colors(x))}
image(tx,ty, Theta, xlab="", ylab="", col=color(10), xaxt="n", yaxt="n")
par(new=TRUE)
for (i in c(-1, 1))
{
I = (y==i)
plot(X[I,1], X[I,2], col=(i + 3), xlim=c(min(X[,1]), max(X[,1])),
ylim=c(min(X[,2]), max(X[,2])), xlab="", ylab="", pch=16, xaxt="n", yaxt="n")
par(new=TRUE)
}
cols = c(2,4)
legend("topright", legend=c(-1, 1), col=cols, pch=16)
Exercise 2
Test the influence of the separation offset $\omega$ on the result.
source("nt_solutions/ml_3_classification/exo2.R")
## Insert your code here.
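A rough way to explore this (an illustrative sketch only; the offset values, step size and iteration count are arbitrary, and we simply report the final training energy for each offset):
for (omega_test in c(1, 2.5, 5))
{
    Xo = rbind(randn(n1, 2), randn(n1, 2) + rep(1, n1) * omega_test)
    Xob = AddBias(Xo)
    wo = rep(0, p + 1); dim(wo) = c(p + 1, 1)
    tauo = .5 * 2 / (1/4 * base::norm(Xob, "2")**2)
    for (it in 1:2000){ wo = wo - tauo * nablaE(wo, Xob, y) }
    # The larger the offset, the easier the classification and the lower the final energy.
    print(c(omega_test, E(wo, Xob, y)))
}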
Exercise 3
Test logistic classification on a real-life dataset. You can look at the Numerical Tour on stochastic gradient descent for an example. Split the data into training and testing sets to evaluate the classification performance, and check the impact of regularization.
source("nt_solutions/ml_3_classification/exo3.R")
## Insert your code here.
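Here is a rough sketch of the train/test methodology applied to the synthetic data (illustrative only; a real-life dataset, for instance from LibSVM, would be loaded in its place, the 80/20 split is an arbitrary choice, and regularization is not included):
I = sample(n)                        # random permutation of the samples
ntr = round(.8 * n)                  # 80/20 split (arbitrary choice)
Xtr = AddBias(X[I[1:ntr],]);       ytr = y[I[1:ntr]]
Xte = AddBias(X[I[(ntr + 1):n],]); yte = y[I[(ntr + 1):n]]
# Gradient restricted to the training set (same formula as nablaE, with n replaced by ntr).
nablaEtr = function(w){ t(Xtr) %*% ( -1/ntr * ytr * theta(-(Xtr %*% w) * ytr) ) }
wt = rep(0, p + 1); dim(wt) = c(p + 1, 1)
taut = .5 * 2 / (1/4 * base::norm(Xtr, "2")**2)
for (it in 1:2000){ wt = wt - taut * nablaEtr(wt) }
print( mean(sign(Xte %*% wt) == yte) )   # test accuracy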
Logistic classification tries to separate the classes using a linear separating hyperplane $ \enscond{x}{\dotp{w}{x}=0}. $
In order to generate a non-linear decision boundary, one can replace the parametric linear model by a non-linear non-parametric model, thanks to kernelization. It is non-parametric in the sense that the number of parameters grows with the number $n$ of samples (while for the basic method, the number of parameters is $p$). This allows in particular to generate decision boundaries of arbitrary complexity.
The downside is that the numerical complexity of the method grows (at least) quadratically with $n$.
The good news however is that thanks to the theory of reproducing kernel Hilbert spaces (RKHS), one can still compute this non-linear decision function using (almost) the same numerical algorithm.
Given a kernel $ \kappa(x,z) \in \RR $ defined for $(x,z) \in \RR^p \times \RR^p$, the kernelized method replaces the linear decision functional $f(x) = \dotp{x}{w}$ by a sum of kernels centered at the samples $$ f_h(x) = \sum_{i=1}^n h_i \kappa(x_i,x) $$ where $h \in \RR^n$ is the unknown vector of weights to find.
When using the linear kernel $\kappa(x,y)=\dotp{x}{y}$, one retrieves the previously studied linear method.
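To make this concrete, here is a small illustrative check (kappa_lin, h_test and x_new are names we introduce here, not part of the tour): with the linear kernel, $f_h(x) = \sum_i h_i \dotp{x_i}{x} = \dotp{X^\top h}{x}$, i.e. one recovers a linear model with weights $w = X^\top h$.
kappa_lin = function(X, Z){ X %*% t(Z) }    # linear kernel matrix (illustrative helper)
h_test = randn(n, 1)                        # arbitrary kernel weights h
x_new = randn(1, p)                         # a new test point
print( kappa_lin(x_new, X) %*% h_test )     # kernel form: sum_i h_i <x_i, x>
print( x_new %*% (t(X) %*% h_test) )        # equivalent linear form with w = X^T h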
Macro to compute pairwise squared Euclidean distance matrix.
distmat = function(X,Z)
{
    # Squared norms of the rows of X and Z.
    dist1 = diag(X %*% t(X))
    dist2 = diag(Z %*% t(Z))
    # ||x_i - z_j||^2 = ||x_i||^2 + ||z_j||^2 - 2 <x_i, z_j>, computed without explicit loops.
    out = outer(dist1, dist2, "+") - 2 * X %*% t(Z)
    return(out)
}
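As a quick sanity check (optional, with arbitrary sample indices), an entry of distmat(X,X) should equal the corresponding squared Euclidean distance:
D = distmat(X[1:5,], X[1:5,])          # distances between the first 5 samples (illustrative)
print( D[2, 4] )
print( sum((X[2,] - X[4,])^2) )        # same value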
The Gaussian kernel is the most well-known and widely used kernel $$ \kappa(x,y) \eqdef e^{-\frac{\norm{x-y}^2}{2\sigma^2}} . $$ The bandwidth parameter $\si>0$ is crucial and controls the locality of the model. It is typically tuned through cross-validation.
kappa = function(X,Z,sigma){exp( -distmat(X,Z)/(2*sigma^2))}
We generate synthetic data in 2-D which are not separable by a hyperplane.
n = 1000
p = 2
t = 2 * pi * rand(n/2,1)
R = 2.5
r = R * (1 + .2 * rand(n/2,1)) # radius
X1 = cbind(cos(t) * r, sin(t) * r)
X = rbind(randn(n/2, 2), X1)
y = c(rep(1, n/2), rep(-1, n/2))
Display the classes.
options(repr.plot.width=5, repr.plot.height=5)
for (i in c(-1, 1))
{
I = (y==i)
plot(X[I,1], X[I,2], col=(i + 3), xlim=c(min(X[,1]), max(X[,1])),
ylim=c(min(X[,2]), max(X[,2])), xlab="", ylab="", pch=16)
par(new=TRUE)
}
cols = c(2,4)
legend("topright", legend=c(-1, 1), col=cols, pch=16)
Once evaluated at the sample points, the kernel defines a matrix $$ K = (\kappa(x_i,x_j))_{i,j=1}^n \in \RR^{n \times n}. $$
sigma = 1
K = kappa(X, X, sigma)
image(K, col=color(10), ylim=c(1, 0))