#!/usr/bin/env python
# coding: utf-8
# In[2]:
from IPython.display import Image
Image('../../../python_for_probability_statistics_and_machine_learning.jpg')
# # Support Vector Machines
#
# Support Vector Machines (SVM) originated from the statistical learning theory
# developed by Vapnik-Chervonenkis. As such, it represents a deep application of
# statistical theory that incorporates the VC dimension concepts we
# discussed in the first section. Let's start by looking at some pictures.
# Consider the two-dimensional classification problem shown in
# [Figure](#fig:svm_001). [Figure](#fig:svm_001) shows two classes (gray and
# white
# circles) that can be separated by any of the lines shown. Specifically, any
# such separating line can be written as the locus of points ($\mathbf{x}$) in
# the two-dimensional plane that satisfy the following,
#
# **Figure [fig:svm_001]:** In the two-dimensional plane, the two classes (gray
# and white circles) are easily separated by any one of the lines shown.
#
# $$
# \beta_0 + \boldsymbol{\beta}^T \mathbf{x} = 0
# $$
#
# To classify an arbitrary $\mathbf{x}$ using this line, we just
# compute the sign of $\beta_0+\boldsymbol{\beta}^T \mathbf{x}$ and assign one
# class to the positive sign and the other class to the negative sign. To
# uniquely specify such a separating line (or hyperplane, in a higher-dimensional
# space), we need additional criteria.
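#
# As a quick illustration of this decision rule, the following is a minimal
# sketch that classifies a point by the sign of the affine function above. The
# values of $\beta_0$ and $\boldsymbol{\beta}$ here are made-up illustrative
# assumptions, not fitted parameters,
# In[ ]:
import numpy as np
beta0, beta = 1.0, np.array([2.0, -1.0])  # hypothetical line parameters
x_new = np.array([0.5, 0.25])             # arbitrary point to classify
print(np.sign(beta0 + beta @ x_new))      # +1 for one class, -1 for the other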
#
#
# [Figure](#fig:svm_002) shows the data with two bordering parallel lines that
# form a margin around the central separating line. The *maximal margin
# algorithm* finds the widest margin and the unique separating line. As a
# consequence, the algorithm uncovers the elements in the data that touch the
# margins. These are the *support* elements. The other elements
# away from the border are not relevant to the solution. This reduces
# model variance because the solution is insensitive to the removal of
# elements other than these supporting elements (usually a small minority).
#
# **Figure [fig:svm_002]:** The maximal margin algorithm finds the separating
# line that maximizes the margin shown. The elements that touch the margins are
# the support elements. The dotted elements are not relevant to the solution.
#
#
# To see how this works for linearly separable classes, consider a
# training set consisting of $\lbrace (\mathbf{x},y) \rbrace$ where
# $y\in \lbrace -1,1 \rbrace$. For any point $\mathbf{x}_i$, we
# compute the functional margin as $\hat{\gamma}_i=y_i (\beta_0 +
# \boldsymbol{\beta}^T \mathbf{x}_i)$. Thus, $\hat{\gamma}_i >0$ when
# $\mathbf{x}_i$ is correctly classified. The geometrical margin is
# $\gamma = \hat{\gamma}/\lVert\boldsymbol{\beta}\rVert$. When
# $\mathbf{x}_i$ is correctly classified, the geometrical margin is
# equal to the perpendicular distance from $\mathbf{x}_i$ to the line.
# Let's see how the maximal margin algorithm works.
#
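# To make these margin definitions concrete, the following minimal sketch
# evaluates the functional and geometrical margins for a hypothetical line and
# two labeled points (all of the numbers below are illustrative assumptions),
# In[ ]:
import numpy as np
beta0, beta = -0.5, np.array([1.0, 1.0])        # hypothetical separating line
X_demo = np.array([[1.0, 1.0], [0.0, 0.0]])     # two sample points
y_demo = np.array([1, -1])                      # their labels in {-1, +1}
func_margin = y_demo*(beta0 + X_demo @ beta)    # functional margins, > 0 when correct
geom_margin = func_margin/np.linalg.norm(beta)  # geometrical margins (distances)
print(func_margin, geom_margin)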
# Let $M$ be the width of the margin. The maximal margin algorithm can be
# formulated as a quadratic programming problem. We want to simultaneously
# maximize the margin $M$ while ensuring that all of the data points are
# correctly classified.
#
# $$
# \begin{aligned}
# & \underset{\beta_0,\boldsymbol{\beta},\lVert\boldsymbol{\beta}\rVert=1}{\text{maximize}}
# & & M \\\
# & \text{subject to:}
# & & y_i(\beta_0+\boldsymbol{\beta}^T \mathbf{x}_i) \geq M, \; i = 1, \ldots, N.
# \end{aligned}
# $$
#
# The first line says we want to generate a maximum value for $M$ by
# adjusting $\beta_0$ and $\boldsymbol{\beta}$ while keeping
# $\lVert\boldsymbol{\beta}\rVert=1$. The functional margin for each $i^{th}$
# data element appears as a constraint of the problem and must be satisfied for
# every proposed solution. In words, the constraints enforce that the elements
# be correctly classified and lie outside of the margin around the separating line.
# With some reformulation, it turns out that
# $M=1/\lVert\boldsymbol{\beta}\rVert$ and this can be put into the following
# standard format,
#
# $$
# \begin{aligned}
# & \underset{\beta_0,\boldsymbol{\beta}}{\text{minimize}}
# & & \lVert\boldsymbol{\beta}\rVert \\\
# & \text{subject to:}
# & & y_i(\beta_0+\boldsymbol{\beta}^T \mathbf{x}_i) \geq 1, \; i = 1, \ldots, N.
# \end{aligned}
# $$
#
# This is a convex optimization problem and can be solved using powerful
# methods in that area.
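#
# As a brief aside, this standard-form problem can be handed directly to a
# general-purpose convex solver. The following is a minimal sketch using the
# third-party `cvxpy` package (an assumption; it is not needed elsewhere here)
# on a tiny hand-made separable dataset,
# In[ ]:
import numpy as np
import cvxpy as cp
X_toy = np.array([[1.,1.],[2.,2.],[-1.,-1.],[-2.,-1.]])  # two separable groups
y_toy = np.array([1.,1.,-1.,-1.])                        # labels in {-1, +1}
beta = cp.Variable(2)   # normal vector of the separating line
beta0 = cp.Variable()   # intercept
constraints = [cp.multiply(y_toy, X_toy @ beta + beta0) >= 1]
prob = cp.Problem(cp.Minimize(cp.norm(beta, 2)), constraints)
_ = prob.solve()
print(beta.value, beta0.value)  # parameters of the maximal margin line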
#
# The situation becomes more complex when the two classes are not separable and
# we have to allow some unavoidable mixing between the two classes in the
# solution. This means that the constraints have to be modified as in the following,
#
# $$
# y_i(\beta_0+\boldsymbol{\beta}^T \mathbf{x}_i) \geq M(1-\xi_i)
# $$
#
# where the $\xi_i$ are the slack variables and represent the
# proportional amount that the prediction is on the wrong side of the margin. Thus,
# elements are misclassified when $\xi_i>1$. With these additional variables,
# we have a more general formulation of the convex optimization problem,
#
# $$
# \begin{aligned}
# & \underset{\beta_0,\boldsymbol{\beta}}{\text{minimize}}
# & & \lVert\boldsymbol{\beta}\rVert \\\
# & \text{subject to:}
# & & y_i(\beta_0+\boldsymbol{\beta}^T \mathbf{x}_i) \geq 1-\xi_i, \\\
# & & & \xi_i \geq 0, \sum \xi_i \leq \texttt{constant}, \; i = 1, \ldots, N.
# \end{aligned}
# $$
#
# which can be rewritten in the following equivalent form,
#
#
#
#
# $$
# \begin{equation}
# \begin{aligned}
# & \underset{\beta_0,\boldsymbol{\beta}}{\text{minimize}}
# & & \frac{1}{2}\lVert\boldsymbol{\beta}\rVert^2 + C \sum \xi_i \\\
# & \text{subject to:}
# & & y_i(\beta_0+\boldsymbol{\beta}^T \mathbf{x}_i) \geq 1-\xi_i,\; \xi_i \geq 0, \;
# i = 1, \ldots, N.
# \end{aligned}
# \end{equation}
# \label{eq:svm} \tag{1}
# $$
#
# Because the $\xi_i$ terms are all positive, the objective
# is to maximize the margin (i.e., minimize $\lVert\boldsymbol{\beta}\rVert$)
# while minimizing the proportional drift of the predictions to the wrong side
# of the margin (i.e., $C \sum \xi_i$). Thus, large values of $C$ shunt
# algorithmic focus towards the correctly classified points near the
# decision boundary, whereas small values spread that focus onto points
# further from the decision boundary. The value $C$ is
# a hyperparameter for the SVM.
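#
# At the optimum, each slack variable is just the amount by which the
# corresponding functional margin falls short of one, i.e.,
# $\xi_i=\max(0,1-y_i(\beta_0+\boldsymbol{\beta}^T \mathbf{x}_i))$. The
# following minimal sketch evaluates this for a few points against a
# hypothetical line (all numbers below are illustrative assumptions),
# In[ ]:
import numpy as np
beta0, beta = 0.0, np.array([1.0, 1.0])                # hypothetical line
X_demo = np.array([[2., 2.], [0.3, 0.3], [-1., -1.]])  # three sample points
y_demo = np.array([1, 1, 1])                           # all labeled +1 here
xi_slack = np.maximum(0, 1 - y_demo*(beta0 + X_demo @ beta))
print(xi_slack)  # 0: outside margin, (0,1): inside margin, >1: misclassified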
#
# The good news is that all of these complicated pieces are handled neatly inside
# of Scikit-learn. The following sets up the linear *kernel* for the SVM (more on
# kernels soon),
# In[3]:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
sv = SVC(kernel='linear')
# We can create some synthetic data using `make_blobs` and then
# fit it to the SVM,
# In[4]:
X,y = make_blobs(n_samples=200, centers=2, n_features=2,
                 random_state=0, cluster_std=.5)
sv.fit(X,y)
# After fitting, the SVM now has the estimated support vectors and the
# estimated coefficients $\boldsymbol{\beta}$ in the `sv.support_vectors_` and
# `sv.coef_` attributes, respectively. [Figure](#fig:svm_003) shows the two
# sample classes (white and gray circles) and the line separating them that was
# found by the maximal margin algorithm. The two parallel dotted lines show the
# margin. The large circles enclose the support vectors, which are the data
# elements that are relevant to the solution. Notice that only these elements
# can touch the edges of the margins.
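# The following briefly inspects these fitted attributes,
# In[ ]:
print(sv.support_vectors_)        # data elements that touch the margin
print(sv.coef_, sv.intercept_)    # estimated beta and beta_0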
# In[5]:
get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.pylab import subplots
import numpy as np
xi = np.linspace(X[:,0].min(),X[:,0].max(),100)
fig,ax=subplots()
_=ax.scatter(X[:,0],X[:,1],c=y,s=50,cmap='gray',marker='o',alpha=.3)
_=ax.plot(sv.support_vectors_[:,0],sv.support_vectors_[:,1],'ko',markersize=20,alpha=.2)
_=ax.plot(xi,-sv.coef_[0,0]/sv.coef_[0,1]*xi- sv.intercept_/sv.coef_[0,1],'k',lw=3.)
# margin edges are where beta_0 + beta^T x = +/- 1, i.e. at perpendicular
# distance 1/||beta|| from the separating line
_=ax.plot(xi,-sv.coef_[0,0]/sv.coef_[0,1]*xi-(sv.intercept_+1.)/sv.coef_[0,1],'--k',lw=3.)
_=ax.plot(xi,-sv.coef_[0,0]/sv.coef_[0,1]*xi-(sv.intercept_-1.)/sv.coef_[0,1],'--k',lw=3.)
#
# **Figure [fig:svm_003]:** The two classes shown (white and gray circles) are
# linearly separable. The maximal margin solution is shown by the dark black
# line in the middle. The dotted lines show the extent of the margin. The large
# circles indicate the support vectors for the maximal margin solution.
#
# In[6]:
def draw_margins(sv,X,y,ax=None):
    # fit the SVM and draw the data, the separating line, the margin
    # edges, and the support vectors on the given axes
    sv.fit(X,y)
    xi = np.linspace(X[:,0].min(),X[:,0].max(),100)
    if ax is None: fig,ax=subplots()
    _=ax.scatter(X[:,0],X[:,1],c=y,s=50,cmap='gray',marker='o',alpha=.3)
    _=ax.plot(sv.support_vectors_[:,0],sv.support_vectors_[:,1],'ko',markersize=20,alpha=.2)
    _=ax.plot(xi,-sv.coef_[0,0]/sv.coef_[0,1]*xi- sv.intercept_/sv.coef_[0,1],'k',lw=3.)
    # margin edges are where beta_0 + beta^T x = +/- 1
    _=ax.plot(xi,-sv.coef_[0,0]/sv.coef_[0,1]*xi- (sv.intercept_+1.)/sv.coef_[0,1],'--k',lw=3.)
    _=ax.plot(xi,-sv.coef_[0,0]/sv.coef_[0,1]*xi- (sv.intercept_-1.)/sv.coef_[0,1],'--k',lw=3.)
# In[7]:
X, y = make_blobs(n_samples=50, centers=2, n_features=2,
                  cluster_std=1, random_state=0)
fig,axs = subplots(2,2,sharex=True,sharey=True)
#fig.set_size_inches((12,6))
sv = SVC(kernel='linear',C=.0100)
draw_margins(sv,X,y,ax=axs[0,0])
_=axs[0,0].set_title('C=0.01')
sv = SVC(kernel='linear',C=1)
draw_margins(sv,X,y,ax=axs[0,1])
_=axs[0,1].set_title('C=1')
sv = SVC(kernel='linear',C=100)
draw_margins(sv,X,y,ax=axs[1,0])
_=axs[1,0].set_title('C=100')
sv = SVC(kernel='linear',C=10000)
draw_margins(sv,X,y,ax=axs[1,1])
_=axs[1,1].set_title('C=10000')
# [Figure](#fig:svm_004) shows what happens when the value of $C$ changes.
# Increasing this value emphasizes the $\xi$ part of the objective function in
# Equation [eq:svm](#eq:svm). As shown in the top left panel, a small value for
# $C$ means that the algorithm is willing to accept many support vectors (and
# margin violations) in exchange for a wider margin. That is, the proportional
# amount that predictions are on the wrong side of the margin is more acceptable
# with smaller $C$. As the value of $C$ increases, there are fewer support
# vectors because the optimization penalizes margin violations more heavily and
# therefore tolerates fewer elements that encroach into the margin. Note that as
# the value of $C$ progresses through this figure, the separating line tilts
# slightly.
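#
# To quantify this effect, the following short sketch refits the linear SVM for
# the same values of $C$ and reports the number of support vectors in each case
# (the exact counts depend on the data generated above),
# In[ ]:
for C in (0.01, 1, 100, 10000):
    sv_c = SVC(kernel='linear', C=C).fit(X, y)
    print(C, len(sv_c.support_))  # typically fewer support vectors as C grows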
#
# **Figure [fig:svm_004]:** The maximal margin solutions for the same data with
# increasing values of the hyperparameter $C$ (one panel per value of $C$). The
# large circles indicate the support vectors for each solution.
#
#
# ## Kernel Tricks
#
# Support Vector Machines provide a powerful method to deal with linear
# separations, but they can also apply to non-linear boundaries by
# exploiting the so-called *kernel trick*. The convex optimization
# formulation of the SVM includes a *dual* formulation that leads to a
# solution that requires only the inner-products of the features. The
# kernel trick is to substitute inner-products by nonlinear kernel
# functions. This can be thought of as mapping the original features
# onto a possibly infinite dimensional space of new features. That is,
# if the data are not linearly separable in two-dimensional space (for
# example) maybe they are separable in three-dimensional space (or
# higher)?
#
# To make this concrete, suppose the original input space is
# $\mathbb{R}^n$ and we want to use a non-linear mapping
# $\psi:\mathbb{R}^n \rightarrow \mathcal{F}$ where $\mathcal{F}$ is an
# inner-product space of higher dimension. The kernel trick is to
# calculate the inner-product in $\mathcal{F}$ using a kernel
# function, $K(\mathbf{x}_i,\mathbf{x}_j) = \langle
# \psi(\mathbf{x}_i),\psi(\mathbf{x}_j)\rangle$. The long way to
# compute this is to first compute $\psi(\mathbf{x})$ and then do the
# inner-product. The kernel-trick way to do it is to use the kernel
# function and avoid computing $\psi$. In other words, the kernel
# function returns what the inner-product in $\mathcal{F}$ would have
# returned if $\psi$ had been applied. For example, to achieve an
# $n^{th}$ polynomial mapping of the input space, we can use
# $\kappa(\mathbf{x}_i,\mathbf{x}_j)=(\mathbf{x}_i^T\mathbf{x}_j+\theta)^n$.
# As a concrete case, suppose the input space is $\mathbb{R}^2$ and
# $\mathcal{F}=\mathbb{R}^4$ and we have the following mapping,
#
# $$
# \psi(\mathbf{x}) : (x_0,x_1) \mapsto (x_0^2,x_1^2,x_0 x_1, x_1 x_0)
# $$
#
# The inner product in $\mathcal{F}$ is then,
#
# $$
# \langle \psi(\mathbf{x}),\psi(\mathbf{y}) \rangle = \langle
# \mathbf{x},\mathbf{y} \rangle^2
# $$
#
# In other words, the kernel is the square of the inner
# product in input space. The advantage of using the kernel instead of
# simply enlarging the feature space is computational because you only
# need to compute the kernel on all distinct pairs of the input space.
# The following example should help make this concrete. First we create
# some Sympy variables,
# In[8]:
import sympy as S
x0,x1=S.symbols('x:2',real=True)
y0,y1=S.symbols('y:2',real=True)
# Next, we create the $\psi$ function that maps into $\mathbb{R}^4$
# and the corresponding kernel function,
# In[9]:
psi = lambda x,y: (x**2,y**2,x*y,x*y)     # explicit feature map of (x0,x1) into R^4
kern = lambda x,y: S.Matrix(x).dot(y)**2  # kernel: squared inner product in R^2
# Notice that the inner product in $\mathbb{R}^4$ is
# equal to the kernel function, which only uses the $\mathbb{R}^2$
# variables.
# In[11]:
print(S.Matrix(psi(x0,x1)).dot(psi(y0,y1)))
print(S.expand(kern((x0,x1),(y0,y1)))) # same as above
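# The same identity can also be spot-checked numerically for arbitrary vectors
# (the particular numbers below are only illustrative),
# In[ ]:
import numpy as np
u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(psi(*u), psi(*v)), np.dot(u, v)**2)  # both give the same value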
# **Polynomial Regression Using Kernels.** Recall our favorite
# linear regression problem from the regularization chapter,
#
# $$
# \min_{\boldsymbol{\beta}} \Vert y - \mathbf{X}\boldsymbol{\beta}\Vert^2
# $$
#
# where $\mathbf{X}$ is an $n\times m$ matrix with $m>n$. As
# we discussed, there are multiple solutions to this problem. The
# least-squares solution is the following:
#
# $$
# \boldsymbol{\beta}_{LS}=\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}
# $$
#
# Given a new feature vector $\mathbf{x}$, the corresponding estimator
# for $\mathbf{y}$ is the following,
#
# $$
# \hat{\mathbf{y}} = \mathbf{x}^T\boldsymbol{\beta}_{LS}=\mathbf{x}^T\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}
# $$
#
# Using the kernel trick, the solution can be written more generally as
# the following,
#
# $$
# \hat{\mathbf{y}}=\mathbf{k}(\mathbf{x})^T\mathbf{K}^{-1}\mathbf{y}
# $$
#
# where the $n\times n$ kernel matrix $\mathbf{K}$ replaces
# $\mathbf{X}\mathbf{X}^T$ and where $\mathbf{k}(\mathbf{x})$ is an $n$-vector of
# components $\mathbf{k}(\mathbf{x})=[\kappa(\mathbf{x}_i,\mathbf{x})]$ and where
# $\mathbf{K}_{i,j}=\kappa(\mathbf{x}_i,\mathbf{x}_j)$ for the kernel function
# $\kappa$. With this more general setup, we can substitute
# $\kappa(\mathbf{x}_i,\mathbf{x}_j)=(\mathbf{x}_i^T\mathbf{x}_j+\theta)^n$ for
# $n^{th}$-order polynomial regression [[bauckhagenumpy]](#bauckhagenumpy). Note
# that ridge
# regression can also be incorporated by inverting $(\mathbf{K}+\alpha
# \mathbf{I})$, which can help stabilize poorly conditioned $\mathbf{K}$ matrices
# with a tunable $\alpha$ hyperparameter [[bauckhagenumpy]](#bauckhagenumpy).
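#
# The following is a minimal numpy sketch of this kernel regression using the
# polynomial kernel above. The choices $\theta=1$, $n=2$, the small ridge term
# $\alpha$, and the toy one-dimensional data are all illustrative assumptions,
# In[ ]:
import numpy as np
np.random.seed(123)                                    # reproducible toy example
Xr = np.random.uniform(-1, 1, size=(20, 1))            # training inputs
yr = Xr[:,0]**2 + 0.05*np.random.randn(20)             # noisy quadratic target
kappa = lambda a, b: (a @ b + 1.0)**2                  # polynomial kernel, theta=1, n=2
K = np.array([[kappa(xi, xj) for xj in Xr] for xi in Xr])
alpha = 1e-3                                           # small ridge term for stability
coef = np.linalg.solve(K + alpha*np.eye(len(Xr)), yr)  # (K + alpha I)^{-1} y
x_new = np.array([0.5])
k_new = np.array([kappa(xi, x_new) for xi in Xr])      # k(x) vector
print(k_new @ coef)                                    # prediction, roughly 0.25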
#
# For some kernels, the enlarged $\mathcal{F}$ space is infinite-dimensional.
# Mercer's conditions provide technical restrictions on the kernel functions.
# Powerful, well-studied kernels have been implemented in Scikit-learn. The
# advantage of kernel functions may evaporate when $n\rightarrow m$, in which
# case using the $\psi$ functions directly can be more practicable.
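#
# As a quick illustration of a nonlinear kernel in Scikit-learn, the following
# sketch fits an SVM with the radial basis function kernel to data that is not
# linearly separable. The `make_circles` dataset and the variable names are
# chosen only for this demonstration,
# In[ ]:
from sklearn.datasets import make_circles
Xc, yc = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
sv_rbf = SVC(kernel='rbf')    # nonlinear (radial basis function) kernel
_ = sv_rbf.fit(Xc, yc)
print(sv_rbf.score(Xc, yc))   # training accuracy; near one for this data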
#
# In[12]:
from matplotlib.pylab import cm
# plot the decision regions of the linear SVM `sv` (C=10000, fitted above)
# over a grid covering the blob data
xi = np.linspace(X[:,0].min(),X[:,0].max(),100)
yi = np.linspace(X[:,1].min(),X[:,1].max(),100)
fig,ax=subplots()
_=ax.scatter(X[:,0],X[:,1],c=y,s=50,cmap='gray',marker='o',alpha=.3)
Xi,Yi = np.meshgrid(xi,yi)
Zi=sv.predict(np.c_[Xi.ravel(),Yi.ravel()]).reshape(Xi.shape)
_=ax.contourf(Xi,Yi,Zi,cmap=cm.Paired,alpha=0.2)