#!/usr/bin/env python
# coding: utf-8

# # In-Depth: Support Vector Machines

# *This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
#
# *The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# < [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) | [Contents](Index.ipynb) | [In-Depth: Decision Trees and Random Forests](05.08-Random-Forests.ipynb) >

# # Support vector machines (SVMs)
# - a particularly powerful and flexible class of supervised algorithms for both classification and regression.
#
# In this section, we will
# - develop the intuition behind support vector machines, and
# - learn about their use in classification problems.
#
# We begin with the standard imports:

# In[1]:

get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
import seaborn as sns; sns.set()

# ## Motivating Support Vector Machines
#
# - **Generative classification** (see [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb))
#     - we learned a simple model describing the distribution of each underlying class, and
#     - used these generative models to probabilistically determine labels for new points.
# - **Discriminative classification**
#     - rather than modeling each class, we simply find a line or curve (in two dimensions) or manifold (in multiple dimensions) that divides the classes from each other.
#
# As an example of this, consider the simple case of a classification task in which the two classes of points are well separated:

# In[2]:

from sklearn.datasets import make_blobs  # the old sklearn.datasets.samples_generator module has been removed

X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');

# A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification.
# - For two-dimensional data like that shown here, this is a task we could do by hand.
# - However, there is more than one possible dividing line that can perfectly discriminate between the two classes!
#
# We can draw three of them as follows:

# In[3]:

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plt.plot([0.6], [2.1], 'x', color='red', markeredgewidth=2, markersize=10)

for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
    plt.plot(xfit, m * xfit + b, '-k')

plt.xlim(-1, 3.5);

# These are three *very* different separators which, nevertheless, perfectly discriminate between these samples.
# - Depending on which you choose, a new data point (e.g., the one marked by the "X" in this plot) will be assigned a different label, as the quick check below confirms!
# - Evidently our simple intuition of "drawing a line between classes" is not enough, and we need to think a bit deeper.
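# As a quick check of this claim (a sketch, not part of the original text), we can evaluate which side of each candidate line the new point falls on; the third line places it on the opposite side from the other two, so it would receive a different label:

# In[ ]:

x_new, y_new = 0.6, 2.1  # the point marked with the red "X" above
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
    side = np.sign(y_new - (m * x_new + b))  # +1 if the point lies above the line, -1 if below
    print('line with slope {:+.1f}: point lies on side {:+.0f}'.format(m, side))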
# ## Support Vector Machines: Maximizing the *Margin*
#
# The intuition is this:
# - rather than simply drawing a zero-width line between the classes, we can draw around each line a *margin* of some width, up to the nearest point.
#
# Here is an example of how this might look:

# In[4]:

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')

for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)

plt.xlim(-1, 3.5);

# The line that maximizes this margin is the one we will choose as the optimal model.
# - Support vector machines are an example of such a *maximum margin* estimator.

# ### Fitting a support vector machine
#
# We will use Scikit-Learn's support vector classifier to train an SVM model on this data.
# - For the time being, we will use a linear kernel and
# - set the ``C`` parameter to a very large number
#   - we'll discuss the meaning of these in more depth momentarily.

# In[3]:

from sklearn.svm import SVC  # "Support vector classifier"

model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

# To better visualize what's happening here, let's create a quick convenience function that will plot SVM decision boundaries for us:

# In[4]:

def plot_svc_decision_function(model, ax=None, plot_support=True):
    """Plot the decision function for a 2D SVC"""
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)

    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])

    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none', edgecolor='black')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

# In[5]:

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model);

# This is the dividing line that maximizes the margin between the two sets of points.
#
# Notice that
# - a few of the training points just touch the margin:
#   - they are indicated by the black circles.
# - These points are the pivotal elements of this fit; they are known as the **support vectors**, and they give the algorithm its name.
#
# In Scikit-Learn, the identity of these points is stored in the ``support_vectors_`` attribute of the classifier:

# In[6]:

model.support_vectors_

# A key to this classifier's success is that for the fit, **only the positions of the support vectors matter**;
# - any points further from the margin which are on the correct side do not modify the fit!
# - Technically, this is because these points do not contribute to the loss function used to fit the model,
#   - so their position and number do not matter so long as they do not cross the margin (see the check below).
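# As a quick check of this property (a sketch, not part of the original text; it uses the classifier's ``support_`` attribute, which holds the indices of the support vectors), refitting the SVM on the support vectors alone should recover essentially the same line:

# In[ ]:

sv_model = SVC(kernel='linear', C=1E10)
sv_model.fit(model.support_vectors_, y[model.support_])  # train on the support vectors only

# the two fits should agree up to the solver's numerical tolerance
print(model.coef_, model.intercept_)
print(sv_model.coef_, sv_model.intercept_)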
# We can see this, for example, if we plot the model learned from the first 60 points and the first 120 points of this dataset:

# In[7]:

def plot_svm(N=10, ax=None):
    X, y = make_blobs(n_samples=400, centers=2,
                      random_state=0, cluster_std=0.60)
    X = X[:N]
    y = y[:N]
    model = SVC(kernel='linear', C=1E10)
    model.fit(X, y)

    ax = ax or plt.gca()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    ax.set_xlim(-1, 4)
    ax.set_ylim(-1, 6)
    plot_svc_decision_function(model, ax)

# In[8]:

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for axi, N in zip(ax, [60, 120]):
    plot_svm(N, axi)
    axi.set_title('N = {0}'.format(N))

# In the right panel, we have doubled the number of training points, but the model has not changed:
# - the three support vectors from the left panel are still the support vectors in the right panel.
#
# This insensitivity to the exact behavior of distant points is one of the strengths of the SVM model.
#
# If you are running this notebook live, you can use IPython's interactive widgets to view this feature of the SVM model interactively:

# In[9]:

from ipywidgets import interact, fixed
interact(plot_svm, N=[10, 400], ax=fixed(None));

# ### Beyond linear boundaries: Kernel SVM
#
# Where SVM becomes extremely powerful is when it is combined with *kernels*.
#
# We have seen a version of kernels before, in the basis function regressions of [In Depth: Linear Regression](05.06-Linear-Regression.ipynb).
# - There we projected our data into a higher-dimensional space defined by polynomial and Gaussian basis functions,
# - and thereby were able to fit for nonlinear relationships with a linear classifier.
#
# To motivate the need for kernels, let's look at some data that is not linearly separable:

# In[13]:

from sklearn.datasets import make_circles  # the old sklearn.datasets.samples_generator module has been removed

X, y = make_circles(100, factor=.1, noise=.1)

clf = SVC(kernel='linear').fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf, plot_support=False);

# No linear discrimination will *ever* be able to separate this data.
#
# - But we can draw a lesson from the basis function regressions in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb), and
# - think about how we might project the data into a higher dimension such that a linear separator *would* be sufficient.
#
# For example, one simple projection we could use would be to compute a **radial basis function** centered on the middle clump.
#
# First, a step-by-step look at the pieces of the expression we are about to use, evaluated for a single point:

# In[86]:

(X ** 2)[0]

# In[85]:

(X ** 2)[0].sum()

# In[83]:

(X ** 2).sum(1)[0]

# In[15]:

r = np.exp(-(X ** 2).sum(1))

# We can visualize this extra data dimension using a three-dimensional plot
# - if you are running this notebook live, you will be able to use the sliders to rotate the plot:

# In[16]:

from mpl_toolkits import mplot3d

def plot_3D(elev=30, azim=30, X=X, y=y):
    ax = plt.subplot(projection='3d')
    ax.scatter3D(X[:, 0], X[:, 1], r, c=y, s=50, cmap='autumn')
    ax.view_init(elev=elev, azim=azim)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('r')

interact(plot_3D, elev=[-90, 90], azim=(-180, 180),
         X=fixed(X), y=fixed(y));

# We can see that with this additional dimension,
# - the data becomes trivially linearly separable,
# - by drawing a separating plane at, say, *r*=0.7.
#
# Here we had to choose and carefully tune our projection:
# - if we had not centered our radial basis function in the right location, we would not have seen such clean, linearly separable results (see the sketch below).
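# For instance (a sketch, not part of the original text), the same kind of radial basis function centered instead at an arbitrary point such as (1, 1) is of little help: the values it assigns to the two classes overlap heavily, so no single threshold on the new coordinate separates them.

# In[ ]:

r_bad = np.exp(-((X - [1, 1]) ** 2).sum(1))  # an RBF centered at (1, 1) rather than at the origin
for label in (0, 1):
    # compare the range of off-center RBF values within each class
    print('class', label, ':', r_bad[y == label].min(), 'to', r_bad[y == label].max())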
# Having to make such a choice by hand is a problem: we would like to somehow automatically find the best basis functions to use.

# ## Strategy
#
# Compute a basis function centered at *every* point in the dataset, and let the SVM algorithm sift through the results.
# - This type of basis function transformation is known as a **kernel transformation**,
# - as it is based on a *similarity relationship* (or *kernel*) between each pair of points.
#
# ### Problem
#
# Projecting $N$ points into $N$ dimensions might become very computationally intensive as $N$ grows large.
# - Thanks to the [*kernel trick*](https://en.wikipedia.org/wiki/Kernel_trick), a fit on kernel-transformed data can be done implicitly,
# - without ever building the full $N$-dimensional representation of the kernel projection!
# - This kernel trick is built into the SVM, and is one of the reasons the method is so powerful.
#
# In Scikit-Learn, we can apply kernelized SVM simply by changing our linear kernel to an RBF (radial basis function) kernel, using the ``kernel`` model hyperparameter:

# In[17]:

clf = SVC(kernel='rbf', C=1E6)
clf.fit(X, y)

# In[18]:

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=300, lw=1, facecolors='none', edgecolor='black');

# Using this kernelized support vector machine, we learn a suitable nonlinear decision boundary.
#
# This kernel transformation strategy is used often in machine learning to turn fast linear methods into fast nonlinear methods, especially for models in which the kernel trick can be used.

# ### Tuning the SVM: Softening Margins
#
# Our discussion thus far has centered around very clean datasets, in which a perfect decision boundary exists.
#
# But what if your data has some amount of overlap?
#
# For example, you may have data like this:

# In[20]:

X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=1)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn');

# The SVM implementation has a bit of a fudge-factor which "softens" the margin:
# - it allows some of the points to creep into the margin if that allows a better fit.
#
# The hardness of the margin is controlled by a tuning parameter, most often known as $C$.
# - For very large $C$, the margin is hard, and points cannot lie in it.
# - For smaller $C$, the margin is softer, and can grow to encompass some points.
#
# The plot shown below gives a visual picture of how a changing $C$ parameter affects the final fit, via the softening of the margin:

# In[21]:

X, y = make_blobs(n_samples=100, centers=2,
                  random_state=0, cluster_std=1)

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

for axi, C in zip(ax, [50.0, 0.1]):
    model = SVC(kernel='linear', C=C).fit(X, y)
    axi.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
    plot_svc_decision_function(model, axi)
    axi.scatter(model.support_vectors_[:, 0],
                model.support_vectors_[:, 1],
                s=300, lw=1, facecolors='none', edgecolor='black')
    axi.set_title('C = {0:.1f}'.format(C), size=14)

# The optimal value of the $C$ parameter will depend on your dataset, and should be tuned using cross-validation or a similar procedure (refer back to [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb)).
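# As a minimal sketch of that procedure (not part of the original text; the candidate values of ``C`` below are purely illustrative), we could cross-validate a small grid of ``C`` values on the overlapping data above:

# In[ ]:

from sklearn.model_selection import GridSearchCV

grid_C = GridSearchCV(SVC(kernel='linear'),
                      {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid_C.fit(X, y)
print(grid_C.best_params_)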
# ## Example: Face Recognition
#
# As an example of support vector machines in action, let's take a look at the facial recognition problem.
# - We will use the Labeled Faces in the Wild dataset, which consists of several thousand collated photos of various public figures.
#
# A fetcher for the dataset is built into Scikit-Learn:

# In[22]:

from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=57)
print(faces.target_names)
print(faces.images.shape)

# Let's plot a few of these faces to see what we're working with:

# In[23]:

fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])

# Each image contains 62×47, or nearly 3,000, pixels.
#
# We could proceed by simply using each pixel value as a feature,
# - but often it is more effective to use some sort of preprocessor to extract more meaningful features;
# - here we will use a principal component analysis (see [In Depth: Principal Component Analysis](05.09-Principal-Component-Analysis.ipynb)) to extract 150 fundamental components to feed into our support vector machine classifier.
#
# We can do this most straightforwardly by packaging the preprocessor and the classifier into a single pipeline:

# In[24]:

from sklearn.svm import SVC
from sklearn.decomposition import PCA  # RandomizedPCA has been removed; PCA with svd_solver='randomized' is the current equivalent
from sklearn.pipeline import make_pipeline

pca = PCA(n_components=150, whiten=True, svd_solver='randomized', random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# For the sake of testing our classifier output, we will split the data into a training and testing set:

# In[27]:

from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target,
                                                random_state=42)

# Finally, we can use a grid search cross-validation to explore combinations of parameters.
#
# Here we will adjust ``C`` (which controls the margin hardness) and ``gamma`` (which controls the size of the radial basis function kernel), and determine the best model:

# In[29]:

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)

get_ipython().run_line_magic('time', 'grid.fit(Xtrain, ytrain)')
print(grid.best_params_)

# The optimal values fall toward the middle of our grid; if they fell at the edges, we would want to expand the grid to make sure we have found the true optimum.
#
# Now with this cross-validated model, we can predict the labels for the test data, which the model has not yet seen:

# In[30]:

model = grid.best_estimator_
yfit = model.predict(Xtest)

# Let's take a look at a few of the test images along with their predicted values:

# In[31]:

fig, ax = plt.subplots(4, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(Xtest[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    axi.set_ylabel(faces.target_names[yfit[i]].split()[-1],
                   color='black' if yfit[i] == ytest[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=14);

# We can get a better sense of our estimator's performance using the classification report,
# - which lists recovery statistics label by label:

# In[32]:

from sklearn.metrics import classification_report
print(classification_report(ytest, yfit,
                            target_names=faces.target_names))
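# If a single summary number is wanted alongside the per-label report above, one option (a minimal sketch, not part of the original text) is the overall accuracy on the test set:

# In[ ]:

from sklearn.metrics import accuracy_score

# fraction of test images labeled correctly; note that a single number like this
# hides the label-by-label differences visible in the classification report
print(accuracy_score(ytest, yfit))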
# We might also display the confusion matrix between these classes:
# - this helps us get a sense of which labels are likely to be confused by the estimator.

# In[33]:

from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=faces.target_names,
            yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

# For a real-world facial recognition task, in which the photos do not come pre-cropped into nice grids, the only difference in the facial classification scheme is the feature selection:
# - you would need to use a more sophisticated algorithm to find the faces, and to extract features that are independent of the pixellation.
# - For this kind of application, one good option is to make use of [OpenCV](http://opencv.org), which includes pre-trained implementations of state-of-the-art feature extraction tools for images in general and faces in particular.

# ## Support Vector Machine Summary
#
# We have seen here a brief intuitive introduction to the principles behind support vector machines.
#
# These are powerful classification methods for a number of reasons:
#
# - Their dependence on relatively few support vectors means that they are very compact models, and take up very little memory.
# - Once the model is trained, the prediction phase is very fast.
# - Because they are affected only by points near the margin, they work well with high-dimensional data, even data with more dimensions than samples, which is a challenging regime for other algorithms.
# - Their integration with kernel methods makes them very versatile, able to adapt to many types of data.
#
# However, SVMs have several disadvantages as well:
#
# - The scaling with the number of samples $N$ is $\mathcal{O}[N^3]$ at worst, or $\mathcal{O}[N^2]$ for efficient implementations.
#   - For large numbers of training samples, this computational cost can be prohibitive.
# - The results are strongly dependent on a suitable choice for the softening parameter $C$.
#   - This must be carefully chosen via cross-validation, which can be expensive as datasets grow in size.
# - The results do not have a direct probabilistic interpretation.
#   - Probabilities can be estimated via an internal cross-validation (see the ``probability`` parameter of ``SVC``), but this extra estimation is costly.
#
# With those traits in mind, I generally only turn to SVMs once other simpler, faster, and less tuning-intensive methods have been shown to be insufficient for my needs.
#
# Nevertheless, if you have the CPU cycles to commit to training and cross-validating an SVM on your data, the method can lead to excellent results.

# < [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) | [Contents](Index.ipynb) | [In-Depth: Decision Trees and Random Forests](05.08-Random-Forests.ipynb) >