Programs with parameters that adjust automatically based on previously seen data.
Intelligent systems find patterns and discover relations that are latent in large volumes of data.
Features of intelligent systems:
Learning is the act of acquiring new, or modifying and reinforcing, existing knowledge, behaviors, skills, values, or preferences and may involve synthesizing different types of information.
Example: Evolving cars with genetic algorithms: http://www.boxcar2d.com/.
More formally, machine learning can be described as: given a dataset $\Psi$ of examples drawn from a domain $\mathcal{D}$, produce a model that maps elements of $\mathcal{D}$ to an outcome set $\mathcal{I}$ and captures the relations latent in $\Psi$. $\renewcommand{\vec}[1]{\mathbf{#1}}$
Note: Generally, $\mathcal{D}\subseteq\mathbb{R}^n$; the definition of $\mathcal{I}$ depends on the problem.
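In the supervised case, a common way to make this concrete (an illustration added here, using the standard empirical risk minimization formulation; the loss $\ell$ and model family $\mathcal{F}$ are assumptions, not part of the original) is: given $\Psi=\left\{(\vec{x}_i, y_i)\right\}_{i=1}^{N}$ with $\vec{x}_i\in\mathcal{D}$ and $y_i\in\mathcal{I}$, learning selects
$$f^{*} = \operatorname*{arg\,min}_{f\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^{N} \ell\left(f(\vec{x}_i),\, y_i\right),$$
where $\mathcal{F}$ is the family of models the algorithm can represent and $\ell$ measures how much a prediction disagrees with the measured outcome.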
Many more: time-series analysis, anomaly detection, imputation, transcription, etc.
An example of a supervised problem (regression)
import random
import numpy as np
import matplotlib.pyplot as plt
plt.rc('text', usetex=True); plt.rc('font', family='serif')
plt.rc('text.latex', preamble='\\usepackage{libertine}\n\\usepackage[utf8]{inputenc}')
# numpy - prettier matrix printing
np.set_printoptions(precision=3, threshold=1000, edgeitems=5, linewidth=80, suppress=True)
import seaborn
seaborn.set(style='whitegrid'); seaborn.set_context('talk')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Fixed seed to make the results replicable - remove in real life!
np.random.seed(42)  # seed NumPy's generator: the noise below comes from np.random
x = np.arange(100)
y_real = np.sin(x/100*2*np.pi)
y_measured = y_real + (np.random.rand(100) - 0.5)  # additive uniform noise in [-0.5, 0.5)
plt.scatter(x,y_measured, marker='.', color='b', label='measured')
plt.plot(x,y_real, color='magenta', label='real')
plt.xlabel('x'); plt.ylabel('y'); plt.legend(frameon=True);
We can now learn from the dataset $\Psi=\left\{x, y_\text{measured}\right\}$.
scikit-learn
Training (adjusting) an SVR
from sklearn.svm import SVR
clf = SVR() # using default parameters
clf.fit(x.reshape(-1, 1), y_measured)  # scikit-learn expects inputs of shape (n_samples, n_features)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
We can now see how our SVR models the data.
y_pred = clf.predict(x.reshape(-1, 1))
plt.scatter(x, y_measured, marker='.', color='blue', label='measured')
plt.plot(x, y_pred, color='green', label='predicted')
plt.xlabel('x'); plt.ylabel('y'); plt.legend(frameon=True);
Here we observe for the first time an important negative phenomenon: overfitting. Instead of recovering only the underlying function, the model has also fitted part of the noise in the measurements.
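One common way to diagnose overfitting (a sketch added here, not part of the original notebook; the variable names are illustrative) is to hold out part of the data and compare the error on the training points with the error on the unseen ones:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# A model that has memorized noise shows a much lower error on the
# points it was trained on than on the held-out points.
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y_measured, test_size=0.3, random_state=42)
clf_check = SVR().fit(x_train, y_train)
print('train MSE:', mean_squared_error(y_train, clf_check.predict(x_train)))
print('test MSE: ', mean_squared_error(y_test, clf_check.predict(x_test)))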
In some cases we can just observe a series of items or values, e.g., $\Psi=\left\{\vec{x}_i\right\}$:
It is necessary to find the hidden structure of unlabeled data.
We need a measure of the quality of the model that does not require an expected outcome; we will try one such measure, the silhouette score, after the clustering example below.
Although it may look a bit awkward at first glance, this type of problem is very common.
Let's generate a dataset composed of three groups, or clusters, of elements $\vec{x}\in\mathbb{R}^2$.
x_1 = np.random.randn(30,2) + (5,5)
x_2 = np.random.randn(30,2) + (10,0)
x_3 = np.random.randn(30,2) + (0,2)
plt.scatter(x_1[:,0], x_1[:,1], c='red', label='Cluster 1')
plt.scatter(x_2[:,0], x_2[:,1], c='blue', label='Cluster 2')
plt.scatter(x_3[:,0], x_3[:,1], c='green', label='Cluster 3')
plt.legend(frameon=True); plt.xlabel('$x_1$'); plt.ylabel('$x_2$');
plt.title('Three clusters');
Preparing the training dataset.
x = np.concatenate(( x_1, x_2, x_3), axis=0)
x.shape
(90, 2)
plt.scatter(x[:,0], x[:,1], c='m')
plt.title('Training dataset');
We can now try to learn what clusters are in the dataset. We are going to use the $k$-means clustering algorithm.
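As a point of reference, here is a minimal NumPy sketch of what $k$-means (Lloyd's algorithm) does under the hood; this is an added illustration, and scikit-learn's implementation is more robust (k-means++ initialization, handling of empty clusters, multiple restarts):
def kmeans_sketch(points, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Start from k points picked at random (k-means++ chooses them more carefully).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        # (May fail if a cluster becomes empty; scikit-learn handles that case.)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids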
from sklearn.cluster import KMeans
clus = KMeans(n_clusters=3)
clus.fit(x)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
labels_pred = clus.predict(x)
print(labels_pred)
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
cm=iter(plt.cm.Set1(np.linspace(0,1,len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    plt.scatter(x[labels_pred==label][:,0], x[labels_pred==label][:,1],
                c=next(cm), label='Pred. cluster ' + str(label+1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45,1), frameon=True);
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Clusters predicted');
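As promised above, the silhouette coefficient gives us a measure of clustering quality that needs no expected outcome (an added illustration; it compares, for each point, the mean distance to its own cluster with the mean distance to the nearest other cluster):
from sklearn.metrics import silhouette_score
# Mean silhouette over all points: values near 1 indicate compact,
# well-separated clusters; values near 0 indicate overlapping ones.
print('silhouette:', silhouette_score(x, labels_pred))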
Having to set the number of clusters in advance can lead to problems: $k$-means will produce exactly as many clusters as we ask for, whether or not the data supports them.
clus = KMeans(n_clusters=10)
clus.fit(x)
labels_pred = clus.predict(x)
cm=iter(plt.cm.Set1(np.linspace(0,1,len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    plt.scatter(x[labels_pred==label][:,0], x[labels_pred==label][:,1],
                c=next(cm), label='Pred. cluster ' + str(label+1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45,1), frameon=True)
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Ten clusters predicted');
This is a cornerstone issue of machine learning, and we will be coming back to it.
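One common heuristic for choosing the number of clusters (a sketch added here, not part of the original slides) is the elbow method: plot the within-cluster sum of squares, exposed by scikit-learn as KMeans's inertia_ attribute, for a range of values of $k$ and look for the point where the decrease levels off. For this dataset we would expect the elbow near $k=3$, since we generated three clusters.
# Inertia always decreases as k grows; the 'elbow' where it stops
# dropping sharply suggests a reasonable number of clusters.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k).fit(x).inertia_ for k in ks]
plt.plot(ks, inertias, marker='o')
plt.xlabel('$k$'); plt.ylabel('inertia'); plt.title('Elbow method');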
The machine learning flowchart
%load_ext version_information
%version_information scipy, numpy, matplotlib
Software | Version |
---|---|
Python | 3.6.1 64bit [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] |
IPython | 5.3.0 |
OS | Darwin 16.5.0 x86_64 i386 64bit |
scipy | 0.19.0 |
numpy | 1.12.1 |
matplotlib | 2.0.0 |
Sat Apr 08 17:01:36 2017 -03 |
# this code is here for cosmetic reasons - it loads the notebook's custom styling
from IPython.core.display import HTML
from urllib.request import urlopen
HTML(urlopen('https://raw.githubusercontent.com/lmarti/jupyter_custom/master/custom.include').read().decode('utf-8'))