# 1. Introduction

### Luis Martí

#### Instituto de Computação

$\newcommand{\vec}[1]{\boldsymbol{#1}}$

• The slides are programmed as a Jupyter/IPython notebook.
• Feel free to try them and experiment on your own by launching the notebooks.

• You can run the notebook online.

If you are using nbviewer, you can switch to slides mode by clicking on the slides icon.

## Machine learning

Programs with parameters that automatically adjust by adapting to previously seen data.

• Machine learning can be considered a subfield of artificial intelligence...
• ...since those algorithms can be seen as building blocks to make computers learn to behave more intelligently.
• Generalize instead of just storing and retrieving data items as a database system would do.

## Machine learning: A modern alchemy

• Data is more abundant (and less expensive) than knowledge.
• Professionals from various areas of industry work on a particular philosopher's stone:

### Turn data into knowledge!

Work in machine learning:
https://stackoverflow.com/insights/survey/2017#salary
Alchemical treatise of [Ramon Llull](https://en.wikipedia.org/wiki/Ramon_Llull).

Intelligent systems find patterns and discover relations that are latent in large volumes of data.

Features of intelligent systems:

• Learning
• Flexibility and robustness
• Provide explanations
• Discovery/creativity

### Learning

Learning is the act of acquiring new, or modifying and reinforcing existing, knowledge, behaviors, skills, values, or preferences, and may involve synthesizing different types of information.

• Construction and study of systems that can learn from data.

• The environment/real world is in constant change.
• The capacity to adapt implies being able to modify what has been learned in order to cope with those changes.
• There are many real-world cases:
• Changes in the economy
• Wear of the mechanical parts of a robot
• In many instances the capacity to adapt is essential to solve the problem $\rightarrow$ continuous learning.

### Flexibility and robustness

• A robust and consistent system is required.
• Similar inputs should generate consistent outputs.
• Self-organization
• 'Classical' approaches based on Boolean algebra and logic have limited flexibility.

### Explanations

• Explanations are necessary to validate and find directions for improvement.
• It is not enough to automate the decision making process.
• In many contexts explanations are necessary: medicine, credit evaluation, etc.
• They are important when a human expert takes part in the decision loop.
• Machine learning can become a research tool.

### Discovery/creativity

• Capacity of discovering processes and/or relations previously unknown.
• Creation of solutions and artifacts.

Example: Evolving cars with genetic algorithms: http://www.boxcar2d.com/.

More formally, machine learning can be described as:

• Having a process $\vec{F}:\mathcal{D}\rightarrow\mathcal{I}$ that transforms a given $\vec{x}\in\mathcal{D}$ into a $\vec{y}\in\mathcal{I}$.
• Constructing a dataset $\Psi=\left\{\left<\vec{x}_i,\vec{y}_i\right>\right\}$ with $i=1,\ldots,N$.
• Each $\left<\vec{x}_i,\vec{y}_i\right>$ represents an input and its corresponding expected output: $\vec{y}_i=\vec{F}\left(\vec{x}_i\right)$.
• Optimizing a model $\mathcal{M}(\vec{x};\vec{\theta})$ by adjusting its parameters $\vec{\theta}$.
• Making $\mathcal{M}()$ as similar as possible to $\vec{F}()$ by optimizing one or more error (loss) functions.

Note: Generally, $\mathcal{D}\subseteq\mathbb{R}^n$; the definition of $\mathcal{I}$ depends on the problem.
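As a hedged, minimal sketch of this setup (the process $\vec{F}$, the data, and the linear model standing in for $\mathcal{M}(\vec{x};\vec{\theta})$ are all illustrative choices):

```python
import numpy as np

# An illustrative process F (in practice F is unknown; we only see its outputs).
F = lambda x: 2.0 * x + 1.0

x = np.random.rand(50)   # inputs x_i drawn from D
y = F(x)                 # expected outputs y_i = F(x_i); the pairs form Psi

# A model M(x; theta) = theta[0]*x + theta[1], fitted by least squares.
theta = np.polyfit(x, y, deg=1)
y_hat = np.polyval(theta, x)

print('theta:', theta)                    # should recover approximately (2, 1)
print('MSE:', np.mean((y_hat - y) ** 2))  # the error (loss) of the model
```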

## Classes of machine learning problems

• Classification: $\vec{F}: \mathcal{D}\rightarrow\left\{1,\ldots, k\right\}$; $\vec{F}(\cdot)$ assigns 'category' or 'class' labels.
• Regression: $\vec{F}: \mathbb{R}^n\rightarrow\mathbb{R}$; it is necessary to predict a real-valued output instead of categories.
• Density estimation: predict a function $p_\mathrm{model}: \mathbb{R}^n\rightarrow\mathbb{R}$, where $p_\mathrm{model}(\vec{x})$ can be interpreted as a probability density function on the set the examples were drawn from (a short sketch follows this list).
• Clustering: group a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters).
• Synthesis: generate new examples that are similar to those in the training data.

Many more: time-series analysis, anomaly detection, imputation, transcription, etc.
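As a hedged sketch of one of these classes, density estimation, using scikit-learn's `KernelDensity` (the data and the bandwidth are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

samples = np.random.randn(200, 1)   # examples drawn from an unknown p(x)
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(samples)

# score_samples returns log p_model(x); for standard normal data the
# density near 0 should be far larger than the density near 3.
log_p = kde.score_samples([[0.0], [3.0]])
print(np.exp(log_p))
```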

## Supervised learning

• Sometimes we can observe the pairs $\left<\vec{x}_i,\vec{y}_i\right>$:
• We can use the $\vec{y}_i$'s to provide a scalar feedback on how good the model $\mathcal{M}(\vec{x};\vec{\theta})$ is.
• That feedback is known as the loss function (a tiny sketch follows).
• Modify the parameters $\vec{\theta}$ so as to improve $\mathcal{M}(\vec{x};\vec{\theta})$ $\rightarrow$ learning.
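Before the full example, a tiny sketch of that scalar feedback, using the mean squared error as an illustrative loss:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])   # expected outputs y_i
y_hat = np.array([1.1, 1.9, 3.3])    # outputs of some model M(x; theta)

# The loss condenses the model's performance into a single scalar.
print(np.mean((y_hat - y_true) ** 2))
```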

An example of a supervised problem (regression)

In [1]:
import random
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# plt.rc('text', usetex=True); plt.rc('font', family='serif')
# plt.rc('text.latex', preamble='\\usepackage{libertine}\n\\usepackage[utf8]{inputenc}')

# numpy - pretty matrix
np.set_printoptions(precision=3, threshold=1000, edgeitems=5, linewidth=80, suppress=True)

import seaborn
seaborn.set(style='whitegrid')
seaborn.set_context('talk')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
# Fixed seed to make the results replicable - remove in real life!
random.seed(42)
np.random.seed(42)  # the noise below uses np.random, which has its own seed

In [4]:
x = np.arange(100)


Let's suppose that we have a phenomenon such that $$y_\text{real} = \sin\left(\frac{\pi x}{50}\right)\,.$$

In [5]:
y_real = np.sin(x*np.pi/50)


Introducing some uniform random noise to simulate measurement noise:

In [6]:
y_measured = y_real + (np.random.rand(100) - 0.5)  # uniform noise in [-0.5, 0.5)

In [7]:
plt.scatter(x,y_measured, marker='.', color='b', label='measured')
plt.plot(x,y_real, color='r', label='real')
plt.xlabel('x'); plt.ylabel('y'); plt.legend(frameon=True);


We can now learn from the dataset $\Psi=\left\{\left<x_i, y_{\text{measured},i}\right>\right\}$.

In [8]:
from sklearn.svm import SVR

In [9]:
clf = SVR() # using default parameters
clf.fit(x.reshape(-1, 1), y_measured)

Out[9]:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

We can now see how our SVR models the data.

In [10]:
y_pred = clf.predict(x.reshape(-1, 1))

In [11]:
plt.scatter(x, y_measured, marker='.', color='blue', label='measured')
plt.plot(x, y_pred, 'g--', label='predicted')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);


We observe for the first time an important negative phenomenon: overfitting.

We will dedicate part of the course to the methods we have for controlling overfitting.
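A hedged sketch of how overfitting could be detected here: hold out part of the data and compare training and test errors (the split ratio is an arbitrary choice). A large gap between the two errors is a symptom of overfitting.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 30% of the points; fit on the rest.
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y_measured, test_size=0.3, random_state=42)

model = SVR().fit(x_train, y_train)
print('train MSE:', mean_squared_error(y_train, model.predict(x_train)))
print('test MSE: ', mean_squared_error(y_test, model.predict(x_test)))
```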

In [12]:
clf = SVR(C=1e3, gamma=0.0001)
clf.fit(x.reshape(-1, 1), y_measured)

Out[12]:
SVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.0001,
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [13]:
y_pred_ok = clf.predict(x.reshape(-1, 1))

In [14]:
plt.scatter(x, y_measured, marker='.', color='b', label='measured')
plt.plot(x, y_pred, 'g--', label='overfitted')
plt.plot(x, y_pred_ok, 'm-', label='not overfitted')
plt.xlabel('X'); plt.ylabel('y'); plt.legend(frameon=True);


## Unsupervised learning

In some cases we can just observe a series of items or values, e.g., $\Psi=\left\{\vec{x}_i\right\}$:

• It is necessary to find the hidden structure of unlabeled data.
• We need a measure of correctness of the model that does not require an expected outcome.
• Although, at first glance, it may look a bit awkward, this type of problem is very common.

• Related to anomaly detection, clustering, etc.

### An unsupervised learning example: Clustering

Let's generate a dataset composed of three groups, or clusters, of elements $\vec{x}\in\mathbb{R}^2$.

In [15]:
x_1 = np.random.randn(30,2) + (5,5)
x_2 = np.random.randn(30,2) + (10,0)
x_3 = np.random.randn(30,2) + (0,2)

In [16]:
plt.scatter(x_1[:,0], x_1[:,1], c='red', label='Cluster 1', alpha=0.74)
plt.scatter(x_2[:,0], x_2[:,1], c='blue', label='Cluster 2', alpha=0.74)
plt.scatter(x_3[:,0], x_3[:,1], c='green', label='Cluster 3', alpha=0.74)
plt.legend(frameon=True); plt.xlabel('$x_1$'); plt.ylabel('$x_2$');
plt.title('Three clusters');


Preparing the training dataset.

In [17]:
x = np.concatenate((x_1, x_2, x_3), axis=0)
x.shape

Out[17]:
(90, 2)
In [18]:
plt.scatter(x[:,0], x[:,1], c='m', alpha=0.74)
plt.title('Training dataset');


We can now try to learn which clusters are present in the dataset. We are going to use the $k$-means clustering algorithm.

In [19]:
from sklearn.cluster import KMeans

In [20]:
clus = KMeans(n_clusters=3)
clus.fit(x)

Out[20]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
In [21]:
labels_pred = clus.predict(x)
print(labels_pred)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 0 2 2 2]

In [22]:
cm = iter(plt.cm.Set1(np.linspace(0, 1, len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    plt.scatter(x[labels_pred == label][:, 0], x[labels_pred == label][:, 1],
                c=next(cm), alpha=0.74, label='Pred. cluster ' + str(label + 1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45, 1), frameon=True);
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Clusters predicted');


Needing to set the number of clusters in advance can lead to problems, as the following example shows.

In [23]:
clus = KMeans(n_clusters=10)
clus.fit(x)
labels_pred = clus.predict(x)

In [24]:
cm = iter(plt.cm.Set1(np.linspace(0, 1, len(np.unique(labels_pred)))))
for label in np.unique(labels_pred):
    plt.scatter(x[labels_pred == label][:, 0], x[labels_pred == label][:, 1],
                c=next(cm), alpha=0.74, label='Pred. cluster ' + str(label + 1))
plt.legend(loc='upper right', bbox_to_anchor=(1.45, 1), frameon=True)
plt.xlabel('$x_1$'); plt.ylabel('$x_2$'); plt.title('Ten clusters predicted');
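A hedged sketch of one way around this: score several candidate values of $k$ with the silhouette coefficient and keep the best one (the range of candidates below is arbitrary):

```python
from sklearn.metrics import silhouette_score

# Higher silhouette scores indicate better-separated clusters; the score
# would be expected to peak near the true number of clusters (here, 3).
for k in range(2, 7):
    labels = KMeans(n_clusters=k).fit_predict(x)
    print(k, silhouette_score(x, labels))
```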


## Semi-supervised learning

• Obtaining a supervised learning dataset can be expensive.
• Sometimes it can be complemented with a "cheaper" unsupervised learning dataset.
• What if we first learn as much as possible from the unlabeled data and only then use the labeled dataset? (A minimal sketch follows.)
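As a hedged sketch of that idea with scikit-learn's `LabelPropagation` (the two-blob data and the choice of labeled points are illustrative):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two blobs of points; unlabeled examples carry the conventional label -1.
X = np.concatenate((np.random.randn(50, 2), np.random.randn(50, 2) + (4, 4)))
y = np.full(100, -1)
y[0], y[50] = 0, 1   # only one labeled example per class

semi = LabelPropagation().fit(X, y)
print(semi.transduction_[:5], semi.transduction_[50:55])   # inferred labels
```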

## Reinforcement learning

• Inspired by behaviorist psychology;
• How to take actions in an environment so as to maximize some notion of cumulative reward?
• Differs from standard supervised learning in that correct input/output pairs are never presented,
• ...nor sub-optimal actions explicitly corrected.
• Involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge), as in the sketch below.
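A hedged sketch of that balance: an $\epsilon$-greedy agent on a two-armed bandit (all numbers are illustrative):

```python
import numpy as np

true_means = [0.3, 0.7]           # unknown expected reward of each arm
estimates, counts = [0.0, 0.0], [0, 0]
epsilon = 0.1                     # probability of exploring a random arm

for t in range(1000):
    if np.random.rand() < epsilon:
        arm = np.random.randint(2)        # explore uncharted territory
    else:
        arm = int(np.argmax(estimates))   # exploit current knowledge
    reward = np.random.rand() < true_means[arm]   # stochastic 0/1 reward
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # running mean

print(estimates)   # should approach the true means, mostly pulling arm 1
```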

## Components of a machine learning problem/solution

• A parametrized family of functions $\mathcal{M}(\vec{x};\vec{\theta})$ describing how the learner will behave on new examples.
• What output will $\mathcal{M}(\vec{x};\vec{\theta})$ produce given some input $\vec{x}$?
• A loss function $\ell()$ describing what scalar loss $\ell(\hat{\vec{y}}, \vec{y})$ is associated with each supervised example $\langle\vec{x}, \vec{y}\rangle$, as a function of the learner's output $\hat{\vec{y}} = \mathcal{M}(\vec{x};\vec{\theta})$ and the target output $\vec{y}$.
• Training consists in choosing the parameters $\vec{\theta}$ given some training examples $\Psi=\left\{\left<\vec{x}_i,\vec{y}_i\right>\right\}$ sampled from an unknown data generating distribution $P(X, Y)$.

## Components of a machine learning problem/solution (II)

• Define a training criterion.

• Ideally: to minimize the expected loss under the unknown data generating distribution.
• This is not possible because the expectation makes use of the true underlying $P()$...
• ...but we only have access to a finite number of training examples, $\Psi$.
• A training criterion usually includes an empirical average of the loss over the training set (minimized as sketched below),

$$\min_{\vec{\theta}}\ \mathbf{E}_{\Psi}[\ell(\mathcal{M}(\vec{x};\vec{\theta}), \vec{y})].$$
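A hedged sketch of minimizing this empirical average by plain gradient descent, on an illustrative linear model with a squared loss:

```python
import numpy as np

xs = np.random.rand(100); ys = 3.0 * xs + 0.5   # a toy training set Psi
theta = np.zeros(2)                             # model: theta[0]*x + theta[1]
lr = 0.5                                        # illustrative step size

for _ in range(500):
    y_hat = theta[0] * xs + theta[1]
    # Gradient of the empirical mean of the squared loss w.r.t. theta.
    grad = np.array([np.mean(2 * (y_hat - ys) * xs),
                     np.mean(2 * (y_hat - ys))])
    theta -= lr * grad

print(theta)   # should approach (3.0, 0.5)
```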

## Components of a machine learning problem/solution (III)

• Some additional terms (called regularizers) can be added to enforce preferences over the choices of $\vec{\theta}$ (see the sketch below):
$$\min_{\vec{\theta}}\ \mathbf{E}_{\Psi}[\ell(\mathcal{M}(\vec{x};\vec{\theta}), \vec{y})] + R_{1}(\vec{\theta})+\cdots+R_{r}(\vec{\theta}).$$
• An optimization procedure to approximately minimize the training criterion by modifying $\vec{\theta}$.
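A hedged sketch of such a regularized criterion, adding an L2 penalty $\lambda\lVert\vec{\theta}\rVert^2$ (the value of $\lambda$ is an illustrative hyperparameter):

```python
import numpy as np

xs = np.random.rand(100); ys = 3.0 * xs + 0.5   # toy data, as before
lam = 0.1                                       # illustrative regularization strength

def criterion(theta, xs, ys):
    """Empirical squared loss plus an L2 penalty on theta."""
    y_hat = theta[0] * xs + theta[1]
    return np.mean((y_hat - ys) ** 2) + lam * np.sum(theta ** 2)

# The penalty biases training toward small parameter values:
print(criterion(np.array([3.0, 0.5]), xs, ys))    # good fit, small penalty
print(criterion(np.array([30.0, 5.0]), xs, ys))   # bad fit and a large penalty
```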

## Datasets and evaluation

• It is clear now that we need a dataset for training (i.e., fitting or optimizing) the model:
• the training dataset.
• We need another dataset to assess progress during training and tune the model:
• the validation dataset.
• As most ML approaches are stochastic, and to contrast different approaches, we need yet another dataset:
• the test dataset (a splitting sketch follows this list).
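A minimal sketch of how such a three-way split could be made with scikit-learn's `train_test_split` (the 60/20/20 ratios are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3); y = np.random.rand(100)   # illustrative data

# Two chained splits carve out the validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5)
print(len(X_train), len(X_val), len(X_test))   # 60 20 20
```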

This is a cornerstone issue of machine learning and we will be coming back to it.

The machine learning flowchart, from Scikit-learn's [Choosing the right estimator](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

# Nature-inspired machine learning

• Cellular automata
• Neural computation
• Evolutionary computation
• Swarm intelligence
• Artificial immune systems
• Membrane computing
• Amorphous computing

# Final remarks

• Different classes of machine learning problems:
• Classification
• Regression
• Clustering
• Different classes of learning scenarios:
• Supervised,
• unsupervised,
• semi-supervised, and
• reinforcement learning.
• Model, dataset, loss function, optimization.

# Homework

In [25]:
%load_ext version_information
%version_information scipy, numpy, matplotlib

Out[25]:
| Software   | Version |
|------------|---------|
| Python     | 3.6.2 64bit [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] |
| IPython    | 6.1.0 |
| OS         | Darwin 17.0.0 x86_64 i386 64bit |
| scipy      | 0.19.1 |
| numpy      | 1.13.1 |
| matplotlib | 2.0.2 |

Tue Aug 22 09:03:13 2017 -03
In [26]:
# this code is here for cosmetic reasons
from IPython.core.display import HTML
from urllib.request import urlopen