import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import pandas as pd
import sklearn.neighbors
from sklearn.utils import shuffle
from scipy.spatial import Voronoi, voronoi_plot_2d
def gen_2d_classification_samples(n_samples=20):
    cov = np.diag([2., 2.])
    x1 = np.random.multivariate_normal(mean=[0., 0.], cov=cov, size=n_samples)
    y1 = np.full(n_samples, 1, dtype=int)
    x2 = np.random.multivariate_normal(mean=[4., 0.], cov=cov, size=n_samples)
    y2 = np.full(n_samples, 2, dtype=int)
    x3 = np.random.multivariate_normal(mean=[2., 4.], cov=cov, size=n_samples)
    y3 = np.full(n_samples, 3, dtype=int)
    X = np.concatenate([x1, x2, x3])
    y = np.concatenate([y1, y2, y3])
    df = pd.DataFrame(X, columns=['x1', 'x2'])
    df['y'] = y
    df = shuffle(df).reset_index(drop=True)
    return df
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
def plot_2d_classification_samples(dataframe, model=None, voronoi=False):
    plt.figure(figsize=(8, 8))
    df = dataframe  # make an alias
    ERROR_MSG1 = "The `dataframe` parameter should be a Pandas DataFrame having the following columns: ['x1', 'x2', 'y']"
    assert df.columns.values.tolist() == ['x1', 'x2', 'y'], ERROR_MSG1
    ERROR_MSG2 = "The `dataframe` parameter should be a Pandas DataFrame having the following labels (in column 'y'): [1, 2, 3]"
    labels = pd.unique(df.y).tolist()
    labels.sort()
    assert labels == [1, 2, 3] or labels == [1, 3], ERROR_MSG2
    if model is not None:
        if voronoi:
            # Compute the Voronoi cells
            vor = Voronoi(df[['x1', 'x2']])
            # Plot the Voronoi diagram
            fig = voronoi_plot_2d(vor, show_vertices=False, show_points=False)
            fig.set_size_inches(8, 8)
        # Compute the model's decision boundaries
        h = .02  # step size in the mesh
        x_min, x_max = df.x1.min() - 1, df.x1.max() + 1
        y_min, y_max = df.x2.min() - 1, df.x2.max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        # Plot the model's decision boundaries
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=0.5)
    # Plot also the training points
    plt.scatter(df.x1, df.x2, c=df.y, cmap=cmap_bold, edgecolor='k', s=30)
    plt.xlabel(r"$x_1$", fontsize=16)
    plt.ylabel(r"$x_2$", fontsize=16)
Today you will implement one of the simplest (but quite powerful) machine learning algorithms: the Nearest Neighbor algorithm and its extension, the k-Nearest Neighbors algorithm (kNN). Both can be used for classification and regression tasks.
Given a dataset $\mathcal{D}=\{(\boldsymbol{x}_i, y_i)\}_{i=1,\dots,n}$ of $n$ labeled examples, the Nearest Neighbor model assigns to an input vector $\boldsymbol{x}$ the label $y_{\arg\min_{i=1,\dots,n} d(\boldsymbol{x}, \boldsymbol{x}_i)}$ of its closest neighbor in $\mathcal{D}$.
The distance function $d$ is used to define what the closest neighbor is. It can be any metric, but the Minkowski distance (especially the classical Euclidean distance $d_2$) is the most common choice. It is defined as follows:
$$d_q: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$$
$$\boldsymbol{u}, \boldsymbol{v} \mapsto ||\boldsymbol{u} - \boldsymbol{v}||_q = \left( \sum_{j=1}^p |u_j - v_j|^q \right)^{1/q}$$
When $q=2$, $d_q$ is the Euclidean distance
$$d_2(\boldsymbol{u}, \boldsymbol{v}) = \sqrt{\sum_{j=1}^{p} (u_j - v_j)^2}$$
When $q=1$, $d_q$ is the Manhattan distance
$$d_1(\boldsymbol{u}, \boldsymbol{v}) = \sum_{j=1}^{p} |u_j - v_j|$$
When $q=\infty$, $d_q$ is the Chebyshev distance
$$d_{\infty}(\boldsymbol{u}, \boldsymbol{v}) = \max_{j=1,\dots,p} |u_j - v_j|$$
We consider the following dataset (where 'x1' and 'x2' are the examples' features and where 'y' is the examples' label):
data = [[0, 0, 1],
[0, 1, 1],
[1, 1, 2],
[1, 0, 3]]
df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
df
plot_2d_classification_samples(df)
What label will be predicted by the Nearest Neighbor algorithm for the point $x = \begin{pmatrix}0 \\ 0.5\end{pmatrix}$?
...
If you have $n$ examples in $p$ dimensions in the dataset, what is the training error of the Nearest Neighbor algorithm (in classification)? Why?
...
'1' (the label of the "red" class).
0, because the nearest neighbor of each training example $\boldsymbol{x}_i$ is the example $\boldsymbol{x}_i$ itself. Thus none of the training examples is misclassified.
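As a quick check of the first answer, here is a small sketch (reusing the df defined just above) that computes the Euclidean distances from $x = (0, 0.5)$ to every training example and reads the label of the closest one:
# Euclidean distances between x_new = (0, 0.5) and every training example
x_new = np.array([0., 0.5])
distances = np.sqrt(((df[['x1', 'x2']].values - x_new) ** 2).sum(axis=1))
nearest_index = np.argmin(distances)   # index of the closest example
print(df.loc[nearest_index, 'y'])      # prints 1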
Consider this new dataset (where 'volume (mL)' and 'caffeine (g)' are the examples' features and where 'drink' is the label):
data = [[250, 0.025, 'tea'],
        [100, 0.01,  'tea'],
        [125, 0.05,  'coffee'],
        [250, 0.1,   'coffee']]
df = pd.DataFrame(data, columns=['volume (mL)', 'caffeine (g)', 'drink'])
df
Use the Nearest Neighbor method to predict the label of a 125 mL drink containing 0.015 g of caffeine.
...
What is wrong with this prediction? How can this problem be solved?
...
The nearest example is $(125, 0.050)$.
Thus the predicted label for a 125 mL drink containing 0.015 g of caffeine is "coffee".
Yet the drink is actually more likely to be a "tea" (considering the caffeine feature).
The prediction is dominated by the volume; the caffeine content is essentially ignored because the two features are not normalized (they live on very different scales)!
Thus the values of these two features have to be normalized (i.e. rescaled so that they lie in comparable ranges) before distances are computed.
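A minimal sketch of this fix, using scikit-learn's MinMaxScaler (from sklearn.preprocessing) to rescale both features to [0, 1] before fitting a 1-Nearest Neighbor classifier on the drink dataset defined above:
from sklearn.preprocessing import MinMaxScaler

# Rescale both features to [0, 1] so that they contribute comparably to the distance
scaler = MinMaxScaler()
X = scaler.fit_transform(df[['volume (mL)', 'caffeine (g)']].values)

model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
model.fit(X, df['drink'])

# The query point must be rescaled with the same scaler
x_new = scaler.transform([[125., 0.015]])
print(model.predict(x_new))   # now predicts 'tea'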
Let's play with the Scikit-Learn implementation of the Nearest Neighbor algorithm. The official documentation is available here: https://scikit-learn.org/stable/modules/neighbors.html
We begin with a "toy" classification problem.
Use the gen_2d_classification_samples() function (defined above) to generate a dataset.
df = gen_2d_classification_samples(n_samples=20)
df.head()
Here, examples are defined in $\mathbb{R}^2$ (features are stored in columns 'x1' and 'x2'). The examples' labels are defined in the 'y' column.
The 'y' column contains three possible labels: "1", "2" and "3", respectively represented by the red, green and blue colors in the following figure.
plot_2d_classification_samples(df)
Thus this toy problem is a multiclass classification problem.
Once the dataset is ready, let's make the classifier and train it with the following code:
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
model.fit(X=df[['x1', 'x2']], y=df['y'])
Use the model.predict() function to guess the class of the following points: $(-2, 2)$, $(2, 6)$ and $(6, 0)$.
Is the training step (the model.fit() function) longer to execute than the prediction step (the model.predict() function)? Why?
...
The next cell shows the decision boundary of the model. Explain what a decision boundary is in classification.
...
plot_2d_classification_samples(df, model=model)
The next cell generates the Voronoï diagram of the dataset. What does this figure illustrate about the Nearest Neighbor method?
The Voronoï diagram partitions the feature space $\mathcal{X}$ into cells. What do these cells represent?
...
plot_2d_classification_samples(df, model=model, voronoi=True);
model.predict([[-2., 2.],
[2., 6.],
[6., 0.]])
i.e. $y_{p1} = 1$ (red), $y_{p2} = 3$ (blue), $y_{p3} = 2$ (green).
No. model.fit() does (almost) nothing in Nearest Neighbor models as they have nothing to "learn" (i.e. these models have no parameters to optimize).
The model.predict() function does all the work: it computes the distances to the examples in the dataset and finds the example closest to the point to predict. This function may be slow to execute if there are many examples in the dataset.
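A rough way to check this (a sketch; the exact timings depend on your machine and on the dataset size) is to time both calls on a larger random dataset:
import time

# A larger random dataset to make the difference visible
X_big = np.random.uniform(size=(20000, 2))
y_big = np.random.randint(1, 4, size=20000)

big_model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)

t0 = time.perf_counter()
big_model.fit(X_big, y_big)
t1 = time.perf_counter()
big_model.predict(X_big)
t2 = time.perf_counter()

print("fit:     {:.3f} s".format(t1 - t0))
print("predict: {:.3f} s".format(t2 - t1))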
It is a partition of the feature space $\mathcal{X}$ into regions, each region gathering the points that the classifier assigns to a given class.
The decision boundaries are the borders of these regions.
There is one "cell" per example of $\mathcal{D}$. Each cell defines the area of influence of its example (called a seed): any point $\boldsymbol{x}$ that falls in a cell receives the same label as its seed example.
After the "toy" classification problem, let's work on a toy regression problem.
The next cell generates a dataset (where 'x' is the feature and 'y' the label to predict).
N_SAMPLES = 40
x = np.random.uniform(low=-10., high=10., size=N_SAMPLES)
y = 2. * x + 3. + np.random.normal(scale=3., size=x.shape)
df = pd.DataFrame(np.array([x, y]).T, columns=['x', 'y'])
#df.plot(x='x', y='y', style='o-')
df.plot.scatter(x='x', y='y');
Once the dataset is ready, let's make the regressor and train it with the following code:
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=1)
model.fit(df[['x']], df['y'])
Use the model.predict() function to guess the value of the label for the following points: $x=-2$, $x=2$ and $x=6$.
model.predict([[-2.],
[2.],
[6.]])
Random noise is added to the examples' labels, thus you will get different (but close) results.
x_pred = np.arange(-10, 10, 0.1).reshape(-1, 1)
y_pred = model.predict(x_pred)
df_pred = pd.DataFrame(np.array([x_pred.flatten(), y_pred.flatten()]).T, columns=['x', 'y'])
ax = df.plot.scatter(x='x', y='y')
df_pred.plot(x='x', y='y', style='r--', ax=ax);
Do you think this model has good generalization performance? Why?
...
This model learns the trend of the data (which is desired) but it also learns the noise contained in the dataset (which is not desired).
Thus it will give poor results on new observations; in other words, this model does not generalize well.
It won't make good predictions on unknown data.
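One way to quantify this (a sketch, using scikit-learn's train_test_split on the dataset generated above) is to hold out part of the data and compare the training score with the test score:
from sklearn.model_selection import train_test_split

# Hold out 30% of the examples to estimate the generalization performance
X_train, X_test, y_train, y_test = train_test_split(
    df[['x']], df['y'], test_size=0.3, random_state=0)

holdout_model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=1)
holdout_model.fit(X_train, y_train)

print("R^2 on training data:", holdout_model.score(X_train, y_train))  # 1.0: each point is its own nearest neighbor
print("R^2 on test data:    ", holdout_model.score(X_test, y_test))    # lower (exact value depends on the random data)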
The Nearest Neighbor method is very sensitive to noise: if an example in $\mathcal{D}$ is wrongly labeled or misplaced, all points in its Voronoï cell will be wrong too. The k-Nearest Neighbors method fixes this weakness by considering, for each prediction, the labels of several neighbors instead of just one.
Given a dataset $\mathcal{D}=\{(\boldsymbol{x}_i, y_i)\}_{i=1,\dots,n}$ of $n$ labeled examples and a meta-parameter $k \in \mathbb{N}^*$, the $k$-Nearest Neighbors model assigns to an input vector $\boldsymbol{x}$ the most frequent label among its $k$ closest neighbors in $\mathcal{D}$. Let's write $\mathcal{N}_k(\boldsymbol{x})$ for the set of the $k$ nearest neighbors of $\boldsymbol{x}$ in $\mathcal{D}$.
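Formally, for classification the predicted label of $\boldsymbol{x}$ is obtained by a majority vote over $\mathcal{N}_k(\boldsymbol{x})$ (with $\mathbb{1}$ the indicator function), and for regression it is the average of the neighbors' labels:
$$\hat{y}(\boldsymbol{x}) = \arg\max_{c} \sum_{(\boldsymbol{x}_i, y_i) \in \mathcal{N}_k(\boldsymbol{x})} \mathbb{1}(y_i = c) \qquad \text{(classification)}$$
$$\hat{y}(\boldsymbol{x}) = \frac{1}{k} \sum_{(\boldsymbol{x}_i, y_i) \in \mathcal{N}_k(\boldsymbol{x})} y_i \qquad \text{(regression)}$$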
We consider the following dataset (where x1 and x2 are the examples' features and where y is the examples' label):
data = [[1, 2, '+'],
[2, 1, '+'],
[2, 2, '-'],
[2, 3, '+'],
[3, 1, '-'],
[3, 2, '+']]
df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
df
Draw this dataset on a sheet of paper.
Draw the decision boundary of a Nearest Neighbor model (i.e. 1NN).
Draw the decision boundary of a 3-Nearest Neighbors model (i.e. 3NN).
How many errors do these two classifiers make on the training dataset?
What label do these two classifiers predict for the point $x = \begin{pmatrix}4 \\ 0.5\end{pmatrix}$?
ax = df.loc[df.y == '+'].plot.scatter(x='x1', y='x2', color='red', label="+", figsize=(8,8))
df.loc[df.y == '-'].plot.scatter(x='x1', y='x2', color='blue', label="-", ax=ax);
df.loc[df.y == '-', "y"] = 3
df.loc[df.y == '+', "y"] = 1
df
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
model.fit(df[['x1', 'x2']], df['y'])
plot_2d_classification_samples(df, model=model)
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=3)
model.fit(df[['x1', 'x2']], df['y'])
plot_2d_classification_samples(df, model=model)
The nearest point is $\begin{pmatrix}3 \\ 1\end{pmatrix}$, labeled "-". Thus for $k = 1$, the predicted label is "-" (blue).
The next two nearest points are $\begin{pmatrix}3 \\ 2\end{pmatrix}$ and $\begin{pmatrix}2 \\ 1\end{pmatrix}$, both labeled "+". Thus for $k = 3$, the predicted label is "+" (red).
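This can be quickly checked with scikit-learn (a sketch; recall that '+' was recoded as 1/red and '-' as 3/blue above):
model_1nn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
model_1nn.fit(df[['x1', 'x2']], df['y'])
model_3nn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=3)
model_3nn.fit(df[['x1', 'x2']], df['y'])

print(model_1nn.predict([[4., 0.5]]))   # 3, i.e. '-' (blue)
print(model_3nn.predict([[4., 0.5]]))   # 1, i.e. '+' (red)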
First we make the dataset.
df = gen_2d_classification_samples()
plot_2d_classification_samples(df)
Then we make the classifier, train it and plot the decision boundaries:
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
model.fit(df[['x1', 'x2']], df['y'])
plot_2d_classification_samples(df, model=model)
The n_neighbors parameter provided to the model's constructor KNeighborsClassifier sets the number of neighbors to consider for each prediction (i.e. n_neighbors is the '$k$' of kNN).
Change the value of this parameter and observe what happens. What is the influence of the number of neighbors on the boundaries?
...
When you face a very noisy dataset (wrong labels, misplaced points, ...), should you increase or decrease $k$?
...
Is the Voronoï diagram useful for the kNN case (i.e. when $k>1$)?
...
Plot the decision boundary with $k=2$ and describe what happens in case of an equal vote.
...
Add the weights="distance" parameter in the KNeighborsClassifier constructor. What changes can you observe on the decision boundary? Explain how labels are computed with this new parameter.
...
A large $k$ produces "smoother" boundaries. In general, a model with e.g. $k=5$ is less influenced by the noise contained in the dataset than a model with $k < 5$, i.e. it generalizes better.
If $k \geq n$ (with $n$ the number of examples in the dataset $\mathcal{D}$), then every predicted point gets the label of the most represented class (see question 4 for the case of equal representation).
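For instance, this smoothing effect can be visualized with the following sketch, which reuses the dataset and the plotting helper defined above to re-fit the classifier for several values of $k$:
# Decision boundaries get smoother as k grows
for k in [1, 3, 7, 15]:
    model_k = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
    model_k.fit(df[['x1', 'x2']], df['y'])
    plot_2d_classification_samples(df, model=model_k)
    plt.title("k = {}".format(k))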
Yes, increasing $k$ is generally a good idea on noisy data.
No. The Voronoï diagram describes the decision regions of the 1-NN model only; when $k>1$ the decision boundary no longer follows these cells.
An arbitrary choice is made among the tied classes.
In Scikit-Learn, with the weights="distance" parameter enabled, the vote of each of the k nearest neighbors is weighted by its proximity to the point $\boldsymbol{x}$ to predict (closer neighbors count more).
This has the drawback of making the model more sensitive to the noise contained in $\mathcal{D}$.
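A minimal sketch of this option, reusing the dataset generated above:
# Same classifier, but each neighbor's vote is weighted by the inverse of its distance
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights="distance")
model.fit(df[['x1', 'x2']], df['y'])
plot_2d_classification_samples(df, model=model)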
First we make the dataset.
N_SAMPLES = 40
x = np.random.uniform(low=-10., high=10., size=N_SAMPLES)
y = 2. * x + 3. + np.random.normal(scale=3., size=x.shape)
df = pd.DataFrame(np.array([x, y]).T, columns=['x', 'y'])
#df.plot(x='x', y='y', style='o-')
df.plot.scatter(x='x', y='y');
Then we make the regressor, train it and plot its predictions:
n_neighbors = 10
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors)
model.fit(df[['x']], df['y'])
x_pred = np.arange(-10, 10, 1).reshape(-1, 1)
y_pred = model.predict(x_pred)
df_pred = pd.DataFrame(np.array([x_pred.flatten(), y_pred.flatten()]).T, columns=['x', 'y'])
ax = df.plot.scatter(x='x', y='y')
df_pred.plot(x='x', y='y', style='r--', ax=ax);
The n_neighbors parameter provided to the model's constructor KNeighborsRegressor sets the number of neighbors to consider for each prediction (i.e. n_neighbors is the '$k$' of kNN).
Change the value of this parameter and observe what happens. What is the influence of the number of neighbors on the decision function?
...
When you face a very noisy dataset (wrong labels, misplaced points, ...), should you increase or decrease $k$?
...
On the one hand, a model with a small $k$ (e.g. $k=2$) is more influenced by the noise contained in the dataset, i.e. it has poorer generalization performance (poor results on unknown data): this is an overfitted model.
On the other hand, if $k$ is too large compared to the number of available examples in $\mathcal{D}$ (e.g. $k=20$ here), the model cannot fit local trends and will also give poor results: this is an underfitted model.
Here, $k=10$ seems to be a good compromise.
There is a tradeoff to find, but keep in mind that a model with a small $k$ is more influenced by the noise contained in the dataset, so a noisy dataset generally calls for a larger $k$.
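To illustrate this tradeoff, here is a sketch that overlays on the same data the predictions obtained with a small ($k=2$), a medium ($k=10$) and a large ($k=20$) number of neighbors:
# Compare the prediction curves obtained with several values of k
x_pred = np.arange(-10, 10, 0.1).reshape(-1, 1)
ax = df.plot.scatter(x='x', y='y')
for k, style in [(2, 'r--'), (10, 'g--'), (20, 'b--')]:
    model_k = sklearn.neighbors.KNeighborsRegressor(n_neighbors=k)
    model_k.fit(df[['x']], df['y'])
    df_k = pd.DataFrame({'x': x_pred.flatten(), 'y': model_k.predict(x_pred)})
    df_k.plot(x='x', y='y', style=style, label="k={}".format(k), ax=ax)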
Solve the Titanic problem with the k-Nearest Neighbors method. Reuse the code of the first lab session.
Write your own implementation of the k-Nearest Neighbors algorithm.
Write a knn() function that takes two arguments:
This function should return the sequence of predicted labels.
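A minimal sketch of such an implementation is given below. Since the exact expected signature is not specified above, it assumes (this is an assumption) that the two arguments are the training DataFrame (with feature columns 'x1', 'x2' and label column 'y') and an array of points to classify, with the number of neighbors as an optional keyword argument:
def knn(train_df, X_new, k=3):
    """Sketch of a k-Nearest Neighbors classifier (Euclidean distance, majority vote).

    `train_df`: DataFrame with feature columns 'x1', 'x2' and a label column 'y' (assumed).
    `X_new`: array of shape (m, 2) containing the points to classify (assumed).
    Returns the sequence of predicted labels.
    """
    X_train = train_df[['x1', 'x2']].values
    y_train = train_df['y'].values
    predictions = []
    for x in np.asarray(X_new):
        # Euclidean distances between x and every training example
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Labels of the k closest examples
        nearest_labels = y_train[np.argsort(distances)[:k]]
        # Majority vote
        values, counts = np.unique(nearest_labels, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return predictions
It can be checked against sklearn.neighbors.KNeighborsClassifier on the toy datasets used above.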