Credits: Forked from PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas

Here we'll explore **Gaussian Mixture Models**, an unsupervised technique for both clustering and density estimation.

We'll start with our standard set of initial imports:

In [1]:

```
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
```

We previously saw an example of K-Means, a clustering algorithm that is most often fit using an expectation-maximization approach.

Here we'll consider an extension of this idea which is suitable for both **clustering** and **density estimation**.

For example, imagine we have some one-dimensional data in a particular distribution:

In [2]:

```
np.random.seed(2)
x = np.concatenate([np.random.normal(0, 2, 2000),
                    np.random.normal(5, 5, 2000),
                    np.random.normal(3, 0.5, 600)])
plt.hist(x, 80, normed=True)
plt.xlim(-10, 20);
```

Gaussian mixture models will allow us to approximate this density:

In [3]:

```
from sklearn.mixture import GMM
clf = GMM(4, n_iter=500, random_state=3).fit(x)
xpdf = np.linspace(-10, 20, 1000)
density = np.exp(clf.score(xpdf))
plt.hist(x, 80, normed=True, alpha=0.5)
plt.plot(xpdf, density, '-r')
plt.xlim(-10, 20);
```
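
Note that the `GMM` class used here has since been deprecated and removed from scikit-learn. If you're running a recent version, a roughly equivalent sketch using `sklearn.mixture.GaussianMixture` (and matplotlib's `density=True` in place of the removed `normed=True`) would be:

```
from sklearn.mixture import GaussianMixture

# GaussianMixture expects a 2-D array of shape (n_samples, n_features)
clf = GaussianMixture(n_components=4, max_iter=500, random_state=3).fit(x[:, None])
xpdf = np.linspace(-10, 20, 1000)
# score_samples returns the per-sample log-likelihood
density = np.exp(clf.score_samples(xpdf[:, None]))
plt.hist(x, 80, density=True, alpha=0.5)
plt.plot(xpdf, density, '-r')
plt.xlim(-10, 20);
```

The attribute names also changed slightly in the newer API: `covars_` became `covariances_`, while `means_` and `weights_` are unchanged.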

Note that this density is fit using a **mixture of Gaussians**, which we can examine by looking at the `means_`, `covars_`, and `weights_` attributes:

In [4]:

```
clf.means_
```

Out[4]:

array([[ 4.5601338 ], [ 0.0861325 ], [ 3.01890623], [ 6.87627234]])

In [5]:

```
clf.covars_
```

Out[5]:

array([[ 30.34748627], [ 4.30521863], [ 0.19750802], [ 14.78230876]])

In [6]:

```
clf.weights_
```

Out[6]:

array([ 0.27613209, 0.48308463, 0.11612442, 0.12465886])

In [7]:

```
plt.hist(x, 80, normed=True, alpha=0.3)
plt.plot(xpdf, density, '-r')
for i in range(clf.n_components):
    pdf = clf.weights_[i] * stats.norm(clf.means_[i, 0],
                                       np.sqrt(clf.covars_[i, 0])).pdf(xpdf)
    plt.fill(xpdf, pdf, facecolor='gray',
             edgecolor='none', alpha=0.3)
plt.xlim(-10, 20);
```

These component Gaussians are fit using an expectation-maximization approach, much as in K-Means, except that rather than an explicit cluster assignment, the **posterior probability** of each point is used to compute the weighted mean and covariance.
Somewhat surprisingly, this algorithm **provably** converges to the optimum (though the optimum is not necessarily global).
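
To make the E-step/M-step idea concrete, here is a minimal, illustration-only sketch of the EM updates for a one-dimensional Gaussian mixture (the helper name `em_gmm_1d` is hypothetical, and this is not how scikit-learn implements it), showing how the posterior probabilities ("responsibilities") enter the weighted updates:

```
def em_gmm_1d(x, n_components, n_iter=100, seed=0):
    """Toy EM for a 1-D Gaussian mixture (illustration only)."""
    rng = np.random.RandomState(seed)
    means = rng.choice(x, n_components)              # initialize means from random data points
    variances = np.ones(n_components)                # unit variances
    weights = np.ones(n_components) / n_components   # uniform mixing weights
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) of each component for each point
        resp = (weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                / np.sqrt(2 * np.pi * variances))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means, variances, and mixing weights
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
    return weights, means, variances

# e.g. weights, means, variances = em_gmm_1d(x, 4)
```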

Given a model, we can use one of several criteria to evaluate how well it fits the data. For example, there are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):
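
For reference, if $L$ is the maximized likelihood of the model, $k$ the number of free parameters, and $n$ the number of samples, these criteria are

$$
\mathrm{AIC} = 2k - 2\ln L,
\qquad
\mathrm{BIC} = k \ln n - 2\ln L,
$$

and lower values of either indicate a better trade-off between goodness of fit and model complexity.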

In [8]:

```
print(clf.bic(x))
print(clf.aic(x))
```

25696.4735953
25625.7016679

Let's take a look at these as a function of the number of Gaussian components:

In [9]:

```
n_estimators = np.arange(1, 10)
clfs = [GMM(n, n_iter=1000).fit(x) for n in n_estimators]
bics = [clf.bic(x) for clf in clfs]
aics = [clf.aic(x) for clf in clfs]
plt.plot(n_estimators, bics, label='BIC')
plt.plot(n_estimators, aics, label='AIC')
plt.legend();
```

It appears that for both the AIC and BIC, 4 components is preferred.
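
If we want to select this programmatically rather than by eye, we can simply take the minimizer of either criterion (a small sketch using the `n_estimators`, `aics`, and `bics` lists defined above):

```
print("best n (AIC):", n_estimators[np.argmin(aics)])
print("best n (BIC):", n_estimators[np.argmin(bics)])
```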

GMM is what's known as a **Generative Model**: it's a probabilistic model from which a dataset can be generated.
One thing that generative models can be useful for is **outlier detection**: we can simply evaluate the likelihood of each point under the generative model; points with a suitably low likelihood (where "suitable" is up to your own bias/variance preference) can be labeled outliers.

Let's take a look at this by defining a new dataset with some outliers:

In [10]:

```
np.random.seed(0)
# Add 20 outliers
true_outliers = np.sort(np.random.randint(0, len(x), 20))
y = x.copy()
y[true_outliers] += 50 * np.random.randn(20)
```

In [11]:

```
clf = GMM(4, n_iter=500, random_state=0).fit(y)
xpdf = np.linspace(-10, 20, 1000)
density_noise = np.exp(clf.score(xpdf))
plt.hist(y, 80, normed=True, alpha=0.5)
plt.plot(xpdf, density_noise, '-r')
#plt.xlim(-10, 20);
```

Now let's evaluate the log-likelihood of each point under the model, and plot these as a function of `y`:

In [12]:

```
log_likelihood = clf.score_samples(y)[0]
plt.plot(y, log_likelihood, '.k');
```

In [13]:

```
detected_outliers = np.where(log_likelihood < -9)[0]
print("true outliers:")
print(true_outliers)
print("\ndetected outliers:")
print(detected_outliers)
```

The algorithm misses a few of these points, which is to be expected (some of the "outliers" actually land in the middle of the distribution!).

Here are the outliers that were missed:

In [14]:

```
set(true_outliers) - set(detected_outliers)
```

Out[14]:

{99, 1033, 1701, 1871, 2222, 2599, 2607, 3264}

And here are the non-outliers which were spuriously labeled outliers:

In [15]:

```
set(detected_outliers) - set(true_outliers)
```

Out[15]:

{3067}

The other main density estimator that you might find useful is *Kernel Density Estimation*, which is available via `sklearn.neighbors.KernelDensity`. In some ways, this can be thought of as a generalization of GMM in which a Gaussian is placed at the location of *every* training point!

In [16]:

```
from sklearn.neighbors import KernelDensity
kde = KernelDensity(bandwidth=0.15).fit(x[:, None])
density_kde = np.exp(kde.score_samples(xpdf[:, None]))
plt.hist(x, 80, normed=True, alpha=0.5)
plt.plot(xpdf, density, '-b', label='GMM')
plt.plot(xpdf, density_kde, '-r', label='KDE')
plt.xlim(-10, 20)
plt.legend();
```

Both GMM and KDE are examples of **generative models** of the data: that is, the model tells us how more data can be created that fits the model.
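
To make the "generative" part concrete, we can draw new samples from a fitted mixture and compare them with the original data. Here is a small sketch using the deprecated `GMM` API from above, where `sample` returns an array of draws (with the newer `GaussianMixture`, `sample` instead returns a `(samples, labels)` tuple):

```
# refit on the clean data and draw new points from the generative model
clf = GMM(4, n_iter=500, random_state=3).fit(x)
x_new = clf.sample(2000)
plt.hist(x, 80, normed=True, alpha=0.3, label='original data')
plt.hist(x_new.ravel(), 80, normed=True, alpha=0.3, label='generated data')
plt.xlim(-10, 20)
plt.legend();
```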