Sebastian Raschka, 2016
mlxtend, a library of extension and helper modules for Python's data analysis and machine learning libraries
%load_ext watermark
%watermark -a 'Sebastian Raschka' -u -d -v -p matplotlib,numpy,scipy,mlxtend
Sebastian Raschka last updated: 2016-06-26 CPython 3.5.1 IPython 4.2.0 matplotlib 1.5.1 numpy 1.11.0 scipy 0.17.1 mlxtend 0.4.2.dev0
%matplotlib inline
An implementation of k-means clustering.
from mlxtend.cluster import Kmeans
Clustering falls into the category of unsupervised learning, a subfield of machine learning where the ground truth labels are not available to us in real-world applications. In clustering, our goal is to group samples by similarity (in k-means: Euclidean distance).
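As a quick refresher, the Euclidean distance between two feature vectors is the square root of the summed squared feature-wise differences. For example:

```python
import numpy as np

# Two sample points in 2-D feature space
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Squared differences per feature, summed, then square-rooted
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0
```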
The k-means algorithm can be summarized by the following four steps:

1. Randomly pick k centroids from the sample points as initial cluster centers.
2. Assign each sample to its nearest centroid.
3. Move each centroid to the center (mean) of the samples that were assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments do not change, a convergence tolerance is reached, or the maximum number of iterations is reached.
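For illustration, this procedure can be sketched in plain NumPy. Note this is a simplified, hypothetical helper, not the mlxtend implementation (it ignores edge cases such as empty clusters):

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=10, tol=1e-5, seed=None):
    """Minimal k-means sketch: random initial centroids, then alternate
    nearest-centroid assignment and centroid updates until convergence."""
    rng = np.random.RandomState(seed)
    # Step 1: pick k random samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid
        # (pairwise distances: shape [n_samples, k])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (no empty-cluster handling in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 4: stop early once the centroids stop moving
        if np.allclose(centroids, new_centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, labels
```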
import matplotlib.pyplot as plt
from mlxtend.data import three_blobs_data
X, y = three_blobs_data()
plt.scatter(X[:, 0], X[:, 1], c='white')
plt.show()
from mlxtend.cluster import Kmeans
km = Kmeans(k=3,
max_iter=50,
random_seed=1,
print_progress=3)
km.fit(X)
print('Iterations until convergence:', km.iterations_)
print('Final centroids:\n', km.centroids_)
Iteration: 2/50 | Elapsed: 00:00:00 | ETA: 00:00:00
Iterations until convergence: 2
Final centroids:
 [[-1.5947298   2.92236966]
 [ 2.06521743  0.96137409]
 [ 0.9329651   4.35420713]]
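Prediction assigns each sample to its nearest centroid. Equivalently, given the fitted centroids above, the same assignment can be computed directly in NumPy (using two made-up query points for illustration):

```python
import numpy as np

# Centroids from the fit above
centroids = np.array([[-1.5947298, 2.92236966],
                      [2.06521743, 0.96137409],
                      [0.9329651, 4.35420713]])

# Two hypothetical new samples
X_new = np.array([[2.0, 1.0], [-1.5, 3.0]])

# Distance from every sample to every centroid, then pick the closest
dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # [1 0]
```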
y_clust = km.predict(X)
plt.scatter(X[y_clust == 0, 0],
X[y_clust == 0, 1],
s=50,
c='lightgreen',
marker='s',
label='cluster 1')
plt.scatter(X[y_clust == 1, 0],
X[y_clust == 1, 1],
s=50,
c='orange',
marker='o',
label='cluster 2')
plt.scatter(X[y_clust == 2, 0],
X[y_clust == 2, 1],
s=50,
c='lightblue',
marker='v',
label='cluster 3')
plt.scatter(km.centroids_[:,0],
km.centroids_[:,1],
s=250,
marker='*',
c='red',
label='centroids')
plt.legend(loc='lower left',
scatterpoints=1)
plt.grid()
plt.show()
with open('../../api_modules/mlxtend.cluster/Kmeans.md', 'r') as f:
print(f.read())
## Kmeans

*Kmeans(k, max_iter=10, convergence_tolerance=1e-05, random_seed=None, print_progress=0)*

K-means clustering class.

Added in 0.4.1dev

**Parameters**

- `k` : int

    Number of clusters

- `max_iter` : int (default: 10)

    Number of iterations during cluster assignment.
    Cluster re-assignment stops automatically when the algorithm has converged.

- `convergence_tolerance` : float (default: 1e-05)

    Compares current centroids with centroids of the previous iteration
    using the given tolerance (a small, positive float) to determine
    if the algorithm converged early.

- `random_seed` : int (default: None)

    Set random state for the initial centroid assignment.

- `print_progress` : int (default: 0)

    Prints progress in fitting to stderr.
    0: No output
    1: Iterations elapsed
    2: 1 plus time elapsed
    3: 2 plus estimated time until completion

**Attributes**

- `centroids_` : 2d-array, shape={k, n_features}

    Feature values of the k cluster centroids.

- `custers_` : dictionary

    The cluster assignments stored as a Python dictionary;
    the dictionary keys denote the cluster indices and the items are
    Python lists of the sample indices that were assigned to each cluster.

- `iterations_` : int

    Number of iterations until convergence.

### Methods

<hr>

*fit(X, init_params=True)*

Learn model from training data.

**Parameters**

- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Training vectors, where n_samples is the number of samples and
    n_features is the number of features.

- `init_params` : bool (default: True)

    Re-initializes model parameters prior to fitting.
    Set False to continue training with weights from
    a previous model fitting.

**Returns**

- `self` : object

<hr>

*predict(X)*

Predict targets from X.

**Parameters**

- `X` : {array-like, sparse matrix}, shape = [n_samples, n_features]

    Training vectors, where n_samples is the number of samples and
    n_features is the number of features.

**Returns**

- `target_values` : array-like, shape = [n_samples]

    Predicted target values.
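The early stopping controlled by `convergence_tolerance` amounts to an element-wise comparison of successive centroid arrays. A sketch of such a check (an assumption about the general idea, not the library's exact code) is:

```python
import numpy as np

def converged(old_centroids, new_centroids, tol=1e-5):
    """True if no centroid moved by more than tol in any feature
    between two consecutive iterations."""
    return np.allclose(old_centroids, new_centroids, atol=tol)

old = np.array([[0.0, 0.0], [5.0, 5.0]])
new = np.array([[0.000001, 0.0], [5.0, 5.000001]])
print(converged(old, new))  # True: movement is below the tolerance
```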