Created using: PyCaret 2.2
Date Updated: November 25, 2020
Welcome to the Clustering Tutorial (CLU101) - Level Beginner. This tutorial assumes that you are new to PyCaret and are looking to get started with clustering using the pycaret.clustering module.
In this tutorial we will learn how to get data, set up the PyCaret environment, create a clustering model, assign cluster labels, analyze results with plots, and use the model for predictions on unseen data.
Read Time: Approx. 25 Minutes
The first step to get started with PyCaret is to install pycaret. Installation is easy and will only take a few minutes. Follow the instructions below:
pip install pycaret
If you are running inside a notebook, prefix the command with an exclamation mark:
!pip install pycaret
If you are running this notebook on Google colab, run the following code at top of your notebook to display interactive visuals.
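In PyCaret 2.x this is done with the enable_colab utility (only needed when running on Colab):
# enable interactive display of PyCaret visuals inside Google Colab
from pycaret.utils import enable_colab
enable_colab()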
Clustering is the task of grouping a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups. It is an exploratory data mining activity, and a common technique for statistical data analysis used in many fields including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression and computer graphics. Common real-life use cases of clustering include customer segmentation, recommendation systems, image segmentation and anomaly detection.
PyCaret's clustering module (pycaret.clustering) is an unsupervised machine learning module which performs the task of grouping a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups.
PyCaret's clustering module provides several pre-processing features that can be configured when initializing the environment through the setup() function. It has nine ready-to-use algorithms and several plots to analyze the results. PyCaret's clustering module also implements a unique function called tune_model() that allows you to tune the hyperparameters of a clustering model to optimize a supervised learning objective such as AUC for classification or R2 for regression.
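As a preview, here is a minimal sketch of how tune_model() might be called once an experiment has been set up. The parameter names follow the PyCaret 2.x clustering API, and 'class' stands for a label column in your data; the call is not executed in this tutorial:
# sketch only: search over the number of clusters to optimize a
# supervised objective computed against the 'class' label column
tuned_kmeans = tune_model(model = 'kmeans', supervised_target = 'class')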
For this tutorial we will use a dataset from UCI called Mice Protein Expression. The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse. Click Here to read more about the dataset.
Clara Higuera Department of Software Engineering and Artificial Intelligence, Faculty of Informatics and the Department of Biochemistry and Molecular Biology, Faculty of Chemistry, University Complutense, Madrid, Spain. Email: clarahiguera@ucm.es
Katheleen J. Gardiner, creator and owner of the protein expression data, is currently with the Linda Crnic Institute for Down Syndrome, Department of Pediatrics, Department of Biochemistry and Molecular Genetics, Human Medical Genetics and Genomics, and Neuroscience Programs, University of Colorado, School of Medicine, Aurora, Colorado, USA. Email: katheleen.gardiner@ucdenver.edu
Krzysztof J. Cios is currently with the Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA, and IITiS Polish Academy of Sciences, Poland. Email: kcios@vcu.edu
The original dataset and data dictionary can be found here.
You can download the data from the original source found here and load it using pandas (Learn How), or you can use PyCaret's data repository to load the data using the get_data() function (this requires an internet connection).
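For reference, a minimal sketch of the pandas route (the file name below is hypothetical; point it at wherever you saved the download, and use pd.read_excel if your copy is an Excel file):
# hypothetical local path to a downloaded copy of the dataset
import pandas as pd
dataset = pd.read_csv('mice_protein_expression.csv')
This tutorial uses PyCaret's get_data() function instead: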
from pycaret.datasets import get_data
dataset = get_data('mice')
 | MouseID | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | ... | pCFOS_N | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | Genotype | Treatment | Behavior | class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 309_1 | 0.503644 | 0.747193 | 0.430175 | 2.816329 | 5.990152 | 0.218830 | 0.177565 | 2.373744 | 0.232224 | ... | 0.108336 | 0.427099 | 0.114783 | 0.131790 | 0.128186 | 1.675652 | Control | Memantine | C/S | c-CS-m |
1 | 309_2 | 0.514617 | 0.689064 | 0.411770 | 2.789514 | 5.685038 | 0.211636 | 0.172817 | 2.292150 | 0.226972 | ... | 0.104315 | 0.441581 | 0.111974 | 0.135103 | 0.131119 | 1.743610 | Control | Memantine | C/S | c-CS-m |
2 | 309_3 | 0.509183 | 0.730247 | 0.418309 | 2.687201 | 5.622059 | 0.209011 | 0.175722 | 2.283337 | 0.230247 | ... | 0.106219 | 0.435777 | 0.111883 | 0.133362 | 0.127431 | 1.926427 | Control | Memantine | C/S | c-CS-m |
3 | 309_4 | 0.442107 | 0.617076 | 0.358626 | 2.466947 | 4.979503 | 0.222886 | 0.176463 | 2.152301 | 0.207004 | ... | 0.111262 | 0.391691 | 0.130405 | 0.147444 | 0.146901 | 1.700563 | Control | Memantine | C/S | c-CS-m |
4 | 309_5 | 0.434940 | 0.617430 | 0.358802 | 2.365785 | 4.718679 | 0.213106 | 0.173627 | 2.134014 | 0.192158 | ... | 0.110694 | 0.434154 | 0.118481 | 0.140314 | 0.148380 | 1.839730 | Control | Memantine | C/S | c-CS-m |
5 rows × 82 columns
# check the shape of data
dataset.shape
(1080, 82)
In order to demonstrate the predict_model() function on unseen data, a sample of 5% (54 records) has been withheld from the original dataset to be used for predictions at the end of the experiment. This should not be confused with a train/test split; this particular split is performed to simulate a real-life scenario. Another way to think about this is that these 54 samples were not available at the time the experiment was performed.
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: a pandas dataframe. Since clustering is unsupervised, no target column is passed. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).
When setup() is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types should be inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, enter can be pressed to continue, or quit can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment. These tasks are performed differently for each data type, which means it is very important for them to be correctly configured.
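If you are running setup() non-interactively (for example in a script or CI job), the confirmation prompt can be skipped with the silent parameter available in PyCaret 2.x; only do this when you are confident the inferred types are correct:
# sketch: skip the interactive data-type confirmation prompt
exp = setup(data, silent = True)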
In later tutorials we will learn how to override PyCaret's inferred data types using the numeric_features and categorical_features parameters in setup(), as sketched below.
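A minimal sketch, using columns that exist in this dataset (the choice of columns is purely illustrative):
# sketch: force specific columns to a data type instead of relying on inference
exp = setup(data,
            normalize = True,
            numeric_features = ['pCAMKII_N'],     # treat as numeric
            categorical_features = ['Genotype'],  # treat as categorical
            session_id = 123)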
from pycaret.clustering import *
exp_clu101 = setup(data, normalize = True,
ignore_features = ['MouseID'],
session_id = 123)
 | Description | Value
---|---|---
0 | session_id | 123 |
1 | Original Data | (1026, 82) |
2 | Missing Values | True |
3 | Numeric Features | 77 |
4 | Categorical Features | 4 |
5 | Ordinal Features | False |
6 | High Cardinality Features | False |
7 | High Cardinality Method | None |
8 | Transformed Data | (1026, 91) |
9 | CPU Jobs | -1 |
10 | Use GPU | False |
11 | Log Experiment | False |
12 | Experiment Name | cluster-default-name |
13 | USI | 39ba |
14 | Imputation Type | simple |
15 | Iterative Imputation Iteration | None |
16 | Numeric Imputer | mean |
17 | Iterative Imputation Numeric Model | None |
18 | Categorical Imputer | mode |
19 | Iterative Imputation Categorical Model | None |
20 | Unknown Categoricals Handling | least_frequent |
21 | Normalize | True |
22 | Normalize Method | zscore |
23 | Transformation | False |
24 | Transformation Method | None |
25 | PCA | False |
26 | PCA Method | None |
27 | PCA Components | None |
28 | Ignore Low Variance | False |
29 | Combine Rare Levels | False |
30 | Rare Level Threshold | None |
31 | Numeric Binning | False |
32 | Remove Outliers | False |
33 | Outliers Threshold | None |
34 | Remove Multicollinearity | False |
35 | Multicollinearity Threshold | None |
36 | Clustering | False |
37 | Clustering Iteration | None |
38 | Polynomial Features | False |
39 | Polynomial Degree | None |
40 | Trignometry Features | False |
41 | Polynomial Threshold | None |
42 | Group Features | False |
43 | Feature Selection | False |
44 | Features Selection Threshold | None |
45 | Feature Interaction | False |
46 | Feature Ratio | False |
47 | Interaction Threshold | None |
Once the setup has been successfully executed it prints an information grid which contains several important pieces of information. Most of the information is related to the pre-processing pipeline which is constructed when setup() is executed. The majority of these features are out of scope for the purposes of this tutorial, however a few important things to note at this stage include:
session_id: a pseudo-random number distributed as a seed in all functions for later reproducibility. If no session_id is passed, a random number is automatically generated and distributed to all functions. In this experiment, the session_id is set as 123 for later reproducibility.
Missing Values: True, since the original data contains missing values that are imputed automatically.
Original Data: (1026, 82), the shape of the data passed to setup().
Transformed Data: (1026, 91), the shape of the dataset after pre-processing; categorical encoding has added columns.
Notice how a few tasks that are imperative to modeling, such as missing value imputation and categorical encoding, are handled automatically. Most of the parameters in setup() are optional and used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial, but as you progress to the intermediate and expert levels we will cover them in much greater detail.
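For example, here is a minimal sketch of overriding the default imputers. numeric_imputation and categorical_imputation are PyCaret 2.x setup() parameters, and their defaults ('mean' and 'mode') are the values shown in the grid above:
# sketch: impute numeric missing values with the median instead of the mean
exp = setup(data,
            normalize = True,
            ignore_features = ['MouseID'],
            numeric_imputation = 'median',
            session_id = 123)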
Training a clustering model in PyCaret is simple and similar to how you would create a model in the supervised learning modules. A clustering model is created using the create_model() function, which takes one mandatory parameter: the ID of the model you want to train. This function returns a trained model object and a few unsupervised metrics. See the example below:
kmeans = create_model('kmeans')
 | Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness
---|---|---|---|---|---|---
0 | 0.1187 | 137.5261 | 2.0715 | 0 | 0 | 0 |
print(kmeans)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=4, n_init=10, n_jobs=-1, precompute_distances='deprecated', random_state=123, tol=0.0001, verbose=0)
We have created a kmeans model using create_model(). Notice that the n_clusters parameter is set to 4, which is the default when you do not pass a value to the num_clusters parameter. In the example below we will create a kmodes model with 6 clusters.
kmodes = create_model('kmodes', num_clusters = 6)
 | Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness
---|---|---|---|---|---|---
0 | 0.0262 | 47.0119 | 3.7958 | 0 | 0 | 0 |
print(kmodes)
KModes(cat_dissim=<function matching_dissim at 0x000001F326705EA0>, init='Cao', max_iter=100, n_clusters=6, n_init=1, n_jobs=-1, random_state=123, verbose=0)
Simply replacing kmeans with kmodes inside create_model() has created a kmodes clustering model. There are 9 models available in the pycaret.clustering module. To see the complete list of models in the library, please see the docstring or use the models() function.
models()
ID | Name | Reference
---|---|---
kmeans | K-Means Clustering | sklearn.cluster._kmeans.KMeans |
ap | Affinity Propagation | sklearn.cluster._affinity_propagation.Affinity... |
meanshift | Mean Shift Clustering | sklearn.cluster._mean_shift.MeanShift |
sc | Spectral Clustering | sklearn.cluster._spectral.SpectralClustering |
hclust | Agglomerative Clustering | sklearn.cluster._agglomerative.AgglomerativeCl... |
dbscan | Density-Based Spatial Clustering | sklearn.cluster._dbscan.DBSCAN |
optics | OPTICS Clustering | sklearn.cluster._optics.OPTICS |
birch | Birch Clustering | sklearn.cluster._birch.Birch |
kmodes | K-Modes Clustering | kmodes.kmodes.KModes |
Now that we have created a model, we would like to assign the cluster labels to our dataset (the 1026 samples used for modeling) to analyze the results. We will achieve this by using the assign_model() function. See the example below:
kmean_results = assign_model(kmeans)
kmean_results.head()
 | MouseID | DYRK1A_N | ITSN1_N | BDNF_N | NR1_N | NR2A_N | pAKT_N | pBRAF_N | pCAMKII_N | pCREB_N | ... | SYP_N | H3AcK18_N | EGR1_N | H3MeK4_N | CaNA_N | Genotype | Treatment | Behavior | class | Cluster
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3501_12 | 0.344930 | 0.626194 | 0.383583 | 2.534561 | 4.097317 | 0.303547 | 0.222829 | 4.592769 | 0.239427 | ... | 0.455172 | 0.252700 | 0.218868 | 0.249187 | 1.139493 | Ts65Dn | Memantine | S/C | t-SC-m | Cluster 3 |
1 | 3520_5 | 0.630001 | 0.839187 | 0.357777 | 2.651229 | 4.261675 | 0.253184 | 0.185257 | 3.816673 | 0.204940 | ... | 0.496423 | 0.155008 | 0.153219 | NaN | 1.642886 | Control | Memantine | C/S | c-CS-m | Cluster 0 |
2 | 3414_13 | 0.555122 | 0.726229 | 0.278319 | 2.097249 | 2.897553 | 0.222222 | 0.174356 | 1.867880 | 0.203379 | ... | 0.344964 | 0.136109 | 0.155530 | 0.185484 | 1.657670 | Ts65Dn | Memantine | C/S | t-CS-m | Cluster 2 |
3 | 3488_8 | 0.275849 | 0.430764 | 0.285166 | 2.265254 | 3.250091 | 0.189258 | 0.157837 | 2.917611 | 0.202594 | ... | 0.390880 | 0.127944 | 0.207671 | 0.175357 | 0.893598 | Control | Saline | S/C | c-SC-s | Cluster 1 |
4 | 3501_7 | 0.304788 | 0.617299 | 0.335164 | 2.638236 | 4.876609 | 0.280590 | 0.199417 | 4.835421 | 0.236314 | ... | 0.470932 | 0.245277 | 0.202171 | 0.240372 | 0.795637 | Ts65Dn | Memantine | S/C | t-SC-m | Cluster 3 |
5 rows × 83 columns
Notice that a new column called Cluster has been added to the original dataset. kmean_results also includes the MouseID column even though we ignored it during setup(); it was not used by the model and is only appended back to the dataset by assign_model(). In the next section we will see how to analyze the results of clustering using plot_model().
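Before plotting, a quick sanity check you might run on the assigned labels (this is plain pandas, not a PyCaret API):
# count how many samples were assigned to each cluster
kmean_results['Cluster'].value_counts()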
The plot_model() function can be used to analyze different aspects of the clustering model. This function takes a trained model object and returns a plot. See the examples below:
plot_model(kmeans)
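By default this renders the cluster PCA plot. Other views can be requested through the plot parameter; for example, the elbow plot ('elbow' is one of the plot options in the PyCaret 2.x clustering module):
# elbow plot to help choose an appropriate number of clusters
plot_model(kmeans, plot = 'elbow')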