Clustering Tutorial (CLU101) - Level Beginner

Created using: PyCaret 2.2
Date Updated: November 25, 2020

1.0 Tutorial Objective

Welcome to the Clustering Tutorial (CLU101) - Level Beginner. This tutorial assumes that you are new to PyCaret and are looking to get started with clustering using the pycaret.clustering module.

In this tutorial we will learn:

  • Getting Data: How to import data from PyCaret repository
  • Setting up Environment: How to set up an experiment in PyCaret and get started with building clustering models
  • Create Model: How to create a model and assign cluster labels to the original dataset for analysis
  • Plot Model: How to analyze model performance using various plots
  • Predict Model: How to assign cluster labels to new and unseen datasets based on a trained model
  • Save / Load Model: How to save / load model for future use

Read Time : Approx. 25 Minutes

1.1 Installing PyCaret

The first step to get started with PyCaret is to install pycaret. Installation is easy and will only take a few minutes. Follow the instructions below:

Installing PyCaret in Local Jupyter Notebook

pip install pycaret

Installing PyCaret on Google Colab or Azure Notebooks

!pip install pycaret

1.2 Pre-Requisites

  • Python 3.6 or greater
  • PyCaret 2.0 or greater
  • Internet connection to load data from pycaret's repository
  • Basic Knowledge of Clustering

1.3 For Google Colab users:

If you are running this notebook on Google Colab, run the following code at the top of your notebook to display interactive visuals.

from pycaret.utils import enable_colab
enable_colab()

2.0 What is Clustering?

Clustering is the task of grouping a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups. It is an exploratory data mining activity and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. Some common real-life use cases of clustering are:

  • Customer segmentation based on purchase history or interests to design targeted marketing campaigns.
  • Cluster documents into multiple categories based on tags, topics, and the content of the document.
  • Analysis of outcome in social / life science experiments to find natural groupings and patterns in the data.
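The grouping idea described above can be sketched in a few lines of plain Python. The toy 1-D k-means below (data and names invented for illustration, not PyCaret code) alternates between assigning points to the nearest centre and moving each centre to the mean of its members:

```python
# A minimal, self-contained sketch of the idea behind k-means clustering:
# points are grouped so that each point sits closer to its own cluster
# centre than to the other centres. Illustrative only -- PyCaret wraps
# scikit-learn's far more robust implementation.

def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: alternate assignment and centre-update steps."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each centre moves to the mean of its members.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], centers=[0.0, 5.0])
print(centers)  # two centres, one near 1.0 and one near 9.5
```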

Learn More about Clustering

3.0 Overview of Clustering Module in PyCaret

PyCaret's clustering module (pycaret.clustering) is an unsupervised machine learning module which performs the task of grouping a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups.

PyCaret's clustering module provides several pre-processing features that can be configured when initializing the environment through the setup() function. It has 9 ready-to-use algorithms and several plots to analyze the results. PyCaret's clustering module also implements a unique function called tune_model() that allows you to tune the hyperparameters of a clustering model to optimize a supervised learning objective such as AUC for classification or R2 for regression.

4.0 Dataset for the Tutorial

For this tutorial we will use a dataset from UCI called Mice Protein Expression. The dataset consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse. Click Here to read more about the dataset.

Dataset Acknowledgement:

Clara Higuera Department of Software Engineering and Artificial Intelligence, Faculty of Informatics and the Department of Biochemistry and Molecular Biology, Faculty of Chemistry, University Complutense, Madrid, Spain. Email: [email protected]

Katheleen J. Gardiner, creator and owner of the protein expression data, is currently with the Linda Crnic Institute for Down Syndrome, Department of Pediatrics, Department of Biochemistry and Molecular Genetics, Human Medical Genetics and Genomics, and Neuroscience Programs, University of Colorado, School of Medicine, Aurora, Colorado, USA. Email: [email protected]

Krzysztof J. Cios is currently with the Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA, and IITiS Polish Academy of Sciences, Poland. Email: [email protected]

The original dataset and data dictionary can be found here.

5.0 Getting the Data

You can download the data from the original source found here and load it using pandas (Learn How), or you can use PyCaret's data repository to load it using the get_data() function (this requires an internet connection).

In [1]:
from pycaret.datasets import get_data
dataset = get_data('mice')
MouseID DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N ... pCFOS_N SYP_N H3AcK18_N EGR1_N H3MeK4_N CaNA_N Genotype Treatment Behavior class
0 309_1 0.503644 0.747193 0.430175 2.816329 5.990152 0.218830 0.177565 2.373744 0.232224 ... 0.108336 0.427099 0.114783 0.131790 0.128186 1.675652 Control Memantine C/S c-CS-m
1 309_2 0.514617 0.689064 0.411770 2.789514 5.685038 0.211636 0.172817 2.292150 0.226972 ... 0.104315 0.441581 0.111974 0.135103 0.131119 1.743610 Control Memantine C/S c-CS-m
2 309_3 0.509183 0.730247 0.418309 2.687201 5.622059 0.209011 0.175722 2.283337 0.230247 ... 0.106219 0.435777 0.111883 0.133362 0.127431 1.926427 Control Memantine C/S c-CS-m
3 309_4 0.442107 0.617076 0.358626 2.466947 4.979503 0.222886 0.176463 2.152301 0.207004 ... 0.111262 0.391691 0.130405 0.147444 0.146901 1.700563 Control Memantine C/S c-CS-m
4 309_5 0.434940 0.617430 0.358802 2.365785 4.718679 0.213106 0.173627 2.134014 0.192158 ... 0.110694 0.434154 0.118481 0.140314 0.148380 1.839730 Control Memantine C/S c-CS-m

5 rows × 82 columns

In [2]:
#check the shape of data
dataset.shape
(1080, 82)

In order to demonstrate the predict_model() function on unseen data, a sample of 5% (54 records) has been withheld from the original dataset to be used for predictions at the end of the experiment. This should not be confused with a train/test split, as this particular split is performed to simulate a real-life scenario. Another way to think about this is that these 54 samples were not available at the time the experiment was performed.

In [3]:
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)

6.0 Setting up Environment in PyCaret

The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. Since clustering is unsupervised, it takes only one mandatory parameter: a pandas dataframe. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).

When setup() is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, enter can be pressed to continue, or quit can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment. These tasks are performed differently for each data type, which means it is very important for them to be correctly configured.

In later tutorials we will learn how to override PyCaret's inferred data types using the numeric_features and categorical_features parameters in setup().
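To see why such an override can be necessary, consider a column of integer category codes: any type-inference routine will read it as numeric. The column names below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical illustration (toy column names) of why type inference can
# need a manual override: integer-coded categories look numeric.
df = pd.DataFrame({
    'age': [23, 31, 45],          # genuinely numeric
    'region_code': [1, 2, 1],     # really categorical, stored as integers
})
print(df.dtypes)  # both columns are reported as int64

# In PyCaret you would correct this when calling setup(), e.g.:
# setup(df, categorical_features=['region_code'], numeric_features=['age'])
```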

In [4]:
from pycaret.clustering import *

exp_clu101 = setup(data, normalize = True, 
                   ignore_features = ['MouseID'],
                   session_id = 123)
Description Value
0 session_id 123
1 Original Data (1026, 82)
2 Missing Values True
3 Numeric Features 77
4 Categorical Features 4
5 Ordinal Features False
6 High Cardinality Features False
7 High Cardinality Method None
8 Transformed Data (1026, 91)
9 CPU Jobs -1
10 Use GPU False
11 Log Experiment False
12 Experiment Name cluster-default-name
13 USI 39ba
14 Imputation Type simple
15 Iterative Imputation Iteration None
16 Numeric Imputer mean
17 Iterative Imputation Numeric Model None
18 Categorical Imputer mode
19 Iterative Imputation Categorical Model None
20 Unknown Categoricals Handling least_frequent
21 Normalize True
22 Normalize Method zscore
23 Transformation False
24 Transformation Method None
25 PCA False
26 PCA Method None
27 PCA Components None
28 Ignore Low Variance False
29 Combine Rare Levels False
30 Rare Level Threshold None
31 Numeric Binning False
32 Remove Outliers False
33 Outliers Threshold None
34 Remove Multicollinearity False
35 Multicollinearity Threshold None
36 Clustering False
37 Clustering Iteration None
38 Polynomial Features False
39 Polynomial Degree None
40 Trignometry Features False
41 Polynomial Threshold None
42 Group Features False
43 Feature Selection False
44 Features Selection Threshold None
45 Feature Interaction False
46 Feature Ratio False
47 Interaction Threshold None

Once the setup has been successfully executed, it prints an information grid which contains several important pieces of information. Most of the information is related to the pre-processing pipeline that is constructed when setup() is executed. The majority of these features are out of scope for the purposes of this tutorial; however, a few important things to note at this stage include:

  • session_id : A pseudo-random number distributed as a seed in all functions for later reproducibility. If no session_id is passed, a random number is automatically generated and distributed to all functions. In this experiment, the session_id is set to 123 for later reproducibility.

  • Missing Values : When there are missing values in the original data this will show as True. Notice that Missing Values in the information grid above is True, as the data contains missing values which are automatically imputed using the mean for numeric features and the mode for categorical features (as shown in the grid). The method of imputation can be changed using the numeric_imputation and categorical_imputation parameters in setup().

  • Original Data : Displays the original shape of dataset. In this experiment (1026, 82) means 1026 samples and 82 features.

  • Transformed Data : Displays the shape of the transformed dataset. Notice that the shape of the original dataset (1026, 82) is transformed into (1026, 91). The number of features has increased due to encoding of categorical features in the dataset.

  • Numeric Features : The number of features inferred as numeric. In this dataset, 77 out of 82 features are inferred as numeric.

  • Categorical Features : The number of features inferred as categorical. In this dataset, 4 out of 82 features are inferred as categorical. Also notice that we ignored one categorical feature, MouseID, using the ignore_features parameter.

Notice how a few tasks that are imperative for modeling are automatically handled, such as missing value imputation and categorical encoding. Most of the parameters in setup() are optional and used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial, but as you progress to the intermediate and expert levels, we will cover them in much greater detail.
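As one concrete example of what the pipeline handles for you, the default mean imputation for numeric features is conceptually equivalent to the following pandas sketch (toy values, not PyCaret's actual pipeline code):

```python
import numpy as np
import pandas as pd

# Toy numeric column with a missing value, imputed with the column mean --
# the same idea PyCaret's simple numeric imputer applies automatically.
s = pd.Series([0.50, np.nan, 0.44, 0.46], name='protein_a')
filled = s.fillna(s.mean())  # mean is computed over the non-missing values
print(filled.isna().sum())   # 0 -- no missing values remain
```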

7.0 Create a Model

Training a clustering model in PyCaret is simple and similar to how you would create a model in the supervised learning modules. A clustering model is created using the create_model() function, which takes one mandatory parameter: the ID of the model you want to train. This function returns a trained model object and a few unsupervised metrics. See an example below:

In [5]:
kmeans = create_model('kmeans')
Silhouette Calinski-Harabasz Davies-Bouldin Homogeneity Rand Index Completeness
0 0.1187 137.5261 2.0715 0 0 0
In [6]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=-1, precompute_distances='deprecated',
       random_state=123, tol=0.0001, verbose=0)

We have created a kmeans model using create_model(). Notice that the n_clusters parameter is set to 4, which is the default when you do not pass a value to the num_clusters parameter. In the example below we will create a kmodes model with 6 clusters.

In [7]:
kmodes = create_model('kmodes', num_clusters = 6)
Silhouette Calinski-Harabasz Davies-Bouldin Homogeneity Rand Index Completeness
0 0.0262 47.0119 3.7958 0 0 0
In [8]:
KModes(cat_dissim=<function matching_dissim at 0x000001F326705EA0>, init='Cao',
       max_iter=100, n_clusters=6, n_init=1, n_jobs=-1, random_state=123,
       verbose=0)

Simply replacing kmeans with kmodes inside create_model() has created a kmodes clustering model. There are 9 models available in the pycaret.clustering module. To see the complete list of models in the library, please see the docstring or use the models() function.

In [9]:
models()
Name Reference
kmeans K-Means Clustering sklearn.cluster._kmeans.KMeans
ap Affinity Propagation sklearn.cluster._affinity_propagation.Affinity...
meanshift Mean Shift Clustering sklearn.cluster._mean_shift.MeanShift
sc Spectral Clustering sklearn.cluster._spectral.SpectralClustering
hclust Agglomerative Clustering sklearn.cluster._agglomerative.AgglomerativeCl...
dbscan Density-Based Spatial Clustering sklearn.cluster._dbscan.DBSCAN
optics OPTICS Clustering sklearn.cluster._optics.OPTICS
birch Birch Clustering sklearn.cluster._birch.Birch
kmodes K-Modes Clustering kmodes.kmodes.KModes

8.0 Assign a Model

Now that we have created a model, we would like to assign the cluster labels to our dataset (the 1026 samples passed to setup()) to analyze the results. We will achieve this using the assign_model() function. See an example below:

In [10]:
kmean_results = assign_model(kmeans)
MouseID DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N ... SYP_N H3AcK18_N EGR1_N H3MeK4_N CaNA_N Genotype Treatment Behavior class Cluster
0 3501_12 0.344930 0.626194 0.383583 2.534561 4.097317 0.303547 0.222829 4.592769 0.239427 ... 0.455172 0.252700 0.218868 0.249187 1.139493 Ts65Dn Memantine S/C t-SC-m Cluster 3
1 3520_5 0.630001 0.839187 0.357777 2.651229 4.261675 0.253184 0.185257 3.816673 0.204940 ... 0.496423 0.155008 0.153219 NaN 1.642886 Control Memantine C/S c-CS-m Cluster 0
2 3414_13 0.555122 0.726229 0.278319 2.097249 2.897553 0.222222 0.174356 1.867880 0.203379 ... 0.344964 0.136109 0.155530 0.185484 1.657670 Ts65Dn Memantine C/S t-CS-m Cluster 2
3 3488_8 0.275849 0.430764 0.285166 2.265254 3.250091 0.189258 0.157837 2.917611 0.202594 ... 0.390880 0.127944 0.207671 0.175357 0.893598 Control Saline S/C c-SC-s Cluster 1
4 3501_7 0.304788 0.617299 0.335164 2.638236 4.876609 0.280590 0.199417 4.835421 0.236314 ... 0.470932 0.245277 0.202171 0.240372 0.795637 Ts65Dn Memantine S/C t-SC-m Cluster 3

5 rows × 83 columns

Notice that a new column called Cluster has been added to the original dataset. kmean_results also includes the MouseID feature that we excluded during setup(); it was not used by the model and is only appended back to the dataset by assign_model(). In the next section we will see how to analyze clustering results using plot_model().
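With labels attached, simple pandas summaries are a natural first analysis step. The miniature stand-in for kmean_results below is invented for illustration; the same calls work on the real result:

```python
import pandas as pd

# Miniature stand-in for the labelled output of assign_model().
kmean_results = pd.DataFrame({
    'Genotype': ['Control', 'Ts65Dn', 'Control', 'Ts65Dn'],
    'Cluster':  ['Cluster 0', 'Cluster 3', 'Cluster 0', 'Cluster 2'],
})

# How many samples landed in each cluster?
counts = kmean_results['Cluster'].value_counts()
print(counts)  # Cluster 0 appears twice, the others once

# Cross-tabulating clusters against a known label can reveal structure:
print(pd.crosstab(kmean_results['Cluster'], kmean_results['Genotype']))
```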

9.0 Plot a Model

The plot_model() function can be used to analyze different aspects of the clustering model. This function takes a trained model object and returns a plot. See examples below:

9.1 Cluster PCA Plot

In [11]:
plot_model(kmeans)