Last updated: 16 Feb 2023
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that exponentially speeds up the experiment cycle and makes you more productive.
Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with only a few, making experiments fast and efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.
The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen Data Scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.
PyCaret is tested and supported on the following 64-bit systems:
You can install PyCaret with Python's pip package manager:
pip install pycaret
PyCaret's default installation does not install all of the optional dependencies automatically. To get them, install the full version:
pip install pycaret[full]
or, depending on your use case, you may install one of the following variants:
pip install pycaret[analysis]
pip install pycaret[models]
pip install pycaret[tuner]
pip install pycaret[mlops]
pip install pycaret[parallel]
pip install pycaret[test]
# check installed version
import pycaret
pycaret.__version__
'3.0.0'
PyCaret's Clustering Module is an unsupervised machine learning module that performs the task of grouping a set of objects in such a way that objects in the same group (also known as a cluster) are more similar to each other than to those in other groups.
It provides several pre-processing features that prepare the data for modeling through the setup function. It has over 10 ready-to-use algorithms and several plots to analyze the performance of trained models.
A typical workflow in PyCaret's unsupervised module consists of six steps, performed in order; a minimal sketch is shown below, and the rest of this section walks through the steps one by one.
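The following is a rough end-to-end sketch rather than a definitive recipe: it uses the jewellery dataset and the k-means model demonstrated later in this section, and adds predict_model and save_model (standard PyCaret clustering functions) for inference and persistence, which this quickstart does not cover in detail.

# a minimal end-to-end sketch of a typical clustering workflow;
# the dataset and model choice are simply the examples used in this quickstart
from pycaret.datasets import get_data
from pycaret.clustering import (setup, create_model, assign_model,
                                plot_model, predict_model, save_model)

data = get_data('jewellery')

s = setup(data, session_id = 123)            # 1. initialize environment and pipeline
kmeans = create_model('kmeans')              # 2. train a clustering model
labeled = assign_model(kmeans)               # 3. assign cluster labels to the training data
plot_model(kmeans, plot = 'cluster')         # 4. analyze the trained model
preds = predict_model(kmeans, data = data)   # 5. label new data (here: the same data)
save_model(kmeans, 'kmeans_pipeline')        # 6. save the model and pipeline to disk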
# loading sample dataset from pycaret dataset module
from pycaret.datasets import get_data
data = get_data('jewellery')
|   | Age | Income | SpendingScore | Savings |
|---|-----|--------|---------------|---------|
| 0 | 58 | 77769 | 0.791329 | 6559.829923 |
| 1 | 59 | 81799 | 0.791082 | 5417.661426 |
| 2 | 62 | 74751 | 0.702657 | 9258.992965 |
| 3 | 59 | 74373 | 0.765680 | 7346.334504 |
| 4 | 87 | 17760 | 0.348778 | 16869.507130 |
This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function in PyCaret. It has only one required parameter, data; all other parameters are optional.
# import pycaret clustering and init setup
from pycaret.clustering import *
s = setup(data, session_id = 123)
|    | Description | Value |
|----|-------------|-------|
| 0  | Session id | 123 |
| 1  | Original data shape | (505, 4) |
| 2  | Transformed data shape | (505, 4) |
| 3  | Numeric features | 4 |
| 4  | Preprocess | True |
| 5  | Imputation type | simple |
| 6  | Numeric imputation | mean |
| 7  | Categorical imputation | mode |
| 8  | CPU Jobs | -1 |
| 9  | Use GPU | False |
| 10 | Log Experiment | False |
| 11 | Experiment Name | cluster-default-name |
| 12 | USI | 3c6c |
Once setup has been executed successfully, it displays an information grid containing experiment-level information. If no session_id is passed, a random number is automatically generated and distributed to all functions for reproducibility.

PyCaret has two sets of APIs that you can work with: (1) the Functional API (as seen above) and (2) the Object-Oriented API. With the Object-Oriented API, instead of executing functions directly, you import a class and call its methods.
# import ClusteringExperiment and init the class
from pycaret.clustering import ClusteringExperiment
exp = ClusteringExperiment()
# check the type of exp
type(exp)
pycaret.clustering.oop.ClusteringExperiment
# init setup on exp
exp.setup(data, session_id = 123)
|    | Description | Value |
|----|-------------|-------|
| 0  | Session id | 123 |
| 1  | Original data shape | (505, 4) |
| 2  | Transformed data shape | (505, 4) |
| 3  | Numeric features | 4 |
| 4  | Preprocess | True |
| 5  | Imputation type | simple |
| 6  | Numeric imputation | mean |
| 7  | Categorical imputation | mode |
| 8  | CPU Jobs | -1 |
| 9  | Use GPU | False |
| 10 | Log Experiment | False |
| 11 | Experiment Name | cluster-default-name |
| 12 | USI | 6c6d |
<pycaret.clustering.oop.ClusteringExperiment at 0x16a6fe44e50>
You can use either of the two approaches, Functional or OOP, and even switch back and forth between the two sets of APIs. The choice of method does not impact the results and has been tested for consistency.
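For example, the same model can be trained through either interface (create_model is introduced in the next step; data and exp are the objects created above). With the same session_id, both calls are expected to produce identical results.

# functional API: uses the experiment initialized by setup(data, session_id = 123)
kmeans_functional = create_model('kmeans')

# OOP API: the same call as a method on the ClusteringExperiment instance
kmeans_oop = exp.create_model('kmeans')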
This function trains and evaluates the performance of a given model. The metrics evaluated can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions. All available models can be listed using the models function.
# train kmeans model
kmeans = create_model('kmeans')
|   | Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness |
|---|------------|-------------------|----------------|-------------|------------|--------------|
| 0 | 0.7207 | 5011.8115 | 0.4114 | 0 | 0 | 0 |
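The metrics in the grid above can also be listed and customized programmatically. A short sketch using get_metrics follows; the remove_metric call is left commented out because the metric ID used there is an assumption, so check the frame returned by get_metrics for the actual IDs before using it.

# list all metrics PyCaret evaluates for clustering models,
# including their IDs, display names and underlying score functions
all_metrics = get_metrics()
all_metrics

# metrics can be dropped with remove_metric (and custom ones registered with
# add_metric) using the ID or name shown in the frame above, for example:
# remove_metric('hs')   # ID assumed for illustration only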
# to check all the available models
models()
| ID | Name | Reference |
|----|------|-----------|
| kmeans | K-Means Clustering | sklearn.cluster._kmeans.KMeans |
| ap | Affinity Propagation | sklearn.cluster._affinity_propagation.Affinity... |
| meanshift | Mean Shift Clustering | sklearn.cluster._mean_shift.MeanShift |
| sc | Spectral Clustering | sklearn.cluster._spectral.SpectralClustering |
| hclust | Agglomerative Clustering | sklearn.cluster._agglomerative.AgglomerativeCl... |
| dbscan | Density-Based Spatial Clustering | sklearn.cluster._dbscan.DBSCAN |
| optics | OPTICS Clustering | sklearn.cluster._optics.OPTICS |
| birch | Birch Clustering | sklearn.cluster._birch.Birch |
| kmodes | K-Modes Clustering | kmodes.kmodes.KModes |
# train meanshift model
meanshift = create_model('meanshift')
|   | Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness |
|---|------------|-------------------|----------------|-------------|------------|--------------|
| 0 | 0.7393 | 3567.5370 | 0.3435 | 0 | 0 | 0 |
This function assigns cluster labels to the training data, given a trained model.
kmeans_cluster = assign_model(kmeans)
kmeans_cluster
|     | Age | Income | SpendingScore | Savings | Cluster |
|-----|-----|--------|---------------|---------|---------|
| 0   | 58 | 77769 | 0.791329 | 6559.830078 | Cluster 2 |
| 1   | 59 | 81799 | 0.791082 | 5417.661621 | Cluster 2 |
| 2   | 62 | 74751 | 0.702657 | 9258.993164 | Cluster 2 |
| 3   | 59 | 74373 | 0.765680 | 7346.334473 | Cluster 2 |
| 4   | 87 | 17760 | 0.348778 | 16869.507812 | Cluster 0 |
| ... | ... | ... | ... | ... | ... |
| 500 | 28 | 101206 | 0.387441 | 14936.775391 | Cluster 1 |
| 501 | 93 | 19934 | 0.203140 | 17969.693359 | Cluster 0 |
| 502 | 90 | 35297 | 0.355149 | 16091.402344 | Cluster 0 |
| 503 | 91 | 20681 | 0.354679 | 18401.087891 | Cluster 0 |
| 504 | 89 | 30267 | 0.289310 | 14386.351562 | Cluster 0 |
505 rows × 5 columns
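The returned object is a regular pandas DataFrame, so the assigned labels can be inspected directly; for example, counting the observations per cluster using the Cluster column shown above:

# number of observations assigned to each cluster
kmeans_cluster['Cluster'].value_counts()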
You can use the plot_model function to analyze the performance of a trained model. Re-training the model may be required for certain plots.
# plot pca cluster plot
plot_model(kmeans, plot = 'cluster')
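Besides the PCA cluster plot, plot_model supports several other plot types for clustering models, such as the elbow and silhouette plots sketched below (plot availability can vary by estimator and PyCaret version):

# elbow plot to help evaluate the number of clusters
plot_model(kmeans, plot = 'elbow')

# silhouette plot of the trained model
plot_model(kmeans, plot = 'silhouette')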