PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.
In comparison with other open-source machine learning libraries, PyCaret is an alternative low-code library that can replace hundreds of lines of code with only a few lines, which makes experiments fast and efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and many more.
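As a rough illustration of what low-code means here, the entire core regression workflow boils down to a few calls. This is only a sketch: df is a hypothetical pandas DataFrame and 'target' a hypothetical numeric column name.

from pycaret.regression import setup, compare_models, predict_model
# df is a hypothetical DataFrame with a numeric 'target' column
s = setup(data=df, target='target', session_id=123)
best = compare_models()            # cross-validate the model library, keep the best
predictions = predict_model(best)  # score the automatically created hold-out set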
The design and simplicity of PyCaret is inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen data scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more expertise. Seasoned data scientists are often difficult to find and expensive to hire, but citizen data scientists can be an effective way to mitigate this gap and address data-related challenges in a business setting.
Official Website: https://www.pycaret.org
Documentation: https://pycaret.readthedocs.io/en/latest/
Installing PyCaret is very easy and takes only a few minutes. We strongly recommend using a virtual environment to avoid potential conflicts with other libraries. PyCaret's default installation is a slim version that installs only the hard dependencies listed in requirements.txt. To install the default version:
pip install pycaret
When you install the full version of pycaret, all the optional dependencies are also installed. To install the full version:
pip install pycaret[full]
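Since a virtual environment is strongly recommended, here is a minimal sketch using Python's built-in venv module (the environment name pycaret-env is arbitrary):

python -m venv pycaret-env
source pycaret-env/bin/activate    # on Windows: pycaret-env\Scripts\activate
pip install pycaret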
import pandas as pd
import numpy as np

# load the dataset (adjust the path to your local copy of AirPassengers.csv)
data = pd.read_csv('c:/users/moezs/pycaret-demo-mlflow/AirPassengers.csv')
data['Date'] = pd.to_datetime(data['Date'])
data.head()
| | Date | Passengers |
|---|---|---|
| 0 | 1949-01-01 | 112 |
| 1 | 1949-02-01 | 118 |
| 2 | 1949-03-01 | 132 |
| 3 | 1949-04-01 | 129 |
| 4 | 1949-05-01 | 121 |
import plotly.express as px

# 12-month moving average to visualize the trend
data['MA12'] = data['Passengers'].rolling(12).mean()
fig = px.line(data, x='Date', y=['Passengers', 'MA12'], template='plotly_dark')
fig.show()
# extract features from date
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year
data['Series'] = np.arange(1, len(data) + 1).astype('int64')
# drop date and MA12
data.drop(['Date', 'MA12'], axis=1, inplace=True)
# rearrange columns
data = data[['Series', 'Year', 'Month', 'Passengers']]
# check head
data.head()
| | Series | Year | Month | Passengers |
|---|---|---|---|---|
| 0 | 1 | 1949 | 1 | 112 |
| 1 | 2 | 1949 | 2 | 118 |
| 2 | 3 | 1949 | 3 | 132 |
| 3 | 4 | 1949 | 4 | 129 |
| 4 | 5 | 1949 | 5 | 121 |
# hold out 1960 (the last 12 months) as the test set
train = data[data['Year'] < 1960]
test = data[data['Year'] >= 1960]
train.shape, test.shape
((132, 4), (12, 4))
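Because the rows are already in chronological order, the same split could also be written positionally; a small equivalent sketch that holds out the last 12 monthly observations:

train = data.iloc[:-12]   # 1949-1959
test = data.iloc[-12:]    # 1960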
from pycaret.regression import *

# initialize setup: fold_strategy='timeseries' keeps cross-validation folds in
# chronological order; transform_target applies a Box-Cox transform to the target
s = setup(data=train, test_data=test,
          target='Passengers',
          fold_strategy='timeseries',
          numeric_features=['Year', 'Series'],
          fold=3,
          transform_target=True,
          session_id=123, silent=True)
| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | Passengers |
| 2 | Original Data | (132, 4) |
| 3 | Missing Values | False |
| 4 | Numeric Features | 2 |
| 5 | Categorical Features | 1 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method | None |
| 9 | Transformed Train Set | (132, 13) |
| 10 | Transformed Test Set | (12, 13) |
| 11 | Shuffle Train-Test | True |
| 12 | Stratify Train-Test | False |
| 13 | Fold Generator | TimeSeriesSplit |
| 14 | Fold Number | 3 |
| 15 | CPU Jobs | -1 |
| 16 | Use GPU | False |
| 17 | Log Experiment | False |
| 18 | Experiment Name | reg-default-name |
| 19 | USI | f21c |
| 20 | Imputation Type | simple |
| 21 | Iterative Imputation Iteration | None |
| 22 | Numeric Imputer | mean |
| 23 | Iterative Imputation Numeric Model | None |
| 24 | Categorical Imputer | constant |
| 25 | Iterative Imputation Categorical Model | None |
| 26 | Unknown Categoricals Handling | least_frequent |
| 27 | Normalize | False |
| 28 | Normalize Method | None |
| 29 | Transformation | False |
| 30 | Transformation Method | None |
| 31 | PCA | False |
| 32 | PCA Method | None |
| 33 | PCA Components | None |
| 34 | Ignore Low Variance | False |
| 35 | Combine Rare Levels | False |
| 36 | Rare Level Threshold | None |
| 37 | Numeric Binning | False |
| 38 | Remove Outliers | False |
| 39 | Outliers Threshold | None |
| 40 | Remove Multicollinearity | False |
| 41 | Multicollinearity Threshold | None |
| 42 | Clustering | False |
| 43 | Clustering Iteration | None |
| 44 | Polynomial Features | False |
| 45 | Polynomial Degree | None |
| 46 | Trignometry Features | False |
| 47 | Polynomial Threshold | None |
| 48 | Group Features | False |
| 49 | Feature Selection | False |
| 50 | Feature Selection Method | classic |
| 51 | Features Selection Threshold | None |
| 52 | Feature Interaction | False |
| 53 | Feature Ratio | False |
| 54 | Interaction Threshold | None |
| 55 | Transform Target | True |
| 56 | Transform Target Method | box-cox |
# check X_train index
get_config('X_train').index
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 122, 123, 124, 125, 126, 127, 128, 129, 130, 131], dtype='int64', length=132)
# check X_test index
get_config('X_test').index
Int64Index([132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143], dtype='int64')
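The Fold Generator reported in the setup grid is scikit-learn's TimeSeriesSplit, which always places each validation fold after its training fold so that no future observations leak into training. A quick sketch of how the 3 folds are laid out over the 132 training rows:

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(np.arange(132)):
    # training indices always precede validation indices
    print(train_idx[[0, -1]], val_idx[[0, -1]])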
# train all models using default hyperparameters
best = compare_models(sort = 'MAE')
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lar | Least Angle Regression | 22.3980 | 923.8654 | 28.2855 | 0.5621 | 0.0878 | 0.0746 | 0.0333 |
| lr | Linear Regression | 22.3981 | 923.8749 | 28.2856 | 0.5621 | 0.0878 | 0.0746 | 1.8967 |
| huber | Huber Regressor | 22.4192 | 891.5686 | 27.9350 | 0.5988 | 0.0879 | 0.0749 | 0.0300 |
| br | Bayesian Ridge | 22.4783 | 932.2165 | 28.5483 | 0.5611 | 0.0884 | 0.0746 | 0.0233 |
| ridge | Ridge Regression | 23.1976 | 1003.9426 | 30.0410 | 0.5258 | 0.0933 | 0.0764 | 1.3533 |
| lasso | Lasso Regression | 38.4188 | 2413.5109 | 46.8468 | 0.0882 | 0.1473 | 0.1241 | 1.5033 |
| en | Elastic Net | 40.6486 | 2618.8759 | 49.4048 | -0.0824 | 0.1563 | 0.1349 | 0.0300 |
| omp | Orthogonal Matching Pursuit | 44.3054 | 3048.2658 | 53.8613 | -0.4499 | 0.1713 | 0.1520 | 0.0233 |
| xgboost | Extreme Gradient Boosting | 46.7192 | 3791.0476 | 59.9683 | -0.5515 | 0.1962 | 0.1432 | 0.2433 |
| gbr | Gradient Boosting Regressor | 50.1217 | 4032.0567 | 61.2306 | -0.6189 | 0.2034 | 0.1538 | 0.0400 |
| rf | Random Forest Regressor | 52.3637 | 4647.0635 | 65.2883 | -0.7726 | 0.2131 | 0.1578 | 0.2267 |
| catboost | CatBoost Regressor | 53.6141 | 4414.8319 | 64.3184 | -0.7792 | 0.2161 | 0.1653 | 1.6533 |
| et | Extra Trees Regressor | 54.6312 | 4500.5115 | 64.0882 | -0.7207 | 0.2146 | 0.1675 | 0.2467 |
| ada | AdaBoost Regressor | 55.0753 | 5128.1587 | 68.9577 | -0.9915 | 0.2277 | 0.1667 | 0.0667 |
| dt | Decision Tree Regressor | 57.9293 | 6230.5556 | 70.9838 | -0.9553 | 0.2265 | 0.1700 | 0.0233 |
| knn | K Neighbors Regressor | 64.1165 | 7098.4735 | 78.7031 | -1.4511 | 0.2582 | 0.1882 | 0.0700 |
| lightgbm | Light Gradient Boosting Machine | 76.8521 | 8430.4943 | 91.0063 | -2.9097 | 0.3379 | 0.2490 | 0.0533 |
| llar | Lasso Least Angle Regression | 129.0182 | 21858.5806 | 138.1309 | -6.5554 | 0.5446 | 0.3958 | 0.0233 |
| par | Passive Aggressive Regressor | 156.1775 | 95107.3645 | 210.3616 | -93.7884 | 0.4304 | 0.6643 | 0.0267 |
print(best)
PowerTransformedTargetRegressor(copy_X=True, eps=2.220446049250313e-16,
                                fit_intercept=True, fit_path=True, jitter=None,
                                n_nonzero_coefs=500, normalize=True,
                                power_transformer_method='box-cox',
                                power_transformer_standardize=True,
                                precompute='auto', random_state=123,
                                regressor=Lars(copy_X=True,
                                               eps=2.220446049250313e-16,
                                               fit_intercept=True, fit_path=True,
                                               jitter=None, n_nonzero_coefs=500,
                                               normalize=True, precompute='auto',
                                               random_state=123, verbose=False),
                                verbose=False)
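compare_models trains every model with its default hyperparameters. As an optional next step (a sketch), the best model could be tuned with PyCaret's tune_model, which searches a predefined hyperparameter grid and optimizes the chosen metric:

# optional: tune the best model's hyperparameters, optimizing MAE
tuned_best = tune_model(best, optimize='MAE')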
# check on hold-out
pred_holdout = predict_model(best);
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Least Angle Regression | 25.0714 | 972.2733 | 31.1813 | 0.8245 | 0.0692 | 0.0571 |
evaluate_model(best)
(Output: an interactive widget with toggle buttons to switch between the available plot types, starting with Hyperparameters.)
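evaluate_model only works inside a notebook. In a plain script, individual plots can be rendered with plot_model instead; a sketch using two of the available plot types:

# render individual diagnostic plots without the notebook widget
plot_model(best, plot='residuals')
plot_model(best, plot='error')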
# generate predictions on the entire dataset
predictions = predict_model(best, data=data)
predictions['Date'] = pd.date_range(start='1949-01-01', end='1960-12-01', freq='MS')

import plotly.express as px
fig = px.line(predictions, x='Date', y=['Passengers', 'Label'], template='plotly_dark')
# shade the 1960 hold-out period
fig.add_vrect(x0='1960-01-01', x1='1960-12-01', fillcolor='grey', opacity=0.25, line_width=0)
fig.show()
# create future dataset to score (Jan 1961 through Jan 1965)
future_dates = pd.date_range(start='1961-01-01', end='1965-01-01', freq='MS')
future_df = pd.DataFrame()
future_df['Month'] = [i.month for i in future_dates]
future_df['Year'] = [i.year for i in future_dates]
# Series continues from 145, since the full dataset ends at 144
future_df['Series'] = np.arange(145, 145 + len(future_dates))
future_df.head()
| | Month | Year | Series |
|---|---|---|---|
| 0 | 1 | 1961 | 145 |
| 1 | 2 | 1961 | 146 |
| 2 | 3 | 1961 | 147 |
| 3 | 4 | 1961 | 148 |
| 4 | 5 | 1961 | 149 |
# finalize model: retrain the pipeline on the entire dataset, including the hold-out set
final_best = finalize_model(best)
# generate predictions on future dataset
predictions_future = predict_model(final_best, data=future_df)
predictions_future.head()
| | Month | Year | Series | Label |
|---|---|---|---|---|
| 0 | 1 | 1961 | 145 | 486.278268 |
| 1 | 2 | 1961 | 146 | 482.208187 |
| 2 | 3 | 1961 | 147 | 550.485967 |
| 3 | 4 | 1961 | 148 | 535.187177 |
| 4 | 5 | 1961 | 149 | 538.923789 |
# plot historic and predicted values
concat_df = pd.concat([data, predictions_future], axis=0)
concat_df_i = pd.date_range(start='1949-01-01', end='1965-01-01', freq='MS')
concat_df.set_index(concat_df_i, inplace=True)

import plotly.express as px
fig = px.line(concat_df, x=concat_df.index, y=['Passengers', 'Label'], template='plotly_dark')
fig.show()
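Finally, the trained pipeline can be persisted for later use with PyCaret's save_model and load_model (a sketch; the file name is arbitrary):

# save the entire pipeline (transformations + model) as a pickle file
save_model(final_best, 'final_best_pipeline')

# later, in a new session
loaded_best = load_model('final_best_pipeline')
predictions = predict_model(loaded_best, data=future_df)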