Binary Classification Tutorial (CLF101) - Level Beginner¶

In [ ]:

! pip uninstall pycaret
!pip install git+https://github.com/amjadraza/pycaret.git@feature/gcp_zure_integration

In [2]:

from pycaret.classification import *

In [3]:

from pycaret.datasets import get_data
dataset = get_data('credit')

	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_1	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default
0	20000	2	2	1	24	2	2	-1	-1	-2	-2	3913.0	3102.0	689.0	0.0	0.0	0.0	0.0	689.0	0.0	0.0	0.0	0.0	1
1	90000	2	2	2	34	0	0	0	0	0	0	29239.0	14027.0	13559.0	14331.0	14948.0	15549.0	1518.0	1500.0	1000.0	1000.0	1000.0	5000.0	0
2	50000	2	2	1	37	0	0	0	0	0	0	46990.0	48233.0	49291.0	28314.0	28959.0	29547.0	2000.0	2019.0	1200.0	1100.0	1069.0	1000.0	0
3	50000	1	2	1	57	-1	0	-1	0	0	0	8617.0	5670.0	35835.0	20940.0	19146.0	19131.0	2000.0	36681.0	10000.0	9000.0	689.0	679.0	0
4	50000	1	1	2	37	0	0	0	0	0	0	64400.0	57069.0	57608.0	19394.0	19619.0	20024.0	2500.0	1815.0	657.0	1000.0	1000.0	800.0	0

In [4]:

data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (22800, 24)
Unseen Data For Predictions: (1200, 24)

In [ ]:

exp_clf101 = setup(data = data, target = 'default', session_id=123) 

In [6]:

rf = create_model('rf')

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	0.8095	0.7531	0.3428	0.6269	0.4432	0.3400	0.3626
1	0.8127	0.7451	0.3399	0.6452	0.4453	0.3453	0.3710
2	0.8076	0.7714	0.3258	0.6250	0.4283	0.3262	0.3512
3	0.7989	0.7185	0.3144	0.5842	0.4088	0.3006	0.3215
4	0.8051	0.7249	0.3229	0.6129	0.4230	0.3191	0.3428
5	0.8152	0.7324	0.3569	0.6495	0.4607	0.3603	0.3839
6	0.8039	0.7244	0.3371	0.6010	0.4319	0.3246	0.3444
7	0.8158	0.7711	0.3399	0.6630	0.4494	0.3523	0.3807
8	0.8139	0.7183	0.3258	0.6609	0.4364	0.3400	0.3706
9	0.8107	0.7419	0.3569	0.6269	0.4549	0.3506	0.3710
Mean	0.8093	0.7401	0.3363	0.6295	0.4382	0.3359	0.3600
SD	0.0052	0.0190	0.0134	0.0243	0.0149	0.0172	0.0186

In [7]:

tuned_rf = tune_model(rf)

	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	0.8258	0.7863	0.3654	0.7049	0.4813	0.3891	0.4194
1	0.8227	0.7977	0.3541	0.6944	0.4690	0.3758	0.4066
2	0.8233	0.8225	0.3853	0.6766	0.4910	0.3937	0.4165
3	0.8177	0.7713	0.3598	0.6615	0.4661	0.3675	0.3923
4	0.8227	0.7805	0.3513	0.6966	0.4670	0.3743	0.4059
5	0.8227	0.7955	0.3683	0.6842	0.4788	0.3834	0.4101
6	0.8158	0.7568	0.3371	0.6648	0.4474	0.3507	0.3799
7	0.8377	0.7941	0.3768	0.7733	0.5067	0.4231	0.4623
8	0.8227	0.7671	0.3569	0.6923	0.4710	0.3773	0.4073
9	0.8138	0.7833	0.3654	0.6386	0.4649	0.3621	0.3828
Mean	0.8225	0.7855	0.3620	0.6887	0.4743	0.3797	0.4083
SD	0.0062	0.0176	0.0128	0.0339	0.0154	0.0188	0.0220

In [8]:

predict_model(tuned_rf);

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	Random Forest Classifier	0.8135	0.7563	0.3245	0.6591	0.4349	0.3383	0.3688

12.0 Finalize Model for Deployment¶

In [9]:

final_rf = finalize_model(tuned_rf)

In [10]:

#Final Random Forest model parameters for deployment
print(final_rf)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, n_estimators=70, n_jobs=-1,
                       oob_score=False, random_state=123, verbose=0,
                       warm_start=False)

In [11]:

predict_model(final_rf);

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
0	Random Forest Classifier	0.8345	0.8222	0.3629	0.7657	0.4924	0.4082	0.4489

13.0 Deploy Model on Microsoft Azure¶

This is the code to deploy model on Microsft azure using pycaret functionalities.

In [12]:

! pip install azure-storage-blob

Requirement already satisfied: azure-storage-blob in /usr/local/lib/python3.6/dist-packages (12.3.2)
Requirement already satisfied: cryptography>=2.1.4 in /usr/local/lib/python3.6/dist-packages (from azure-storage-blob) (3.0)
Requirement already satisfied: msrest>=0.6.10 in /usr/local/lib/python3.6/dist-packages (from azure-storage-blob) (0.6.18)
Requirement already satisfied: azure-core<2.0.0,>=1.6.0 in /usr/local/lib/python3.6/dist-packages (from azure-storage-blob) (1.7.0)
Requirement already satisfied: six>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from cryptography>=2.1.4->azure-storage-blob) (1.15.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in /usr/local/lib/python3.6/dist-packages (from cryptography>=2.1.4->azure-storage-blob) (1.14.1)
Requirement already satisfied: requests~=2.16 in /usr/local/lib/python3.6/dist-packages (from msrest>=0.6.10->azure-storage-blob) (2.23.0)
Requirement already satisfied: requests-oauthlib>=0.5.0 in /usr/local/lib/python3.6/dist-packages (from msrest>=0.6.10->azure-storage-blob) (1.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from msrest>=0.6.10->azure-storage-blob) (2020.6.20)
Requirement already satisfied: isodate>=0.6.0 in /usr/local/lib/python3.6/dist-packages (from msrest>=0.6.10->azure-storage-blob) (0.6.0)
Requirement already satisfied: pycparser in /usr/local/lib/python3.6/dist-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.1.4->azure-storage-blob) (2.20)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests~=2.16->msrest>=0.6.10->azure-storage-blob) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests~=2.16->msrest>=0.6.10->azure-storage-blob) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests~=2.16->msrest>=0.6.10->azure-storage-blob) (1.24.3)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from requests-oauthlib>=0.5.0->msrest>=0.6.10->azure-storage-blob) (3.1.0)

In [ ]:

## Enter connection string when running in google colab
connect_str = '' #@param {type:"string"}
print(connect_str)

In [14]:

#! export AZURE_STORAGE_CONNECTION_STRING=connect_str

In [21]:

os.environ['AZURE_STORAGE_CONNECTION_STRING']= connect_str

In [ ]:

! echo $AZURE_STORAGE_CONNECTION_STRING

In [ ]:

os.getenv('AZURE_STORAGE_CONNECTION_STRING')

In [25]:

authentication = {'container': 'pycaret-cls-101'}
model_name = 'rf-clf-101'
deploy_model(final_rf, model_name, authentication, platform = 'azure')

Deploying model to Microsoft Azure

Uploading to Azure Storage as blob:
	rf-clf-101.pkl

In [27]:

authentication = {'container': 'pycaret-cls-101'}
model_name = 'rf-clf-101'
model_azure = load_model(model_name, 
               platform = 'azure', 
               authentication = authentication,
               verbose=True)

Loading model from Microsoft Azure

Downloading blob to 
	rf-clf-101.pkl
Blob rf-clf-101.pkl downloaded to rf-clf-101.pkl.
Transformation Pipeline and Model Successfully Loaded

In [29]:

authentication = {'container': 'pycaret-cls-101'}
model_name = 'rf-clf-101'
unseen_predictions = predict_model(model_name, data=data_unseen, platform='azure', authentication=authentication, verbose=True)

Loading model from Microsoft Azure

Downloading blob to 
	rf-clf-101.pkl
Blob rf-clf-101.pkl downloaded to rf-clf-101.pkl.
Transformation Pipeline and Model Successfully Loaded

In [30]:

unseen_predictions

Out[30]:

	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_1	PAY_2	PAY_3	PAY_4	PAY_5	PAY_6	BILL_AMT1	BILL_AMT2	BILL_AMT3	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default	Label	Score
0	50000	2	2	1	48	0	0	0	0	0	0	48572.0	45067.0	46492.0	47368.0	7988.0	8011.0	2028.0	2453.0	2329.0	431.0	300.0	500.0	0	0	0.1591
1	200000	2	1	1	40	2	2	2	2	2	2	80468.0	82874.0	84900.0	85758.0	87003.0	89112.0	4200.0	4100.0	3000.0	3400.0	3500.0	0.0	1	1	0.7779
2	50000	2	3	1	44	1	2	3	2	4	3	13112.0	14679.0	15143.0	16892.0	16341.0	15798.0	2100.0	1000.0	2300.0	0.0	0.0	0.0	1	1	0.6478
3	60000	2	2	1	31	2	2	-1	0	0	0	63201.0	56600.0	54952.0	32094.0	31232.0	30384.0	1132.0	60994.0	1436.0	1047.0	1056.0	1053.0	1	1	0.5038
4	120000	2	3	2	32	-1	0	0	0	0	0	66551.0	67876.0	69903.0	71446.0	79589.0	81354.0	2429.0	3120.0	3300.0	10000.0	3200.0	3200.0	0	0	0.1394
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1195	80000	1	2	2	34	2	2	2	2	2	2	72557.0	77708.0	79384.0	77519.0	82607.0	81158.0	7000.0	3500.0	0.0	7000.0	0.0	4000.0	1	1	0.7523
1196	150000	1	3	2	43	-1	-1	-1	-1	0	0	1683.0	1828.0	3502.0	8979.0	5190.0	0.0	1837.0	3526.0	8998.0	129.0	0.0	0.0	0	0	0.1499
1197	30000	1	2	2	37	4	3	2	-1	0	0	3565.0	3356.0	2758.0	20878.0	20582.0	19357.0	0.0	0.0	22000.0	4200.0	2000.0	3100.0	1	0	0.4876
1198	80000	1	3	1	41	1	-1	0	0	0	-1	-1645.0	78379.0	76304.0	52774.0	11855.0	48944.0	85900.0	3409.0	1178.0	1926.0	52964.0	1804.0	1	0	0.2613
1199	50000	1	2	1	46	0	0	0	0	0	0	47929.0	48905.0	49764.0	36535.0	32428.0	15313.0	2078.0	1800.0	1430.0	1000.0	1000.0	1000.0	1	0	0.1569

1200 rows × 26 columns

In [ ]:

13.0 Deploy Model on Google Cloud¶

After the model is finalised and you are happy with the model, you can deploy the model on your cloud of choice. In this section, we deploy the model on the google cloud platform.

In [ ]:

from google.colab import auth
auth.authenticate_user()

In [ ]:

! pip install awscli

In [ ]:

# GCP project name, Change the name based on your own GCP project.
CLOUD_PROJECT = 'gcpessentials-rz' # GCP project name
bucket_name = 'pycaret-clf101-test1' # bucket name for storage of your model
BUCKET = 'gs://' + CLOUD_PROJECT + '-{}'.format(bucket_name)
# Set the gcloud consol to $CLOUD_PROJECT Environment Variable for your Desired Project)
!gcloud config set project $CLOUD_PROJECT

In [ ]:

authentication = {'project': CLOUD_PROJECT, 'bucket' : bucket_name}
model_name = 'rf-clf'
deploy_model(final_rf, model_name, authentication, platform = 'gcp')

In [ ]:

authentication = {'project': CLOUD_PROJECT, 'bucket' : bucket_name}
model_name = 'rf-clf'
model_gcp = load_model(model_name, 
               platform = 'gcp', 
               authentication = authentication,
               verbose=True)

In [ ]:

estimator_ = load_model(model_name, platform='gcp',
                                   authentication=authentication,
                                   verbose=True)

In [ ]:

authentication = {'project': CLOUD_PROJECT, 'bucket' : bucket_name}
model_name = 'rf-clf'
unseen_predictions = predict_model(model_name, data=data_unseen, platform='gcp', authentication=authentication, verbose=True)

In [ ]:

unseen_predictions

In [ ]:

authentication

In [ ]:

import inspect as i
import sys
sys.stdout.write(i.getsource(predict_model))

13.0 Predict on unseen data¶

The predict_model() function is also used to predict on the unseen dataset. The only difference from section 11 above is that this time we will pass the data_unseen parameter. data_unseen is the variable created at the beginning of the tutorial and contains 5% (1200 samples) of the original dataset which was never exposed to PyCaret. (see section 5 for explanation)

In [ ]:

unseen_predictions = predict_model(final_rf, data=data_unseen)
unseen_predictions.head()

The Label and Score columns are added onto the data_unseen set. Label is the prediction and score is the probability of the prediction. Notice that predicted results are concatenated to the original dataset while all the transformations are automatically performed in the background.

14.0 Saving the model¶

We have now finished the experiment by finalizing the tuned_rf model which is now stored in final_rf variable. We have also used the model stored in final_rf to predict data_unseen. This brings us to the end of our experiment, but one question is still to be asked: What happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no, PyCaret's inbuilt function save_model() allows you to save the model along with entire transformation pipeline for later use.

In [ ]:

save_model(final_rf,'Final RF Model 08Feb2020')

(TIP : It's always good to use date in the filename when saving models, it's good for version control.)

15.0 Loading the saved model¶

To load a saved model at a future date in the same or an alternative environment, we would use PyCaret's load_model() function and then easily apply the saved model on new unseen data for prediction.

In [ ]:

saved_final_rf = load_model('Final RF Model 08Feb2020')

Once the model is loaded in the environment, you can simply use it to predict on any new data using the same predict_model() function. Below we have applied the loaded model to predict the same data_unseen that we used in section 13 above.

In [ ]:

new_prediction = predict_model(saved_final_rf, data=data_unseen)

In [ ]:

new_prediction.head()

Notice that the results of unseen_predictions and new_prediction are identical.

16.0 Wrap-up / Next Steps?¶

This tutorial has covered the entire machine learning pipeline from data ingestion, pre-processing, training the model, hyperparameter tuning, prediction and saving the model for later use. We have completed all of these steps in less than 10 commands which are naturally constructed and very intuitive to remember such as create_model(), tune_model(), compare_models(). Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code in most libraries.

We have only covered the basics of pycaret.classification. In following tutorials we will go deeper into advanced pre-processing, ensembling, generalized stacking and other techniques that allow you to fully customize your machine learning pipeline and are must know for any data scientist.

See you at the next tutorial. Follow the link to Binary Classification Tutorial (CLF102) - Intermediate Level