Last Update: 08/02/2020
PyCaret Version: 2.0
Author: Alexandre Farias
Email: afarias@tuta.io
This task consists of comparing two sentences from the dataset and identifying whether both have the same meaning.
An Exploratory Data Analysis is performed to gain insights about the data, followed by Topic Modelling to extract the features, and then the classification step; these last two steps are done with the Python module PyCaret.
First, the required modules are imported; PyCaret is imported later to avoid conflicts between the experiments.
The dataset used in this work is Text Similarity, hosted on Kaggle. Only the train set is used, since the test set has few samples and lacks the labels needed to validate the model.
# Standard
import pandas as pd
# Plots
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objects as go
from plotly.subplots import make_subplots
init_notebook_mode(connected=True)
import seaborn as sns
# Sklearn tools
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import *
# PATH and setup
import os
os.chdir("..")
PATH = os.getcwd()+os.sep
RANDOM_SEED = 42
K_FOLDS = 5
The helper functions used for plotting, data sampling, and scoring.
# Dataset Sampling
def data_sampling(dataset, frac: float, random_seed: int):
    data_sampled_a = dataset.sample(frac=frac, random_state=random_seed)
    data_sampled_b = dataset.drop(data_sampled_a.index).\
        reset_index(drop=True)
    data_sampled_a.reset_index(drop=True, inplace=True)
    return data_sampled_a, data_sampled_b
# Pie Chart
def pie_plot(data, plot_title: str):
    trace = go.Pie(labels=data.index, values=data.values)
    layout = go.Layout(autosize=False,
                       title={"text": plot_title,
                              "y": 0.9,
                              "x": 0.5,
                              "xanchor": "center",
                              "yanchor": "top"})
    fig = go.Figure(data=trace, layout=layout)
    fig.update_traces(textfont_size=14,
                      marker=dict(line=dict(color="#000000", width=2)))
    fig.update_yaxes(automargin=True)
    iplot(fig)
# Histogram subplots
def histogram_subplot(dataset_a, dataset_b, feature_a: str,
                      feature_b: str, title: str,
                      title_a: str, title_b: str):
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=(title_a, title_b))
    fig.add_trace(go.Histogram(x=dataset_a[feature_a],
                               showlegend=False), row=1, col=1)
    fig.add_trace(go.Histogram(x=dataset_b[feature_b],
                               showlegend=False), row=1, col=2)
    fig.update_layout(template="simple_white")
    fig.update_layout(autosize=False,
                      title={"text": title,
                             "y": 0.9,
                             "x": 0.5,
                             "xanchor": "center",
                             "yanchor": "top"},
                      yaxis={"title": "<i>Frequency</i>"})
    fig.update_traces(marker=dict(line=dict(color="#000000", width=2)))
    fig.update_yaxes(automargin=True)
    iplot(fig)
# Calculate scores with Test/Unseen labeled data
def test_score_report(data_unseen, predict_unseen):
    le = LabelEncoder()
    data_unseen["Label"] = le.fit_transform(data_unseen.same_security.values)
    data_unseen["Label"] = data_unseen["Label"].astype(int)
    accuracy = accuracy_score(data_unseen["Label"], predict_unseen["Label"])
    roc_auc = roc_auc_score(data_unseen["Label"], predict_unseen["Label"])
    precision = precision_score(data_unseen["Label"], predict_unseen["Label"])
    recall = recall_score(data_unseen["Label"], predict_unseen["Label"])
    f1 = f1_score(data_unseen["Label"], predict_unseen["Label"])
    df_unseen = pd.DataFrame({
        "Accuracy": [accuracy],
        "AUC": [roc_auc],
        "Recall": [recall],
        "Precision": [precision],
        "F1 Score": [f1]
    })
    return df_unseen
# Confusion Matrix
def conf_mat(data_unseen, predict_unseen):
    unique_label = data_unseen["Label"].unique()
    cmtx = pd.DataFrame(
        confusion_matrix(data_unseen["Label"],
                         predict_unseen["Label"], labels=unique_label),
        index=['{:}'.format(x) for x in unique_label],
        columns=['{:}'.format(x) for x in unique_label])
    ax = sns.heatmap(cmtx, annot=True, fmt="d", cmap="YlGnBu")
    # confusion_matrix places true labels on rows, predictions on columns
    ax.set_ylabel('Target')
    ax.set_xlabel('Predicted')
    ax.set_title("Predict Unseen Confusion Matrix", size=14)
The data is loaded into a DataFrame.
data = pd.read_csv(PATH+"data"+os.sep+"train.csv")
data.head(3)
Unnamed: 0 | description_x | description_y | ticker_x | ticker_y | same_security | |
---|---|---|---|---|---|---|
0 | 0 | first trust dow jones internet | first trust dj internet idx | FDN | FDN | True |
1 | 1 | schwab intl large company index etf | schwab strategic tr fundamental intl large co ... | FNDF | FNDF | True |
2 | 2 | vanguard small cap index adm | vanguard small-cap index fund inst | VSMAX | VSCIX | False |
A check for duplicated samples.
data.drop_duplicates(inplace=True, keep='first')
There are no duplicated samples in the dataset.
Time to drop the features that will not help with the prediction and get some information about the data.
data.drop(columns=['Unnamed: 0', 'ticker_x', 'ticker_y'],
inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2142 entries, 0 to 2141
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   description_x  2142 non-null   object
 1   description_y  2142 non-null   object
 2   same_security  2142 non-null   bool
dtypes: bool(1), object(2)
memory usage: 52.3+ KB
And check for missing values.
data.isnull().sum()
description_x    0
description_y    0
same_security    0
dtype: int64
The dataset doesn't contain any missing values, which is very good! A simple way to compare the sentences is to check their character lengths, so let's create these features for each sentence.
data['len_x'] = data['description_x'].apply(len)
data['len_y'] = data['description_y'].apply(len)
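Beyond raw character lengths, a simple token-overlap measure could be added the same way; the jaccard_similarity helper below is only an illustration, not part of this notebook's pipeline:

```python
# Hypothetical extra feature: Jaccard similarity between the token sets
# of the two descriptions (illustration only, not used in this notebook).
def jaccard_similarity(a: str, b: str) -> float:
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Pair from the dataset's first row: 3 shared tokens out of 7 total
score = jaccard_similarity("first trust dow jones internet",
                           "first trust dj internet idx")
```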
Let's check the distribution of the target feature, same_security.
series_aux = data['same_security'].value_counts()
pie_plot(series_aux, 'Security Values')
It's a case of imbalanced classes, which will be handled by evaluating the models with the F1-Score. I will not use any resampling technique, such as random oversampling, SMOTE, etc.
A good insight for this problem is to look at the distribution of the character length of each sentence per security value, starting with the True values.
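To see why plain accuracy would be misleading here, a small sketch with scikit-learn's metrics on toy labels with the same rough 75/25 imbalance (the always-majority predictor is a hypothetical baseline):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with a ~75/25 imbalance like same_security.
# A degenerate predictor that always answers the majority class
# looks fine on accuracy but scores zero F1 on the minority class.
y_true = [1] * 75 + [0] * 25
y_pred = [1] * 100  # hypothetical always-majority predictor

acc = accuracy_score(y_true, y_pred)                 # 0.75
f1_minority = f1_score(y_true, y_pred, pos_label=0)  # 0.0
```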
data_true = data.query('same_security == True')
histogram_subplot(data_true, data_true, 'len_x', 'len_y',
'<b>Character length distribution for True Security',
'Description X Length', 'Description Y Length')
Sentences labeled True have a similar distribution, even in the frequency of the values; for now, it seems a good idea to use the character length as a feature.
To test this idea further, let's check it for the False values.
data_false = data.query('same_security == False')
histogram_subplot(data_false, data_false, 'len_x', 'len_y',
'<b>Character length distribution for False Security',
'Description X Length', 'Description Y Length')
And for the False values the distribution of each sentence is different, so I will feed both character length features to the model.
PyCaret has an NLP module that can automate most of the boring stuff, like lowercasing, stop-word removal, stemming, etc. So, a good amount of this part is just setting PyCaret up to run.
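As a rough illustration of what that preprocessing does (the tiny stop-word list below is a made-up stand-in, not PyCaret's actual pipeline, which also stems words, builds bigrams, etc.):

```python
# Stand-in preprocessing: lowercase the text and drop stop words.
STOP_WORDS = {"the", "a", "an", "of", "for", "and"}

def naive_preprocess(text: str) -> list:
    return [t for t in text.lower().split() if t not in STOP_WORDS]

cleaned = naive_preprocess("Index Fund of the Vanguard Group")
```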
Let's import the module.
from pycaret.nlp import *
The PyCaret setup is simple: just pass the data, the text column to process, and a random seed for the session.
It's possible to add custom stop words, but I will use the default from PyCaret.
Let's start with the text from description_x.
exp_x = setup(data=data, target='description_x', session_id=RANDOM_SEED)
Description | Value |
---|---|
session_id | 42 |
Documents | 2142 |
Vocab Size | 310 |
Custom Stopwords | False |
There are 310 words in the vocabulary for description_x.
To get the features for the model, Topic Modelling is done with PyCaret; the model (an LDA model) is tuned to find the best number of topics.
tuned_model_x = tune_model(model='lda', supervised_target='same_security')
Best Model: Latent Dirichlet Allocation | # Topics: 2 | Accuracy : 0.8697
Based on the F1-Score, the model doesn't show a high variation in the score, so I will use 4 topics, to keep it simple.
Time to create an LDA model with 4 topics and see some information about the topic-modelled data.
You can click on each box to see information like word frequencies, a word cloud, etc.
model_x = create_model('lda', num_topics=4)
The topic weights and other information are assigned to the documents for later use in the classification task.
model_x_results = assign_model(model_x)
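The topic-weight features that assign_model produces can be illustrated with scikit-learn's LDA on a few toy descriptions (PyCaret uses its own topic-modelling backend; this sketch only shows the shape of the resulting features):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus mimicking the fund descriptions
docs = [
    "first trust dow jones internet",
    "first trust dj internet idx",
    "vanguard small cap index adm",
    "vanguard small cap index fund inst",
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=4, random_state=42)
topic_weights = lda.fit_transform(counts)

# One row of 4 topic weights per document, each row summing to ~1,
# analogous to the Topic_0..Topic_3 columns assigned to the data.
print(topic_weights.shape)  # (4, 4)
```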
Now, all the steps are repeated for description_y.
exp_y = setup(data=data, target='description_y', session_id=RANDOM_SEED)
Description | Value |
---|---|
session_id | 42 |
Documents | 2142 |
Vocab Size | 310 |
Custom Stopwords | False |
Again, 310 words are present in the vocabulary for description_y.
tuned_model_y = tune_model(model='lda', supervised_target='same_security')
Best Model: Latent Dirichlet Allocation | # Topics: 64 | Accuracy : 0.8441
The result of the tuned model is similar to before, so 4 topics are used again.
model_y = create_model('lda', num_topics=4)
And assign the topics to the documents.
model_y_results = assign_model(model_y)
Topic Modelling is done, and it's time for the classification.
To start, the unused columns are dropped and the results are concatenated into one dataframe.
data_topics = model_x_results.drop(columns=['description_x', 'description_y',
'Dominant_Topic', 'Perc_Dominant_Topic'])
data_topics['Topic_0_y'] = model_y_results['Topic_0']
data_topics['Topic_1_y'] = model_y_results['Topic_1']
data_topics['Topic_2_y'] = model_y_results['Topic_2']
data_topics['Topic_3_y'] = model_y_results['Topic_3']
To avoid errors with PyCaret on the label feature, same_security is converted to object and the labels are renamed to Yes for True and No for False.
10% of the data is sampled to be used as unseen data to validate the final model.
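The 90/10 split performed by the data_sampling helper boils down to a pandas sample/drop pair; a quick sketch on a toy frame:

```python
import pandas as pd

# 90% of the rows are sampled for training, the remaining
# 10% are held out as unseen data, with no overlap.
toy = pd.DataFrame({"x": range(100)})
sampled = toy.sample(frac=0.9, random_state=42)
held_out = toy.drop(sampled.index).reset_index(drop=True)

print(len(sampled), len(held_out))  # 90 10
```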
from pycaret.classification import *
data_topics['same_security'] = data_topics['same_security'].astype('str')
data_topics_dict = {'True' : 'Yes', 'False' : 'No'}
data_topics['same_security'] = data_topics['same_security'].replace(data_topics_dict)
train, unseen = data_sampling(data_topics, 0.9, RANDOM_SEED)
data_topics.head(5)
same_security | len_x | len_y | Topic_0 | Topic_1 | Topic_2 | Topic_3 | Topic_0_y | Topic_1_y | Topic_2_y | Topic_3_y | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Yes | 30 | 27 | 0.272233 | 0.162907 | 0.156305 | 0.408555 | 0.056694 | 0.123293 | 0.066383 | 0.753630 |
1 | Yes | 35 | 55 | 0.088963 | 0.053199 | 0.219810 | 0.638028 | 0.460700 | 0.177548 | 0.189938 | 0.171814 |
2 | No | 28 | 34 | 0.387918 | 0.080165 | 0.076917 | 0.455000 | 0.056584 | 0.745060 | 0.066187 | 0.132170 |
3 | Yes | 53 | 60 | 0.272233 | 0.162907 | 0.156305 | 0.408555 | 0.034559 | 0.320199 | 0.578683 | 0.066559 |
4 | Yes | 16 | 9 | 0.272233 | 0.162907 | 0.156305 | 0.408555 | 0.161175 | 0.347640 | 0.188525 | 0.302660 |
The same setup as before, but now PyCaret splits the dataset into 70% for Train and 30% for Test.
No feature transformation, scaling, or normalization is used.
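PyCaret's internal split is roughly equivalent to scikit-learn's train_test_split with train_size=0.7; a sketch reproducing the 1349/579 sizes shown in the setup table below (PyCaret also encodes the target and builds a preprocessing pipeline, so this mirrors only the proportions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A frame with as many rows as the train set (1928)
frame = pd.DataFrame({"feature": range(1928)})
train_part, test_part = train_test_split(frame, train_size=0.7,
                                         random_state=42)

print(len(train_part), len(test_part))  # 1349 579
```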
exp_clf = setup(data=train,
target='same_security',
session_id=RANDOM_SEED,
experiment_name="text-similarity",
log_experiment=True,
silent=True)
Setup Succesfully Completed!
Description | Value | |
---|---|---|
0 | session_id | 42 |
1 | Target Type | Binary |
2 | Label Encoded | No: 0, Yes: 1 |
3 | Original Data | (1928, 11) |
4 | Missing Values | False |
5 | Numeric Features | 10 |
6 | Categorical Features | 0 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (1928, 11) |
11 | Transformed Train Set | (1349, 10) |
12 | Transformed Test Set | (579, 10) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | False |
16 | Normalize Method | None |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | False |
29 | Multicollinearity Threshold | None |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
42 | Fix Imbalance | False |
43 | Fix Imbalance Method | SMOTE |
Compare the models sorted by F1-Score to get the best.
top_model = compare_models(sort='F1',
fold=K_FOLDS,
n_select=3)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
---|---|---|---|---|---|---|---|---|---|
0 | CatBoost Classifier | 0.8480 | 0.8613 | 0.9397 | 0.8687 | 0.9026 | 0.5585 | 0.5692 | 7.4354 |
1 | Extreme Gradient Boosting | 0.8480 | 0.8696 | 0.9159 | 0.8857 | 0.9003 | 0.5804 | 0.5833 | 0.1665 |
2 | Light Gradient Boosting Machine | 0.8436 | 0.8665 | 0.9179 | 0.8791 | 0.8979 | 0.5638 | 0.5681 | 0.2272 |
3 | Gradient Boosting Classifier | 0.8384 | 0.8578 | 0.9367 | 0.8602 | 0.8967 | 0.5283 | 0.5397 | 0.2618 |
4 | Extra Trees Classifier | 0.8221 | 0.8237 | 0.9021 | 0.8664 | 0.8838 | 0.5051 | 0.5076 | 0.3498 |
5 | Ada Boost Classifier | 0.8073 | 0.8133 | 0.9179 | 0.8398 | 0.8770 | 0.4351 | 0.4454 | 0.1626 |
6 | Random Forest Classifier | 0.8028 | 0.8187 | 0.8635 | 0.8722 | 0.8675 | 0.4810 | 0.4821 | 0.1271 |
7 | Ridge Classifier | 0.7621 | 0.0000 | 0.9634 | 0.7742 | 0.8584 | 0.1636 | 0.2199 | 0.0085 |
8 | Logistic Regression | 0.7583 | 0.7011 | 0.9594 | 0.7728 | 0.8561 | 0.1538 | 0.2025 | 0.0924 |
9 | Linear Discriminant Analysis | 0.7554 | 0.6951 | 0.9376 | 0.7802 | 0.8516 | 0.1859 | 0.2202 | 0.0165 |
10 | Decision Tree Classifier | 0.7687 | 0.7171 | 0.8230 | 0.8626 | 0.8421 | 0.4105 | 0.4129 | 0.0116 |
11 | K Neighbors Classifier | 0.7398 | 0.6488 | 0.8843 | 0.7927 | 0.8359 | 0.2162 | 0.2237 | 0.0083 |
12 | Quadratic Discriminant Analysis | 0.7494 | 0.7744 | 0.8249 | 0.8383 | 0.8315 | 0.3432 | 0.3436 | 0.0078 |
13 | Naive Bayes | 0.7324 | 0.7417 | 0.8268 | 0.8177 | 0.8220 | 0.2817 | 0.2825 | 0.0054 |
14 | SVM - Linear Kernel | 0.6114 | 0.0000 | 0.6576 | 0.8290 | 0.6588 | 0.1025 | 0.1357 | 0.0150 |
The best model by F1-Score is the CatBoost Classifier, but the difference from XGBoost is minimal and XGBoost got better scores on the other metrics, so let's take XGBoost as the base model. Tune the model to see if we can get any improvement.
tuned_model = tune_model(top_model[1], optimize='F1',
choose_better=True, fold=K_FOLDS);
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.8444 | 0.8865 | 0.9163 | 0.8815 | 0.8986 | 0.5658 | 0.5677 |
1 | 0.8444 | 0.8241 | 0.9109 | 0.8846 | 0.8976 | 0.5748 | 0.5758 |
2 | 0.8519 | 0.8944 | 0.9109 | 0.8932 | 0.9020 | 0.5990 | 0.5995 |
3 | 0.8407 | 0.8421 | 0.9406 | 0.8597 | 0.8983 | 0.5342 | 0.5459 |
4 | 0.8216 | 0.8442 | 0.9059 | 0.8632 | 0.8841 | 0.4979 | 0.5007 |
Mean | 0.8406 | 0.8583 | 0.9169 | 0.8765 | 0.8961 | 0.5543 | 0.5579 |
SD | 0.0102 | 0.0273 | 0.0123 | 0.0129 | 0.0062 | 0.0350 | 0.0334 |
The tuned model doesn't bring any improvement, so the base model remains the best.
Time to build a Bagging Ensemble.
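Under the hood, ensemble_model with the default method wraps the base estimator in scikit-learn's BaggingClassifier (as the hyperparameter table further down confirms); a minimal sketch with a DecisionTreeClassifier standing in for XGBoost:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)
# 10 copies of the base estimator, each fit on a bootstrap sample,
# combined by voting
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                           n_estimators=10, random_state=42)
bagged.fit(X, y)

print(len(bagged.estimators_))  # 10
```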
bagged_model = ensemble_model(tuned_model, optimize="F1",
fold=K_FOLDS)
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.8593 | 0.8879 | 0.9360 | 0.8837 | 0.9091 | 0.5987 | 0.6036 |
1 | 0.8444 | 0.8363 | 0.9208 | 0.8774 | 0.8986 | 0.5660 | 0.5691 |
2 | 0.8556 | 0.8987 | 0.9356 | 0.8791 | 0.9065 | 0.5908 | 0.5963 |
3 | 0.8370 | 0.8463 | 0.9505 | 0.8496 | 0.8972 | 0.5102 | 0.5295 |
4 | 0.8550 | 0.8693 | 0.9257 | 0.8863 | 0.9056 | 0.5942 | 0.5968 |
Mean | 0.8503 | 0.8677 | 0.9337 | 0.8752 | 0.9034 | 0.5720 | 0.5791 |
SD | 0.0083 | 0.0237 | 0.0102 | 0.0132 | 0.0047 | 0.0329 | 0.0275 |
And now a Boosting Ensemble.
boosted_model = ensemble_model(tuned_model, optimize="F1",
fold=K_FOLDS, method="Boosting")
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.7519 | 0.5000 | 1.0000 | 0.7519 | 0.8584 | 0.0000 | 0.0000 |
1 | 0.7481 | 0.5000 | 1.0000 | 0.7481 | 0.8559 | 0.0000 | 0.0000 |
2 | 0.7481 | 0.5000 | 1.0000 | 0.7481 | 0.8559 | 0.0000 | 0.0000 |
3 | 0.7481 | 0.5000 | 1.0000 | 0.7481 | 0.8559 | 0.0000 | 0.0000 |
4 | 0.7509 | 0.5000 | 1.0000 | 0.7509 | 0.8577 | 0.0000 | 0.0000 |
Mean | 0.7494 | 0.5000 | 1.0000 | 0.7494 | 0.8568 | 0.0000 | 0.0000 |
SD | 0.0016 | 0.0000 | 0.0000 | 0.0016 | 0.0011 | 0.0000 | 0.0000 |
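Those boosted scores are the signature of a degenerate always-positive classifier: with roughly 75% positive prevalence, predicting Yes for everything yields exactly this metric pattern. A quick arithmetic check:

```python
# An always-"Yes" classifier gets: accuracy = prevalence,
# recall = 1.0, AUC = 0.5, and F1 = 2*p*r / (p + r)
# with precision equal to the prevalence.
prevalence = 0.75
precision, recall = prevalence, 1.0
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8571, matching the fold scores above
```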
The Bagged Model performs best, so it is saved as the best model and used to predict on the test set.
best_model = bagged_model
predict_model(best_model);
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|---|
0 | Bagging Classifier | 0.8601 | 0.8719 | 0.924 | 0.8931 | 0.9083 | 0.6141 | 0.6157 |
Let's check the model hyperparameters.
plot_model(best_model, plot="parameter")
Parameters | |
---|---|
base_estimator__objective | binary:logistic |
base_estimator__base_score | 0.5 |
base_estimator__booster | gbtree |
base_estimator__colsample_bylevel | 1 |
base_estimator__colsample_bynode | 1 |
base_estimator__colsample_bytree | 1 |
base_estimator__gamma | 0 |
base_estimator__gpu_id | -1 |
base_estimator__importance_type | gain |
base_estimator__interaction_constraints | |
base_estimator__learning_rate | 0.3 |
base_estimator__max_delta_step | 0 |
base_estimator__max_depth | 6 |
base_estimator__min_child_weight | 1 |
base_estimator__missing | NaN |
base_estimator__monotone_constraints | () |
base_estimator__n_estimators | 100 |
base_estimator__n_jobs | -1 |
base_estimator__num_parallel_tree | 1 |
base_estimator__random_state | 42 |
base_estimator__reg_alpha | 0 |
base_estimator__reg_lambda | 1 |
base_estimator__scale_pos_weight | 1 |
base_estimator__subsample | 1 |
base_estimator__tree_method | exact |
base_estimator__validate_parameters | 1 |
base_estimator__verbosity | 0 |
base_estimator | XGBClassifier(base_score=0.5, booster='gbtree'... |
bootstrap | True |
bootstrap_features | False |
max_features | 1 |
max_samples | 1 |
n_estimators | 10 |
n_jobs | -1 |
oob_score | False |
random_state | 42 |
verbose | 0 |
warm_start | False |
Now, plot the AUC Score.
plot_model(best_model, plot="auc")
The AUC Score for both classes was good, 0.87.
Now, the confusion matrix and class report.
plot_model(best_model, plot="confusion_matrix")
plot_model(best_model, plot="class_report")
The results are excellent for the True class (1) and good for the False class (0).
As the model fits the test data well, let's use all of it to fit a final model.
final_model = finalize_model(best_model)
To validate the model, let's look at the predictions on the unseen data, which was not included in the final model fit.
predict_unseen = predict_model(final_model, data=unseen);
score_unseen = test_score_report(unseen, predict_unseen)
print(score_unseen.to_string(index=False))
conf_mat(unseen, predict_unseen);
Accuracy       AUC    Recall  Precision  F1 Score
0.878505  0.780538  0.952381   0.898876  0.924855
So, the model predicted the unseen data well, with a high F1-Score showing a good balance between Recall and Precision.
The following conclusion can be drawn: description_y is important to the model's prediction performance.