Last Update: 08/02/2020
PyCaret Version: 2.0
Author: Alexandre Farias
Email: afarias@tuta.io
This task consists of comparing two sentences from the dataset and identifying whether both have the same meaning.
An Exploratory Data Analysis is performed to gain insights about the data, followed by Topic Modelling to extract the features, and then the classification step; these last two steps are done with the Python module PyCaret.
First, the required modules are imported; PyCaret is imported later to avoid conflicts between the experiments.
The dataset used in this work is Text Similarity, hosted on Kaggle. Only the train set is used, since the test set has few samples and lacks the labels needed to validate the model.
# Standard
import pandas as pd
# Plots
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objects as go
from plotly.subplots import make_subplots
init_notebook_mode(connected=True)
import seaborn as sns
# Sklearn tools
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import *
# PATH and setup
import os
os.chdir("..")
PATH = os.getcwd()+os.sep
RANDOM_SEED = 42
K_FOLDS = 5
The helper functions used for plotting, data sampling, and scoring.
# Dataset Sampling
def data_sampling(dataset, frac: float, random_seed: int):
    data_sampled_a = dataset.sample(frac=frac, random_state=random_seed)
    data_sampled_b = dataset.drop(data_sampled_a.index).\
        reset_index(drop=True)
    data_sampled_a.reset_index(drop=True, inplace=True)
    return data_sampled_a, data_sampled_b
# Pie Chart
def pie_plot(data, plot_title: str):
    trace = go.Pie(labels=data.index, values=data.values)
    layout = go.Layout(autosize=False,
                       title={"text": plot_title,
                              "y": 0.9,
                              "x": 0.5,
                              "xanchor": "center",
                              "yanchor": "top"})
    fig = go.Figure(data=trace, layout=layout)
    fig.update_traces(textfont_size=14,
                      marker=dict(line=dict(color="#000000", width=2)))
    fig.update_yaxes(automargin=True)
    iplot(fig)
# Histogram subplots
def histogram_subplot(dataset_a, dataset_b, feature_a: str,
                      feature_b: str, title: str,
                      title_a: str, title_b: str):
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=(title_a, title_b))
    fig.add_trace(go.Histogram(x=dataset_a[feature_a],
                               showlegend=False), row=1, col=1)
    fig.add_trace(go.Histogram(x=dataset_b[feature_b],
                               showlegend=False), row=1, col=2)
    fig.update_layout(template="simple_white")
    fig.update_layout(autosize=False,
                      title={"text": title,
                             "y": 0.9,
                             "x": 0.5,
                             "xanchor": "center",
                             "yanchor": "top"},
                      yaxis={"title": "<i>Frequency</i>"})
    fig.update_traces(marker=dict(line=dict(color="#000000", width=2)))
    fig.update_yaxes(automargin=True)
    iplot(fig)
# Calculate scores with Test/Unseen labeled data
def test_score_report(data_unseen, predict_unseen):
    le = LabelEncoder()
    data_unseen["Label"] = le.fit_transform(data_unseen.same_security.values)
    data_unseen["Label"] = data_unseen["Label"].astype(int)
    accuracy = accuracy_score(data_unseen["Label"], predict_unseen["Label"])
    roc_auc = roc_auc_score(data_unseen["Label"], predict_unseen["Label"])
    precision = precision_score(data_unseen["Label"], predict_unseen["Label"])
    recall = recall_score(data_unseen["Label"], predict_unseen["Label"])
    f1 = f1_score(data_unseen["Label"], predict_unseen["Label"])
    df_unseen = pd.DataFrame({
        "Accuracy": [accuracy],
        "AUC": [roc_auc],
        "Recall": [recall],
        "Precision": [precision],
        "F1 Score": [f1]
    })
    return df_unseen
# Confusion Matrix
def conf_mat(data_unseen, predict_unseen):
    unique_label = data_unseen["Label"].unique()
    cmtx = pd.DataFrame(
        confusion_matrix(data_unseen["Label"],
                         predict_unseen["Label"], labels=unique_label),
        index=['{:}'.format(x) for x in unique_label],
        columns=['{:}'.format(x) for x in unique_label])
    ax = sns.heatmap(cmtx, annot=True, fmt="d", cmap="YlGnBu")
    # confusion_matrix places true labels on rows, predictions on columns
    ax.set_ylabel('Target')
    ax.set_xlabel('Predicted')
    ax.set_title("Predict Unseen Confusion Matrix", size=14)
The data is loaded into a DataFrame.
data = pd.read_csv(PATH+"data"+os.sep+"train.csv")
data.head(3)
Unnamed: 0 | description_x | description_y | ticker_x | ticker_y | same_security | |
---|---|---|---|---|---|---|
0 | 0 | first trust dow jones internet | first trust dj internet idx | FDN | FDN | True |
1 | 1 | schwab intl large company index etf | schwab strategic tr fundamental intl large co ... | FNDF | FNDF | True |
2 | 2 | vanguard small cap index adm | vanguard small-cap index fund inst | VSMAX | VSCIX | False |
A check for duplicated samples.
data.drop_duplicates(inplace=True, keep='first')
There are no duplicated samples in the dataset.
Time to drop the features that will not help with the prediction and get some information about the data.
data.drop(columns=['Unnamed: 0', 'ticker_x', 'ticker_y'],
inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2142 entries, 0 to 2141
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   description_x  2142 non-null   object
 1   description_y  2142 non-null   object
 2   same_security  2142 non-null   bool
dtypes: bool(1), object(2)
memory usage: 52.3+ KB
And check for missing values.
data.isnull().sum()
description_x    0
description_y    0
same_security    0
dtype: int64
The dataset doesn't contain any missing values, which is very good! A simple way to compare the sentences is to check their character lengths, so let's create these features for each sentence.
data['len_x'] = data['description_x'].apply(len)
data['len_y'] = data['description_y'].apply(len)
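Beyond raw character lengths, a simple token-overlap measure could be added the same way; the jaccard_similarity helper below is only an illustration, not part of this notebook's pipeline:

```python
# Hypothetical extra feature: Jaccard similarity between the token sets
# of the two descriptions (illustration only, not used in this notebook).
def jaccard_similarity(a: str, b: str) -> float:
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Pair from the dataset's first row: 3 shared tokens out of 7 total
score = jaccard_similarity("first trust dow jones internet",
                           "first trust dj internet idx")
```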
Let's check the distribution of the target feature, same_security.
series_aux = data['same_security'].value_counts()
pie_plot(series_aux, 'Security Values')
It's a case of imbalanced classes, which will be handled by evaluating the models with the F1-Score. I will not use any resampling technique, such as random oversampling, SMOTE, etc.
A good insight for this problem is to look at the distribution of the character length of each sentence per security value, starting with the True values.
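To see why plain accuracy would be misleading here, a small sketch with scikit-learn's metrics on toy labels with the same rough 75/25 imbalance (the always-majority predictor is a hypothetical baseline):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with a ~75/25 imbalance like same_security.
# A degenerate predictor that always answers the majority class
# looks fine on accuracy but scores zero F1 on the minority class.
y_true = [1] * 75 + [0] * 25
y_pred = [1] * 100  # hypothetical always-majority predictor

acc = accuracy_score(y_true, y_pred)                 # 0.75
f1_minority = f1_score(y_true, y_pred, pos_label=0)  # 0.0
```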
data_true = data.query('same_security == True')
histogram_subplot(data_true, data_true, 'len_x', 'len_y',
'<b>Character length distribution for True Security',
'Description X Length', 'Description Y Length')
Sentences labeled True have a similar distribution, even in the frequency of the values; for now, it seems a good idea to use the character length as a feature.
To test this idea further, let's check it for the False values.
data_false = data.query('same_security == False')
histogram_subplot(data_false, data_false, 'len_x', 'len_y',
'<b>Character length distribution for False Security',
'Description X Length', 'Description Y Length')
And for the False values the distribution of each sentence is different, so I will feed both character length features to the model.
PyCaret has an NLP module that can automate most of the boring stuff, like lowercasing, stop-word removal, stemming, etc. So, a good amount of this part is just setting PyCaret up to run.
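As a rough illustration of what that preprocessing does (the tiny stop-word list below is a made-up stand-in, not PyCaret's actual pipeline, which also stems words, builds bigrams, etc.):

```python
# Stand-in preprocessing: lowercase the text and drop stop words.
STOP_WORDS = {"the", "a", "an", "of", "for", "and"}

def naive_preprocess(text: str) -> list:
    return [t for t in text.lower().split() if t not in STOP_WORDS]

cleaned = naive_preprocess("Index Fund of the Vanguard Group")
```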
Let's import the module.
from pycaret.nlp import *
The PyCaret setup is simple: just pass the data, the text column to process, and a random seed for the session.
It's possible to add custom stop words, but I will use the default from PyCaret.
Let's start with the text from description_x.
exp_x = setup(data=data, target='description_x', session_id=RANDOM_SEED)
Description | Value |
---|---|
session_id | 42 |
Documents | 2142 |
Vocab Size | 310 |
Custom Stopwords | False |
There are 310 words in the vocabulary for description_x.
To get the features for the model, Topic Modelling is done with PyCaret; the model (an LDA model) is tuned to find the best number of topics.
tuned_model_x = tune_model(model='lda', supervised_target='same_security')
Best Model: Latent Dirichlet Allocation | # Topics: 2 | Accuracy : 0.8697
Based on the F1-Score, the model doesn't show a high variation in the score, so I will use 4 topics, to keep it simple.
Time to create an LDA model with 4 topics and see some information about the topic-modelled data.
You can click on each box to see information like word frequencies, a word cloud, etc.
model_x = create_model('lda', num_topics=4)
The topic weights and other information are assigned to the documents for later use in the classification task.
model_x_results = assign_model(model_x)
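The topic-weight features that assign_model produces can be illustrated with scikit-learn's LDA on a few toy descriptions (PyCaret uses its own topic-modelling backend; this sketch only shows the shape of the resulting features):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus mimicking the fund descriptions
docs = [
    "first trust dow jones internet",
    "first trust dj internet idx",
    "vanguard small cap index adm",
    "vanguard small cap index fund inst",
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=4, random_state=42)
topic_weights = lda.fit_transform(counts)

# One row of 4 topic weights per document, each row summing to ~1,
# analogous to the Topic_0..Topic_3 columns assigned to the data.
print(topic_weights.shape)  # (4, 4)
```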
Now, all the steps are repeated for description_y.
exp_y = setup(data=data, target='description_y', session_id=RANDOM_SEED)
Description | Value |
---|---|
session_id | 42 |
Documents | 2142 |
Vocab Size | 310 |
Custom Stopwords | False |
Again, 310 words are present in the vocabulary for description_y.
tuned_model_y = tune_model(model='lda', supervised_target='same_security')
Best Model: Latent Dirichlet Allocation | # Topics: 64 | Accuracy : 0.8441
The result of the tuned model is similar to before, so 4 topics are used again.
model_y = create_model('lda', num_topics=4)
And assign the topics to the documents.
model_y_results = assign_model(model_y)
Topic Modelling is done, and it's time for the classification.
To start, the unused columns are dropped and the results are concatenated into one dataframe.
data_topics = model_x_results.drop(columns=['description_x', 'description_y',
'Dominant_Topic', 'Perc_Dominant_Topic'])
data_topics['Topic_0_y'] = model_y_results['Topic_0']
data_topics['Topic_1_y'] = model_y_results['Topic_1']
data_topics['Topic_2_y'] = model_y_results['Topic_2']
data_topics['Topic_3_y'] = model_y_results['Topic_3']
To avoid errors with PyCaret on the label feature, same_security is converted to object and the labels are renamed to Yes for True and No for False.
10% of the data is sampled to be used as unseen data to validate the final model.
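The 90/10 split performed by the data_sampling helper boils down to a pandas sample/drop pair; a quick sketch on a toy frame:

```python
import pandas as pd

# 90% of the rows are sampled for training, the remaining
# 10% are held out as unseen data, with no overlap.
toy = pd.DataFrame({"x": range(100)})
sampled = toy.sample(frac=0.9, random_state=42)
held_out = toy.drop(sampled.index).reset_index(drop=True)

print(len(sampled), len(held_out))  # 90 10
```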
from pycaret.classification import *
data_topics['same_security'] = data_topics['same_security'].astype('str')
data_topics_dict = {'True' : 'Yes', 'False' : 'No'}
data_topics['same_security'] = data_topics['same_security'].replace(data_topics_dict)
train, unseen = data_sampling(data_topics, 0.9, RANDOM_SEED)
data_topics.head(5)
same_security | len_x | len_y | Topic_0 | Topic_1 | Topic_2 | Topic_3 | Topic_0_y | Topic_1_y | Topic_2_y | Topic_3_y | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Yes | 30 | 27 | 0.272233 | 0.162907 | 0.156305 | 0.408555 | 0.056694 | 0.123293 | 0.066383 | 0.753630 |
1 | Yes | 35 | 55 | 0.088963 | 0.053199 | 0.219810 | 0.638028 | 0.460700 | 0.177548 | 0.189938 | 0.171814 |
2 | No | 28 | 34 | 0.387918 | 0.080165 | 0.076917 | 0.455000 | 0.056584 | 0.745060 | 0.066187 | 0.132170 |
3 | Yes | 53 | 60 | 0.272233 | 0.162907 | 0.156305 | 0.408555 | 0.034559 | 0.320199 | 0.578683 | 0.066559 |
4 | Yes | 16 | 9 | 0.272233 | 0.162907 | 0.156305 | 0.408555 | 0.161175 | 0.347640 | 0.188525 | 0.302660 |
The same setup as before, but now PyCaret splits the dataset into 70% for Train and 30% for Test.
No feature transformation, scaling, or normalization is used.
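PyCaret's internal split is roughly equivalent to scikit-learn's train_test_split with train_size=0.7; a sketch reproducing the 1349/579 sizes shown in the setup table below (PyCaret also encodes the target and builds a preprocessing pipeline, so this mirrors only the proportions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A frame with as many rows as the train set (1928)
frame = pd.DataFrame({"feature": range(1928)})
train_part, test_part = train_test_split(frame, train_size=0.7,
                                         random_state=42)

print(len(train_part), len(test_part))  # 1349 579
```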
exp_clf = setup(data=train,
target='same_security',
session_id=RANDOM_SEED,
experiment_name="text-similarity",
log_experiment=True,
silent=True)
Setup Succesfully Completed!
Description | Value | |
---|---|---|
0 | session_id | 42 |
1 | Target Type | Binary |
2 | Label Encoded | No: 0, Yes: 1 |
3 | Original Data | (1928, 11) |
4 | Missing Values | False |
5 | Numeric Features | 10 |
6 | Categorical Features | 0 |
7 | Ordinal Features | False |
8 | High Cardinality Features | False |
9 | High Cardinality Method | None |
10 | Sampled Data | (1928, 11) |
11 | Transformed Train Set | (1349, 10) |
12 | Transformed Test Set | (579, 10) |
13 | Numeric Imputer | mean |
14 | Categorical Imputer | constant |
15 | Normalize | False |
16 | Normalize Method | None |
17 | Transformation | False |
18 | Transformation Method | None |
19 | PCA | False |
20 | PCA Method | None |
21 | PCA Components | None |
22 | Ignore Low Variance | False |
23 | Combine Rare Levels | False |
24 | Rare Level Threshold | None |
25 | Numeric Binning | False |
26 | Remove Outliers | False |
27 | Outliers Threshold | None |
28 | Remove Multicollinearity | False |
29 | Multicollinearity Threshold | None |
30 | Clustering | False |
31 | Clustering Iteration | None |
32 | Polynomial Features | False |
33 | Polynomial Degree | None |
34 | Trignometry Features | False |
35 | Polynomial Threshold | None |
36 | Group Features | False |
37 | Feature Selection | False |
38 | Features Selection Threshold | None |
39 | Feature Interaction | False |
40 | Feature Ratio | False |
41 | Interaction Threshold | None |
42 | Fix Imbalance | False |
43 | Fix Imbalance Method | SMOTE |
Compare the models sorted by F1-Score to get the best.
top_model = compare_models(sort='F1',
fold=K_FOLDS,
n_select=3)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
---|---|---|---|---|---|---|---|---|---|
0 | CatBoost Classifier | 0.8480 | 0.8613 | 0.9397 | 0.8687 | 0.9026 | 0.5585 | 0.5692 | 7.4354 |
1 | Extreme Gradient Boosting | 0.8480 | 0.8696 | 0.9159 | 0.8857 | 0.9003 | 0.5804 | 0.5833 | 0.1665 |
2 | Light Gradient Boosting Machine | 0.8436 | 0.8665 | 0.9179 | 0.8791 | 0.8979 | 0.5638 | 0.5681 | 0.2272 |
3 | Gradient Boosting Classifier | 0.8384 | 0.8578 | 0.9367 | 0.8602 | 0.8967 | 0.5283 | 0.5397 | 0.2618 |
4 | Extra Trees Classifier | 0.8221 | 0.8237 | 0.9021 | 0.8664 | 0.8838 | 0.5051 | 0.5076 | 0.3498 |
5 | Ada Boost Classifier | 0.8073 | 0.8133 | 0.9179 | 0.8398 | 0.8770 | 0.4351 | 0.4454 | 0.1626 |
6 | Random Forest Classifier | 0.8028 | 0.8187 | 0.8635 | 0.8722 | 0.8675 | 0.4810 | 0.4821 | 0.1271 |
7 | Ridge Classifier | 0.7621 | 0.0000 | 0.9634 | 0.7742 | 0.8584 | 0.1636 | 0.2199 | 0.0085 |
8 | Logistic Regression | 0.7583 | 0.7011 | 0.9594 | 0.7728 | 0.8561 | 0.1538 | 0.2025 | 0.0924 |
9 | Linear Discriminant Analysis | 0.7554 | 0.6951 | 0.9376 | 0.7802 | 0.8516 | 0.1859 | 0.2202 | 0.0165 |
10 | Decision Tree Classifier | 0.7687 | 0.7171 | 0.8230 | 0.8626 | 0.8421 | 0.4105 | 0.4129 | 0.0116 |
11 | K Neighbors Classifier | 0.7398 | 0.6488 | 0.8843 | 0.7927 | 0.8359 | 0.2162 | 0.2237 | 0.0083 |
12 | Quadratic Discriminant Analysis | 0.7494 | 0.7744 | 0.8249 | 0.8383 | 0.8315 | 0.3432 | 0.3436 | 0.0078 |
13 | Naive Bayes | 0.7324 | 0.7417 | 0.8268 | 0.8177 | 0.8220 | 0.2817 | 0.2825 | 0.0054 |
14 | SVM - Linear Kernel | 0.6114 | 0.0000 | 0.6576 | 0.8290 | 0.6588 | 0.1025 | 0.1357 | 0.0150 |
The best model by F1-Score is the CatBoost Classifier, but the difference from XGBoost is minimal and XGBoost got better scores on the other metrics, so let's take XGBoost as the base model. Tune the model to see if we can get any improvement.
tuned_model = tune_model(top_model[1], optimize='F1',
choose_better=True, fold=K_FOLDS);
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.8444 | 0.8865 | 0.9163 | 0.8815 | 0.8986 | 0.5658 | 0.5677 |
1 | 0.8444 | 0.8241 | 0.9109 | 0.8846 | 0.8976 | 0.5748 | 0.5758 |
2 | 0.8519 | 0.8944 | 0.9109 | 0.8932 | 0.9020 | 0.5990 | 0.5995 |
3 | 0.8407 | 0.8421 | 0.9406 | 0.8597 | 0.8983 | 0.5342 | 0.5459 |
4 | 0.8216 | 0.8442 | 0.9059 | 0.8632 | 0.8841 | 0.4979 | 0.5007 |
Mean | 0.8406 | 0.8583 | 0.9169 | 0.8765 | 0.8961 | 0.5543 | 0.5579 |
SD | 0.0102 | 0.0273 | 0.0123 | 0.0129 | 0.0062 | 0.0350 | 0.0334 |
The tuned model doesn't bring any improvement, so the base model remains the best.
Time to build a Bagging Ensemble.
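Under the hood, ensemble_model with the default method wraps the base estimator in scikit-learn's BaggingClassifier (as the hyperparameter table further down confirms); a minimal sketch with a DecisionTreeClassifier standing in for XGBoost:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)
# 10 copies of the base estimator, each fit on a bootstrap sample,
# combined by voting
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                           n_estimators=10, random_state=42)
bagged.fit(X, y)

print(len(bagged.estimators_))  # 10
```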
bagged_model = ensemble_model(tuned_model, optimize="F1",
fold=K_FOLDS)
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.8593 | 0.8879 | 0.9360 | 0.8837 | 0.9091 | 0.5987 | 0.6036 |
1 | 0.8444 | 0.8363 | 0.9208 | 0.8774 | 0.8986 | 0.5660 | 0.5691 |
2 | 0.8556 | 0.8987 | 0.9356 | 0.8791 | 0.9065 | 0.5908 | 0.5963 |
3 | 0.8370 | 0.8463 | 0.9505 | 0.8496 | 0.8972 | 0.5102 | 0.5295 |
4 | 0.8550 | 0.8693 | 0.9257 | 0.8863 | 0.9056 | 0.5942 | 0.5968 |
Mean | 0.8503 | 0.8677 | 0.9337 | 0.8752 | 0.9034 | 0.5720 | 0.5791 |
SD | 0.0083 | 0.0237 | 0.0102 | 0.0132 | 0.0047 | 0.0329 | 0.0275 |
And now a Boosting Ensemble.
boosted_model = ensemble_model(tuned_model, optimize="F1",
fold=K_FOLDS, method="Boosting")
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.7519 | 0.5000 | 1.0000 | 0.7519 | 0.8584 | 0.0000 | 0.0000 |
1 | 0.7481 | 0.5000 | 1.0000 | 0.7481 | 0.8559 | 0.0000 | 0.0000 |
2 | 0.7481 | 0.5000 | 1.0000 | 0.7481 | 0.8559 | 0.0000 | 0.0000 |
3 | 0.7481 | 0.5000 | 1.0000 | 0.7481 | 0.8559 | 0.0000 | 0.0000 |
4 | 0.7509 | 0.5000 | 1.0000 | 0.7509 | 0.8577 | 0.0000 | 0.0000 |
Mean | 0.7494 | 0.5000 | 1.0000 | 0.7494 | 0.8568 | 0.0000 | 0.0000 |
SD | 0.0016 | 0.0000 | 0.0000 | 0.0016 | 0.0011 | 0.0000 | 0.0000 |
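Those boosted scores are the signature of a degenerate always-positive classifier: with roughly 75% positive prevalence, predicting Yes for everything yields exactly this metric pattern. A quick arithmetic check:

```python
# An always-"Yes" classifier gets: accuracy = prevalence,
# recall = 1.0, AUC = 0.5, and F1 = 2*p*r / (p + r)
# with precision equal to the prevalence.
prevalence = 0.75
precision, recall = prevalence, 1.0
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8571, matching the fold scores above
```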
The Bagged Model performs best, so it is saved as the best model and used to predict on the test set.
best_model = bagged_model
predict_model(best_model);
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|---|
0 | Bagging Classifier | 0.8601 | 0.8719 | 0.924 | 0.8931 | 0.9083 | 0.6141 | 0.6157 |
Let's check the model hyperparameters.
plot_model(best_model, plot="parameter")
Parameters | |
---|---|
base_estimator__objective | binary:logistic |
base_estimator__base_score | 0.5 |
base_estimator__booster | gbtree |
base_estimator__colsample_bylevel | 1 |
base_estimator__colsample_bynode | 1 |
base_estimator__colsample_bytree | 1 |
base_estimator__gamma | 0 |
base_estimator__gpu_id | -1 |
base_estimator__importance_type | gain |
base_estimator__interaction_constraints | |
base_estimator__learning_rate | 0.3 |
base_estimator__max_delta_step | 0 |
base_estimator__max_depth | 6 |
base_estimator__min_child_weight | 1 |
base_estimator__missing | NaN |
base_estimator__monotone_constraints | () |
base_estimator__n_estimators | 100 |
base_estimator__n_jobs | -1 |
base_estimator__num_parallel_tree | 1 |
base_estimator__random_state | 42 |
base_estimator__reg_alpha | 0 |
base_estimator__reg_lambda | 1 |
base_estimator__scale_pos_weight | 1 |
base_estimator__subsample | 1 |
base_estimator__tree_method | exact |
base_estimator__validate_parameters | 1 |
base_estimator__verbosity | 0 |
base_estimator | XGBClassifier(base_score=0.5, booster='gbtree'... |
bootstrap | True |
bootstrap_features | False |
max_features | 1 |
max_samples | 1 |
n_estimators | 10 |
n_jobs | -1 |
oob_score | False |
random_state | 42 |
verbose | 0 |
warm_start | False |
Now, plot the AUC Score.
plot_model(best_model, plot="auc")
The AUC Score for both classes was good, 0.87.
Now, the confusion matrix and class report.
plot_model(best_model, plot="confusion_matrix")
plot_model(best_model, plot="class_report")
The results are excellent for the True class (1) and good for the False class (0).
As the model fits the test data well, let's use all of it to fit a final model.
final_model = finalize_model(best_model)
To validate the model, let's look at the predictions on the unseen data, which was not included in the final model fit.
predict_unseen = predict_model(final_model, data=unseen);
score_unseen = test_score_report(unseen, predict_unseen)
print(score_unseen.to_string(index=False))
conf_mat(unseen, predict_unseen);
Accuracy       AUC    Recall  Precision  F1 Score
0.878505  0.780538  0.952381   0.898876  0.924855
So, the model predicted the unseen data well, with a high F1-Score showing a good balance between Recall and Precision.
The following conclusion can be drawn: description_y is important to the model's prediction performance.