Details :¶

Name : Sumit Shamlal Chaure

Batch : 10

Program : Data Science with Python By SkillAcademy

Assignment : Machine Learning Major Assignment

Topics : ML, Model Building, Test & Training

File Downloads :

My Reports & Files :

Note : Certain markdown linkings like page links wont work on google colab/Jupyter but the same on github or vs-code would take you to respective breakpoints as they have advanced markdown support for inline tagging and MD linkings.

Steps Involved in Machine Learning Projects

Understanding the Problem Statement.
Data Collection (From Sources/API/Files).
Data Checking for analysis.
Exploratory Data Analysis (To get insights of dataset & problem)
Data Pre-Processing.
Model Selection & evaluation.
Model Training.
Choosing the Best Model for Best results.
Testing with new data & checking the factors such as recall, accuracy & precision.
Model Deployment
User testing & benchmarking etc.
Reiterating the steps with new data and building more accurate models.

Major Assignment (Machine Learning)¶

1. Understanding the Problem & Dataset¶

Use the Oil Spill Dataset and solve the following question by using the dataset.

About Dataset

The dataset was developed by starting with satellite images of the ocean, some of which contain an oil spill and some that do not.

Images were split into sections and processed using computer vision algorithms to provide a vector of features to describe the contents of the image section or patch.

The task is, given a vector that describes the contents of a patch of a satellite image, then predicts whether the patch contains an oil spill or not, e.g. from the illegal or accidental dumping of oil in the ocean.

There are two classes and the goal is to distinguish between spill and non-spill using the features of a given ocean patch.

● Non-Spill: negative case, or majority class.

● Oil Spill: positive case, or minority class.

There are a total of 50 Columns in the Dataset, the output column is named as a target.

Questions

Download the Oil Spill Dataset and perform Data cleaning and Data Pre-Processing if Necessary.
Use various methods such as Handling null values, One-Hot Encoding, Imputation, and Scaling of Data Pre-Processing where necessary.
Derive some insights from the dataset.
Apply various Machine Learning techniques to predict the output in the target column, make use of Bagging and Ensemble as required, and find the best model by evaluating the model using Model evaluation techniques.
Save the best model and Load the model.
Take the original dataset and make another dataset by randomly picking 20 data points from the oil spill dataset and applying the saved model to the same.

2. Module Imports¶

In [ ]:

import pandas as pd             # for data cleaning and data pre-processing, CSV file I/O,etc
import numpy as np              # linear algebra & for mathematical computation
import matplotlib.pyplot as plt # for visualization
%matplotlib inline
import seaborn as sns           # for visualization
from collections import Counter # to count occurrences
from tabulate import tabulate   # to make tables for results

import warnings                 # for warning removals in code output
warnings.filterwarnings('ignore')

# Scalers & Encoders
from sklearn.preprocessing import StandardScaler, LabelEncoder
#train-test split
from sklearn.model_selection import train_test_split
# Metrics
from sklearn.metrics import (mean_squared_error, r2_score,confusion_matrix, classification_report, accuracy_score,roc_auc_score, roc_curve, auc)
# Model Libraries
from sklearn.linear_model import (LinearRegression, LogisticRegression)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (RandomForestClassifier,BaggingClassifier,AdaBoostClassifier,GradientBoostingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import pickle           #to save and load model files as pkl file

Q1) Download the Oil Spill Dataset and perform Data cleaning and Data Pre-Processing if Necessary. (Data cleaning & Processing Answer Continued in Q2)¶

2.1) Importing the dataset (With error handling)¶

In [ ]:

# 2.1) Importing the dataset (With error handling)
# If you want to upload the dataset directly (Since on Google Colab it will be lost on re-run) - uncomment the below 2 line code and run
from google.colab import files
uploaded = files.upload()

file_path = "oil_spill.csv"
file_name = file_path.split("/")[-1]

try:
    # Reading the CSV file into a Pandas DataFrame
    df = pd.read_csv(file_path)
    # Store the filename as an attribute in the DataFrame
    df.file_name = file_name
    print(f"\n '{df.file_name}' loaded successfully.")

# Exception to check if the file has some error like no file at the path, etc.
except FileNotFoundError:
    print(f"Error: '{file_name}' not found at the specified location {file_path}.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.

Saving oil_spill.csv to oil_spill.csv

 'oil_spill.csv' loaded successfully.

In [ ]:

df

Out[ ]:

	f_1	f_2	f_3	f_4	f_5	f_6	f_7	f_8	f_9	f_10	...	f_41	f_42	f_43	f_44	f_45	f_46	f_47	f_48	f_49	target
0	1	2558	1506.09	456.63	90	6395000	40.88	7.89	29780.0	0.19	...	2850.00	1000.00	763.16	135.46	3.73	0	33243.19	65.74	7.95	1
1	2	22325	79.11	841.03	180	55812500	51.11	1.21	61900.0	0.02	...	5750.00	11500.00	9593.48	1648.80	0.60	0	51572.04	65.73	6.26	0
2	3	115	1449.85	608.43	88	287500	40.42	7.34	3340.0	0.18	...	1400.00	250.00	150.00	45.13	9.33	1	31692.84	65.81	7.84	1
3	4	1201	1562.53	295.65	66	3002500	42.40	7.97	18030.0	0.19	...	6041.52	761.58	453.21	144.97	13.33	1	37696.21	65.67	8.07	1
4	5	312	950.27	440.86	37	780000	41.43	7.03	3350.0	0.17	...	1320.04	710.63	512.54	109.16	2.58	0	29038.17	65.66	7.35	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
932	200	12	92.42	364.42	135	97200	59.42	10.34	884.0	0.17	...	381.84	254.56	84.85	146.97	4.50	0	2593.50	65.85	6.39	0
933	201	11	98.82	248.64	159	89100	59.64	10.18	831.0	0.17	...	284.60	180.00	150.00	51.96	1.90	0	4361.25	65.70	6.53	0
934	202	14	25.14	428.86	24	113400	60.14	17.94	847.0	0.30	...	402.49	180.00	180.00	0.00	2.24	0	2153.05	65.91	6.12	0
935	203	10	96.00	451.30	68	81000	59.90	15.01	831.0	0.25	...	402.49	180.00	90.00	73.48	4.47	0	2421.43	65.97	6.32	0
936	204	11	7.73	235.73	135	89100	61.82	12.24	831.0	0.20	...	254.56	254.56	127.28	180.00	2.00	0	3782.68	65.65	6.26	0

937 rows × 50 columns

Insights: As the assignment is part of major submission i tried to add few exception handling steps to verify things like file not found error. (If we try to import any empty csv in the dataframe it will prompt an exception suggesting no file found at the specified location.)

I have added an extra commented code for google colab imports directly to upload the csv file from local drive and then start the further process.

To use on google colab uncomment the start line of code to import the file on google colab drive instance.

I have not added extraneous code to get the filename from uploaded files and save them directly instead hardcoded/kept it so that even if code is run offline it will work perfectly fine.

Note: As Q.2 needs to do data pre-processing i have added the data cleaning steps in that section directly.

Q2) Use various methods such as Handling null values, One-Hot Encoding,Imputation, and Scaling of Data Pre-Processing where necessary. (Q1 Continued below)¶

2.2) Showing Basic Dataset Information¶

In [ ]:

# Check the shape of the DataFrame
print("\nShape of the DataFrame:")
print(df.shape)
print(df.size)
num_rows, num_columns = df.shape
print(f"Rows: {num_rows}, Columns: {num_columns}")

# Display information about the dataset
print(f"\nDataset information for {df.file_name}:")
df.head(3)
df.tail(3)
print("\nDataset information:")
print(df.info())

Shape of the DataFrame:
(937, 50)
46850
Rows: 937, Columns: 50

Dataset information for oil_spill.csv:

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 937 entries, 0 to 936
Data columns (total 50 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   f_1     937 non-null    int64  
 1   f_2     937 non-null    int64  
 2   f_3     937 non-null    float64
 3   f_4     937 non-null    float64
 4   f_5     937 non-null    int64  
 5   f_6     937 non-null    int64  
 6   f_7     937 non-null    float64
 7   f_8     937 non-null    float64
 8   f_9     937 non-null    float64
 9   f_10    937 non-null    float64
 10  f_11    937 non-null    float64
 11  f_12    937 non-null    float64
 12  f_13    937 non-null    float64
 13  f_14    937 non-null    float64
 14  f_15    937 non-null    float64
 15  f_16    937 non-null    float64
 16  f_17    937 non-null    float64
 17  f_18    937 non-null    float64
 18  f_19    937 non-null    float64
 19  f_20    937 non-null    float64
 20  f_21    937 non-null    float64
 21  f_22    937 non-null    float64
 22  f_23    937 non-null    int64  
 23  f_24    937 non-null    float64
 24  f_25    937 non-null    float64
 25  f_26    937 non-null    float64
 26  f_27    937 non-null    float64
 27  f_28    937 non-null    float64
 28  f_29    937 non-null    float64
 29  f_30    937 non-null    float64
 30  f_31    937 non-null    float64
 31  f_32    937 non-null    float64
 32  f_33    937 non-null    float64
 33  f_34    937 non-null    float64
 34  f_35    937 non-null    int64  
 35  f_36    937 non-null    int64  
 36  f_37    937 non-null    float64
 37  f_38    937 non-null    float64
 38  f_39    937 non-null    int64  
 39  f_40    937 non-null    int64  
 40  f_41    937 non-null    float64
 41  f_42    937 non-null    float64
 42  f_43    937 non-null    float64
 43  f_44    937 non-null    float64
 44  f_45    937 non-null    float64
 45  f_46    937 non-null    int64  
 46  f_47    937 non-null    float64
 47  f_48    937 non-null    float64
 48  f_49    937 non-null    float64
 49  target  937 non-null    int64  
dtypes: float64(39), int64(11)
memory usage: 366.1 KB
None

Columns of the dataset¶

In [ ]:

# Display the columns & rows of dataset
print(f"The columns of our {file_name} dataframe\n")
print(df.columns)

The columns of our oil_spill.csv dataframe

Index(['f_1', 'f_2', 'f_3', 'f_4', 'f_5', 'f_6', 'f_7', 'f_8', 'f_9', 'f_10',
       'f_11', 'f_12', 'f_13', 'f_14', 'f_15', 'f_16', 'f_17', 'f_18', 'f_19',
       'f_20', 'f_21', 'f_22', 'f_23', 'f_24', 'f_25', 'f_26', 'f_27', 'f_28',
       'f_29', 'f_30', 'f_31', 'f_32', 'f_33', 'f_34', 'f_35', 'f_36', 'f_37',
       'f_38', 'f_39', 'f_40', 'f_41', 'f_42', 'f_43', 'f_44', 'f_45', 'f_46',
       'f_47', 'f_48', 'f_49', 'target'],
      dtype='object')

3) Data Checks to Perform¶

Check Missing Values.
Check Duplicates
Check Data Type
Check Unique Values
Check Data Statistics
Check Categorical Columns & Data

Missing Values/Null Check¶

In [ ]:

print("Missing values in the dataset:\n")
print(df.isnull().sum())
# NA value calculation
nullval = df.isna().sum()
nullval = nullval[nullval > 0]
print("\nSum of Missing values:\n", nullval)

Missing values in the dataset:

f_1       0
f_2       0
f_3       0
f_4       0
f_5       0
f_6       0
f_7       0
f_8       0
f_9       0
f_10      0
f_11      0
f_12      0
f_13      0
f_14      0
f_15      0
f_16      0
f_17      0
f_18      0
f_19      0
f_20      0
f_21      0
f_22      0
f_23      0
f_24      0
f_25      0
f_26      0
f_27      0
f_28      0
f_29      0
f_30      0
f_31      0
f_32      0
f_33      0
f_34      0
f_35      0
f_36      0
f_37      0
f_38      0
f_39      0
f_40      0
f_41      0
f_42      0
f_43      0
f_44      0
f_45      0
f_46      0
f_47      0
f_48      0
f_49      0
target    0
dtype: int64

Sum of Missing values:
 Series([], dtype: int64)

No Null values or NA are present in our dataset.

Duplicate Values Check¶

In [ ]:

print("\nChecking for duplicated values:\n")
print(df.duplicated())
print("\nSum of Duplicated Values in Dataframe :", df.duplicated().sum())

Checking for duplicated values:

0      False
1      False
2      False
3      False
4      False
       ...  
932    False
933    False
934    False
935    False
936    False
Length: 937, dtype: bool

Sum of Duplicated Values in Dataframe : 0

The sum of duplicated value comes to be zero indicating that no duplicates are present in the dataset.

In [ ]:

# Calculating the number of missing values or null values in df
total_missing_values = df.isnull().sum().sum()
print("The number of missing values/NA in dataframe :", total_missing_values)

# Calculate the total number of values in df (excluding the missing values)
total_values = df.size
print("Total number of values in dataframe :", total_values)

# percentage of missing values or null values in df
percentage_missing_values = (total_missing_values / total_values) * 100
print("Percentage of missing values or null values in df :",
      percentage_missing_values)

The number of missing values/NA in dataframe : 0
Total number of values in dataframe : 46850
Percentage of missing values or null values in df : 0.0

No Missing values are present in the dataset.

Check Value Counts¶

In [ ]:

print("Value counts of the dataset by datatypes")
df.dtypes.value_counts()

Value counts of the dataset by datatypes

Out[ ]:

float64    39
int64      11
dtype: int64

Unique Values Check¶

In [ ]:

print("Unique Value counts inside each columns")
df.nunique()

Unique Value counts inside each columns

Out[ ]:

f_1       238
f_2       297
f_3       927
f_4       933
f_5       179
f_6       375
f_7       820
f_8       618
f_9       561
f_10       57
f_11      577
f_12       59
f_13       73
f_14      107
f_15       53
f_16       91
f_17      893
f_18      810
f_19      170
f_20       53
f_21       68
f_22        9
f_23        1
f_24       92
f_25        9
f_26        8
f_27        9
f_28      308
f_29      447
f_30      392
f_31      107
f_32       42
f_33        4
f_34       45
f_35      141
f_36      110
f_37        3
f_38      758
f_39        9
f_40        9
f_41      388
f_42      220
f_43      644
f_44      649
f_45      499
f_46        2
f_47      937
f_48      169
f_49      286
target      2
dtype: int64

From the above unique value check many things can be observed which we will need to dive deep in further analysis.

The target column is the one which we need to operate on for our operations.

Datatype Check¶

In [ ]:

# Check data types of each column
print("Data types of each column:")
df.dtypes

Data types of each column:

Out[ ]:

f_1         int64
f_2         int64
f_3       float64
f_4       float64
f_5         int64
f_6         int64
f_7       float64
f_8       float64
f_9       float64
f_10      float64
f_11      float64
f_12      float64
f_13      float64
f_14      float64
f_15      float64
f_16      float64
f_17      float64
f_18      float64
f_19      float64
f_20      float64
f_21      float64
f_22      float64
f_23        int64
f_24      float64
f_25      float64
f_26      float64
f_27      float64
f_28      float64
f_29      float64
f_30      float64
f_31      float64
f_32      float64
f_33      float64
f_34      float64
f_35        int64
f_36        int64
f_37      float64
f_38      float64
f_39        int64
f_40        int64
f_41      float64
f_42      float64
f_43      float64
f_44      float64
f_45      float64
f_46        int64
f_47      float64
f_48      float64
f_49      float64
target      int64
dtype: object

Statistics Summary¶

In [ ]:

print("\nSummary statistics:\n")
print(df.describe())

Summary statistics:

              f_1           f_2          f_3          f_4         f_5  \
count  937.000000    937.000000   937.000000   937.000000  937.000000   
mean    81.588047    332.842049   698.707086   870.992209   84.121665   
std     64.976730   1931.938570   599.965577   522.799325   45.361771   
min      1.000000     10.000000     1.920000     1.000000    0.000000   
25%     31.000000     20.000000    85.270000   444.200000   54.000000   
50%     64.000000     65.000000   704.370000   761.280000   73.000000   
75%    124.000000    132.000000  1223.480000  1260.370000  117.000000   
max    352.000000  32389.000000  1893.080000  2724.570000  180.000000   

                f_6         f_7         f_8            f_9        f_10  ...  \
count  9.370000e+02  937.000000  937.000000     937.000000  937.000000  ...   
mean   7.696964e+05   43.242721    9.127887    3940.712914    0.221003  ...   
std    3.831151e+06   12.718404    3.588878    8167.427625    0.090316  ...   
min    7.031200e+04   21.240000    0.830000     667.000000    0.020000  ...   
25%    1.250000e+05   33.650000    6.750000    1371.000000    0.160000  ...   
50%    1.863000e+05   39.970000    8.200000    2090.000000    0.200000  ...   
75%    3.304680e+05   52.420000   10.760000    3435.000000    0.260000  ...   
max    7.131500e+07   82.640000   24.690000  160740.000000    0.740000  ...   

               f_41          f_42         f_43         f_44        f_45  \
count    937.000000    937.000000   937.000000   937.000000  937.000000   
mean     933.928677    427.565582   255.435902   106.112519    5.014002   
std     1001.681331    715.391648   534.306194   135.617708    5.029151   
min        0.000000      0.000000     0.000000     0.000000    0.000000   
25%      450.000000    180.000000    90.800000    50.120000    2.370000   
50%      685.420000    270.000000   161.650000    73.850000    3.850000   
75%     1053.420000    460.980000   265.510000   125.810000    6.320000   
max    11949.330000  11500.000000  9593.480000  1748.130000   76.630000   

             f_46          f_47        f_48        f_49      target  
count  937.000000    937.000000  937.000000  937.000000  937.000000  
mean     0.128068   7985.718004   61.694386    8.119723    0.043757  
std      0.334344   6854.504915   10.412807    2.908895    0.204662  
min      0.000000   2051.500000   35.950000    5.810000    0.000000  
25%      0.000000   3760.570000   65.720000    6.340000    0.000000  
50%      0.000000   5509.430000   65.930000    7.220000    0.000000  
75%      0.000000   9521.930000   66.130000    7.840000    0.000000  
max      1.000000  55128.460000   66.450000   15.440000    1.000000  

[8 rows x 50 columns]

Categorical Column Check¶

In [ ]:

df.select_dtypes(include="category")
categorical_columns = df.select_dtypes(include=["object"]).columns
print("\nCategorical columns:")
print(categorical_columns)

Categorical columns:
Index([], dtype='object')

Checking Descriptive Statistics¶

In [ ]:

print(f"The descriptive Stats for the {file_name} dataset:")
df.describe()

The descriptive Stats for the oil_spill.csv dataset:

Out[ ]:

	f_1	f_2	f_3	f_4	f_5	f_6	f_7	f_8	f_9	f_10	...	f_41	f_42	f_43	f_44	f_45	f_46	f_47	f_48	f_49	target
count	937.000000	937.000000	937.000000	937.000000	937.000000	9.370000e+02	937.000000	937.000000	937.000000	937.000000	...	937.000000	937.000000	937.000000	937.000000	937.000000	937.000000	937.000000	937.000000	937.000000	937.000000
mean	81.588047	332.842049	698.707086	870.992209	84.121665	7.696964e+05	43.242721	9.127887	3940.712914	0.221003	...	933.928677	427.565582	255.435902	106.112519	5.014002	0.128068	7985.718004	61.694386	8.119723	0.043757
std	64.976730	1931.938570	599.965577	522.799325	45.361771	3.831151e+06	12.718404	3.588878	8167.427625	0.090316	...	1001.681331	715.391648	534.306194	135.617708	5.029151	0.334344	6854.504915	10.412807	2.908895	0.204662
min	1.000000	10.000000	1.920000	1.000000	0.000000	7.031200e+04	21.240000	0.830000	667.000000	0.020000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2051.500000	35.950000	5.810000	0.000000
25%	31.000000	20.000000	85.270000	444.200000	54.000000	1.250000e+05	33.650000	6.750000	1371.000000	0.160000	...	450.000000	180.000000	90.800000	50.120000	2.370000	0.000000	3760.570000	65.720000	6.340000	0.000000
50%	64.000000	65.000000	704.370000	761.280000	73.000000	1.863000e+05	39.970000	8.200000	2090.000000	0.200000	...	685.420000	270.000000	161.650000	73.850000	3.850000	0.000000	5509.430000	65.930000	7.220000	0.000000
75%	124.000000	132.000000	1223.480000	1260.370000	117.000000	3.304680e+05	52.420000	10.760000	3435.000000	0.260000	...	1053.420000	460.980000	265.510000	125.810000	6.320000	0.000000	9521.930000	66.130000	7.840000	0.000000
max	352.000000	32389.000000	1893.080000	2724.570000	180.000000	7.131500e+07	82.640000	24.690000	160740.000000	0.740000	...	11949.330000	11500.000000	9593.480000	1748.130000	76.630000	1.000000	55128.460000	66.450000	15.440000	1.000000

8 rows × 50 columns

In [ ]:

print("Complete Stats of every column")
print(df.describe())

Complete Stats of every column
              f_1           f_2          f_3          f_4         f_5  \
count  937.000000    937.000000   937.000000   937.000000  937.000000   
mean    81.588047    332.842049   698.707086   870.992209   84.121665   
std     64.976730   1931.938570   599.965577   522.799325   45.361771   
min      1.000000     10.000000     1.920000     1.000000    0.000000   
25%     31.000000     20.000000    85.270000   444.200000   54.000000   
50%     64.000000     65.000000   704.370000   761.280000   73.000000   
75%    124.000000    132.000000  1223.480000  1260.370000  117.000000   
max    352.000000  32389.000000  1893.080000  2724.570000  180.000000   

                f_6         f_7         f_8            f_9        f_10  ...  \
count  9.370000e+02  937.000000  937.000000     937.000000  937.000000  ...   
mean   7.696964e+05   43.242721    9.127887    3940.712914    0.221003  ...   
std    3.831151e+06   12.718404    3.588878    8167.427625    0.090316  ...   
min    7.031200e+04   21.240000    0.830000     667.000000    0.020000  ...   
25%    1.250000e+05   33.650000    6.750000    1371.000000    0.160000  ...   
50%    1.863000e+05   39.970000    8.200000    2090.000000    0.200000  ...   
75%    3.304680e+05   52.420000   10.760000    3435.000000    0.260000  ...   
max    7.131500e+07   82.640000   24.690000  160740.000000    0.740000  ...   

               f_41          f_42         f_43         f_44        f_45  \
count    937.000000    937.000000   937.000000   937.000000  937.000000   
mean     933.928677    427.565582   255.435902   106.112519    5.014002   
std     1001.681331    715.391648   534.306194   135.617708    5.029151   
min        0.000000      0.000000     0.000000     0.000000    0.000000   
25%      450.000000    180.000000    90.800000    50.120000    2.370000   
50%      685.420000    270.000000   161.650000    73.850000    3.850000   
75%     1053.420000    460.980000   265.510000   125.810000    6.320000   
max    11949.330000  11500.000000  9593.480000  1748.130000   76.630000   

             f_46          f_47        f_48        f_49      target  
count  937.000000    937.000000  937.000000  937.000000  937.000000  
mean     0.128068   7985.718004   61.694386    8.119723    0.043757  
std      0.334344   6854.504915   10.412807    2.908895    0.204662  
min      0.000000   2051.500000   35.950000    5.810000    0.000000  
25%      0.000000   3760.570000   65.720000    6.340000    0.000000  
50%      0.000000   5509.430000   65.930000    7.220000    0.000000  
75%      0.000000   9521.930000   66.130000    7.840000    0.000000  
max      1.000000  55128.460000   66.450000   15.440000    1.000000  

[8 rows x 50 columns]

Q3)Derive some insights from the dataset.¶

Class Distribution (Target column)¶

In [ ]:

# Basic Class summary
print("\nClass distribution:\n")
print(df["target"].value_counts())

# summarize the class distribution
target = df.values[:, -1]
counter = Counter(target)
print("\nClass Distribution Summary:\n")
for k, v in counter.items():
    per = v / len(target) * 100
    print("Class=%d, Count=%d, Percentage=%.3f%%" % (k, v, per))

Class distribution:

0    896
1     41
Name: target, dtype: int64

Class Distribution Summary:

Class=1, Count=41, Percentage=4.376%
Class=0, Count=896, Percentage=95.624%

In [ ]:

# Countplot for Target Variable
ax = sns.countplot(x=df["target"], palette="husl", alpha=0.7)
plt.title("Countplot for Target Column")
plt.xlabel("Target Variable")
plt.ylabel("Count")

# Loop for annotation
for p in ax.patches:
    ax.text(
        p.get_x() + p.get_width() / 2.0,
        p.get_height(),
        f"{p.get_height()}",
        ha="center",
        va="bottom",
        fontsize=8,
        color="black",
    )

plt.show()

The target column on which we need to work for our classification shows that the dataset indicates:

896 Non oil-spill data/regions or roughly 95.6% of dataset
41 Oil-spill data/regions i.e. around 4.4% of dataset.
The above information will help us to design and test our model to check the predictions and its accuracy.

More info on color pallete

Histogram to see the data analysis¶

In [ ]:

fig = plt.figure(figsize=(25, 35))
ax = fig.gca()

_ = df.hist(ax=ax, color="green", edgecolor="black")
# Add a title at the top of the subplots
plt.suptitle("Histograms of DataFrame Columns", y=0.90, fontsize=24)

plt.show()

Simple Pie-chart to see target column distribution¶

In [ ]:

piechart = df["target"].value_counts()
# Create a pie chart with labels, numbers, and percentages
plt.pie(piechart, labels=["No-Spill", "Oil-Spill"],
        autopct="%0.1f%%", radius=1)
plt.title("Distribution of Spill Classes")
plt.show()

Pie-chart to show the single value in column F_23¶

In [ ]:

f_23_distribution = df["f_23"].value_counts()
labels = f_23_distribution.index.map(str)
values = f_23_distribution.values
plt.pie(values, labels=labels, autopct="%0.1f%%", radius=1)
plt.title(f'Distribution of column "F_23"')
plt.show()

Boxplots for the outlier checking of dataset¶

In [ ]:

# Set the number of subplots per row
subplots_per_row = 5

# Calculate the number of rows needed based on the number of columns and subplots per row
num_rows = (len(df.columns) - 1) // subplots_per_row + 1

# Set up the subplots
fig, axes = plt.subplots(
    nrows=num_rows, ncols=subplots_per_row, figsize=(25, 25))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Iterate over each column and create boxplots
for ax, column in zip(axes, df.columns):
    sns.boxplot(x=df[column], ax=ax, color="orange", width=0.5)
    ax.set_title(column, fontsize=14)
    ax.set_xlabel("Count")
    ax.set_ylabel("Values")

# Adjust layout for better spacing between subplots
plt.tight_layout()

# Add a common title at the top of the subplots
fig.suptitle(
    "Boxplots of DataFrame Columns (To Display Outliers in Dataset)",
    y=1.02,
    fontsize=24,
)

# Show the plots
plt.show()

Correlation matrix & Heatmap to analyze the correlation factor among various factors(columns)¶

In [ ]:

corr_matrix = df.corr()
df.corr()

Out[ ]:

	f_1	f_2	f_3	f_4	f_5	f_6	f_7	f_8	f_9	f_10	...	f_41	f_42	f_43	f_44	f_45	f_46	f_47	f_48	f_49	target
f_1	1.000000	-0.155581	0.172017	-0.104116	-0.017025	-0.169533	-0.037412	-0.204983	-0.244551	-0.214447	...	-0.286190	-0.167466	-0.156916	-0.141792	-0.139478	-0.163693	-0.202983	0.294422	-0.253698	-0.180531
f_2	-0.155581	1.000000	0.058390	0.052638	-0.036870	0.953947	-0.136761	-0.016822	0.829978	0.128465	...	0.555154	0.777807	0.800939	0.716496	-0.080879	-0.048315	0.118792	-0.128222	0.139417	0.034128
f_3	0.172017	0.058390	1.000000	0.549510	-0.082764	0.050795	-0.627934	-0.349541	0.158686	0.073794	...	0.186920	0.178287	0.129653	0.176883	-0.088310	-0.182458	-0.022098	0.048291	0.162600	-0.035221
f_4	-0.104116	0.052638	0.549510	1.000000	0.048847	0.024693	-0.546205	-0.222063	0.097683	0.202167	...	-0.046934	0.032402	0.022234	0.000664	-0.220461	-0.204776	0.106758	-0.394081	0.476127	-0.050489
f_5	-0.017025	-0.036870	-0.082764	0.048847	1.000000	-0.028431	0.059128	0.123814	-0.047879	0.098573	...	-0.066930	-0.014877	-0.013742	-0.012346	-0.076695	-0.080136	0.070070	-0.135294	0.116896	-0.078598
f_6	-0.169533	0.953947	0.050795	0.024693	-0.028431	1.000000	-0.093589	-0.001395	0.894150	0.097449	...	0.594273	0.844597	0.868353	0.770044	-0.077783	-0.046834	0.126850	-0.058752	0.069731	0.049318
f_7	-0.037412	-0.136761	-0.627934	-0.546205	0.059128	-0.093589	1.000000	0.381206	-0.188076	-0.380340	...	-0.115014	-0.100003	-0.074308	-0.073751	0.077207	0.088633	-0.157243	0.483034	-0.612819	-0.026183
f_8	-0.204983	-0.016822	-0.349541	-0.222063	0.123814	-0.001395	0.381206	1.000000	0.001073	0.670628	...	0.013476	-0.015712	-0.013193	0.002439	-0.061639	-0.051879	-0.028117	-0.101155	0.033731	-0.014434
f_9	-0.244551	0.829978	0.158686	0.097683	-0.047879	0.894150	-0.188076	0.001073	1.000000	0.164098	...	0.675610	0.784833	0.770129	0.736075	-0.073312	-0.048994	0.102540	-0.080203	0.113389	0.076679
f_10	-0.214447	0.128465	0.073794	0.202167	0.098573	0.097449	-0.380340	0.670628	0.164098	1.000000	...	0.082449	0.052518	0.043116	0.042269	-0.113481	-0.095896	0.112275	-0.587156	0.603358	-0.013359
f_11	-0.261624	0.745590	-0.064076	-0.082742	-0.075843	0.765628	0.093376	0.167904	0.671358	0.102331	...	0.630674	0.782581	0.790649	0.710990	-0.160260	-0.114133	0.127889	0.056237	-0.067659	0.157588
f_12	-0.209190	0.004035	-0.081738	0.106767	0.009470	-0.029363	-0.363593	0.406409	-0.008391	0.747509	...	-0.088211	-0.135129	-0.121701	-0.147694	-0.018188	0.045217	0.073414	-0.610604	0.594751	0.018417
f_13	-0.222342	0.020195	0.042723	0.224342	0.013574	-0.017706	-0.481003	0.289904	0.018342	0.730810	...	-0.084692	-0.120182	-0.109534	-0.140570	-0.067821	0.008266	0.128967	-0.665751	0.674792	0.036129
f_14	-0.220721	0.176080	0.299324	0.335270	-0.016254	0.155767	-0.574566	0.178362	0.261617	0.652360	...	0.177034	0.141294	0.117372	0.130096	-0.145173	-0.104025	0.104333	-0.539941	0.600364	0.044022
f_15	-0.137901	-0.118317	-0.301641	-0.039329	0.028305	-0.147712	-0.115334	0.335692	-0.215468	0.502049	...	-0.292963	-0.293204	-0.250771	-0.308273	0.035983	0.128789	0.071589	-0.501766	0.443858	-0.008092
f_16	-0.178220	0.235500	0.439603	0.372116	-0.029425	0.226015	-0.563544	0.051995	0.365164	0.487945	...	0.305778	0.269345	0.227190	0.262997	-0.169739	-0.162068	0.082550	-0.369724	0.457185	0.050515
f_17	0.056430	0.237388	-0.003753	-0.000815	0.045836	0.302462	-0.008360	-0.245330	0.160027	-0.231361	...	0.119187	0.361130	0.392898	0.287938	-0.055731	-0.054833	0.368569	0.078798	-0.081241	0.014977
f_18	0.027526	0.321276	-0.046857	-0.020119	0.065762	0.406917	0.027642	-0.188000	0.207135	-0.196430	...	0.144958	0.459463	0.509775	0.353361	-0.048525	-0.040779	0.329074	0.058242	-0.070072	-0.006263
f_19	0.038746	0.022253	0.599107	0.494286	-0.065304	0.046484	-0.134812	-0.254853	0.087192	-0.190016	...	0.181168	0.197681	0.167883	0.178385	-0.073405	-0.156083	0.074076	0.263886	-0.134275	0.022329
f_20	-0.159138	-0.053111	-0.193047	-0.011078	0.048283	-0.075633	-0.222835	0.505829	-0.080864	0.725196	...	-0.155566	-0.193686	-0.172891	-0.201335	0.050078	0.103484	0.011602	-0.519132	0.481139	-0.049940
f_21	-0.170053	-0.057095	-0.033185	0.132370	0.032183	-0.080839	-0.386666	0.361435	-0.069741	0.722926	...	-0.147508	-0.181691	-0.165905	-0.196407	0.002484	0.059500	0.086494	-0.592765	0.587612	-0.017439
f_22	-0.240241	0.140805	0.299426	0.615556	0.122604	0.076952	-0.606984	-0.086842	0.120763	0.474004	...	-0.050153	0.002071	0.010254	-0.045236	-0.188274	-0.148988	0.317913	-0.870454	0.916544	0.035323
f_23	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
f_24	0.026409	-0.142899	-0.669548	-0.667478	0.001713	-0.097027	0.906524	0.360343	-0.186528	-0.362216	...	-0.096460	-0.111014	-0.088658	-0.074335	0.133210	0.150626	-0.229779	0.568317	-0.701463	-0.040364
f_25	-0.260500	0.131439	0.059858	0.466752	0.130162	0.060381	-0.565818	0.041689	0.095497	0.584004	...	-0.101886	-0.070467	-0.052167	-0.111762	-0.147284	-0.076648	0.279861	-0.990288	0.986720	-0.013202
f_26	0.397330	-0.088328	-0.013191	-0.503382	-0.109169	-0.042262	0.138941	-0.167225	-0.039744	-0.377968	...	0.083950	0.044717	0.018422	0.096065	0.126362	0.064918	-0.260992	0.700420	-0.690917	-0.054643
f_27	0.404138	-0.013225	0.627137	0.115749	-0.110526	0.010893	-0.393415	-0.455887	0.084778	-0.252999	...	0.186856	0.171926	0.118393	0.200911	-0.012176	-0.122450	-0.147258	0.467314	-0.328293	-0.068181
f_28	-0.173291	0.091962	-0.119572	0.113152	0.138419	0.047533	-0.145690	0.349434	0.059639	0.548929	...	-0.009764	-0.027413	-0.019933	-0.042534	-0.094926	-0.043885	0.259758	-0.515659	0.489585	0.061178
f_29	-0.158883	0.170798	0.012991	0.167272	0.045215	0.118327	-0.283833	0.175192	0.145497	0.507455	...	0.089752	0.039657	0.037170	0.017672	-0.036557	-0.011833	0.243373	-0.481778	0.484850	0.021424
f_30	0.237770	-0.163065	-0.368946	-0.551173	-0.094680	-0.100486	0.725108	0.082133	-0.167507	-0.536527	...	-0.018595	-0.031007	-0.027820	0.007076	0.158981	0.111419	-0.299345	0.831027	-0.907202	-0.050517
f_31	0.035824	0.005472	-0.097925	-0.358789	-0.175452	0.058062	0.228706	-0.013229	0.070491	-0.233577	...	0.266927	0.149927	0.132506	0.153446	0.268631	0.208135	-0.115635	0.487006	-0.499680	0.041730
f_32	-0.094846	0.118776	0.585351	0.686419	0.062919	0.060508	-0.818055	-0.235055	0.172914	0.432378	...	0.068535	0.063352	0.036559	0.041763	-0.184435	-0.199718	0.231370	-0.673397	0.785353	0.013173
f_33	-0.036654	-0.009433	-0.061054	-0.064612	0.044074	-0.009910	0.101277	0.082125	-0.020690	0.003666	...	-0.017361	-0.026800	-0.020555	-0.025864	0.063069	0.091043	0.051232	0.022010	-0.034338	-0.012170
f_34	-0.091356	0.118634	0.585760	0.686369	0.059118	0.060826	-0.819826	-0.239614	0.173239	0.428950	...	0.069370	0.064939	0.037864	0.043431	-0.187893	-0.205192	0.225773	-0.670186	0.782269	0.014008
f_35	-0.225343	0.869227	0.178255	0.151832	-0.044723	0.884713	-0.246512	-0.021463	0.979517	0.201878	...	0.624000	0.749043	0.739734	0.698957	-0.093407	-0.065521	0.118231	-0.169488	0.203509	0.046540
f_36	-0.216387	0.873996	0.177423	0.147977	-0.039225	0.892963	-0.239744	-0.019355	0.980876	0.198350	...	0.625060	0.758187	0.748835	0.708918	-0.095036	-0.068246	0.116058	-0.163067	0.196819	0.040756
f_37	0.281274	-0.148739	0.246582	0.284814	0.100720	-0.179259	-0.386837	-0.250518	-0.209671	0.085596	...	-0.307490	-0.242312	-0.232705	-0.244889	-0.007764	-0.028332	0.080429	-0.354068	0.398088	-0.100417
f_38	-0.260929	0.443913	0.332342	0.262117	-0.006470	0.480804	-0.383085	-0.086931	0.778100	0.221490	...	0.579851	0.496987	0.437209	0.485860	0.008159	-0.006834	0.108795	-0.180859	0.247958	0.041885
f_39	-0.452966	0.080779	-0.279094	0.282325	0.204627	0.021929	-0.130835	0.297139	0.003008	0.517838	...	-0.180546	-0.134773	-0.097368	-0.173564	-0.154701	-0.052587	0.305194	-0.884484	0.811961	0.033768
f_40	-0.499695	0.071089	-0.165125	0.344152	0.232303	0.021595	-0.051147	0.281511	0.001339	0.424904	...	-0.152639	-0.087815	-0.057056	-0.125037	-0.194900	-0.117922	0.333216	-0.740305	0.691901	0.066220
f_41	-0.286190	0.555154	0.186920	-0.046934	-0.066930	0.594273	-0.115014	0.013476	0.675610	0.082449	...	1.000000	0.703587	0.632130	0.714021	0.179308	0.106656	0.083573	0.120024	-0.068486	0.148987
f_42	-0.167466	0.777807	0.178287	0.032402	-0.014877	0.844597	-0.100003	-0.015712	0.784833	0.052518	...	0.703587	1.000000	0.979836	0.932383	-0.120047	-0.137111	0.119719	0.090195	-0.047456	0.050657
f_43	-0.156916	0.800939	0.129653	0.022234	-0.013742	0.868353	-0.074308	-0.013193	0.770129	0.043116	...	0.632130	0.979836	1.000000	0.860925	-0.131742	-0.114537	0.134836	0.064703	-0.034036	0.046533
f_44	-0.141792	0.716496	0.176883	0.000664	-0.012346	0.770044	-0.073751	0.002439	0.736075	0.042269	...	0.714021	0.932383	0.860925	1.000000	-0.098196	-0.156458	0.072550	0.133416	-0.089327	0.031244
f_45	-0.139478	-0.080879	-0.088310	-0.220461	-0.076695	-0.077783	0.077207	-0.061639	-0.073312	-0.113481	...	0.179308	-0.120047	-0.131742	-0.098196	1.000000	0.545285	-0.061429	0.130842	-0.141206	0.016261
f_46	-0.163693	-0.048315	-0.182458	-0.204776	-0.080136	-0.046834	0.088633	-0.051879	-0.048994	-0.095896	...	0.106656	-0.137111	-0.114537	-0.156458	0.545285	1.000000	-0.011024	0.047073	-0.079484	0.058537
f_47	-0.202983	0.118792	-0.022098	0.106758	0.070070	0.126850	-0.157243	-0.028117	0.102540	0.112275	...	0.083573	0.119719	0.134836	0.072550	-0.061429	-0.011024	1.000000	-0.292330	0.299541	0.436890
f_48	0.294422	-0.128222	0.048291	-0.394081	-0.135294	-0.058752	0.483034	-0.101155	-0.080203	-0.587156	...	0.120024	0.090195	0.064703	0.133416	0.130842	0.047073	-0.292330	1.000000	-0.974548	-0.003163
f_49	-0.253698	0.139417	0.162600	0.476127	0.116896	0.069731	-0.612819	0.033731	0.113389	0.603358	...	-0.068486	-0.047456	-0.034036	-0.089327	-0.141206	-0.079484	0.299541	-0.974548	1.000000	0.008365
target	-0.180531	0.034128	-0.035221	-0.050489	-0.078598	0.049318	-0.026183	-0.014434	0.076679	-0.013359	...	0.148987	0.050657	0.046533	0.031244	0.016261	0.058537	0.436890	-0.003163	0.008365	1.000000

50 rows × 50 columns

In [ ]:

plt.figure(figsize=(50, 50))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm",
            fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix - Oil Spill Dataset", fontsize=16)
plt.show()

Insights:

From above Correlation and graphs it is clear that F_23 contains a single value and it will not have major impact on model so we will consider dropping it.
From the correlation & heatmap we can see few highly correlated values/columns in our dataset, we need to treat them to overcome below problems:

Multicollinearity: Highly correlated features can lead to multicollinearity, which can make it difficult for models to accurately estimate the individual effects of each feature.This can result in:

Increased variance in model coefficients, making them less reliable.

Reduced model interpretability, as it's unclear which features are truly driving predictions.

Overfitting: Highly correlated variables can lead to overfitting in some models, especially if the dataset is not large enough.

Model Stability: Unnecessary redundancy in features may lead to less stable model performance.

Removing Highly Correlated Columns

In [ ]:

# Correlation matrix
corr_matrix = df.corr()

# Selecting upper triangle of correlation matrix
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features/columns with correlation greater than 0.90
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]
print(f"Columns with high correlation values : \n{to_drop} \n\nTotal Correlated columns : {len(to_drop)}")

Columns with high correlation values : 
['f_6', 'f_13', 'f_16', 'f_18', 'f_20', 'f_21', 'f_24', 'f_25', 'f_34', 'f_35', 'f_36', 'f_40', 'f_43', 'f_44', 'f_49'] 

Total Correlated columns : 15

In [ ]:

# Drop features/columns
df1 = df.copy()
df1.drop(to_drop, axis=1, inplace=True)
# dropping F_23 since it only has single value
f23 = "f_23"
df1.drop(f23, axis=1, inplace=True)
cleaned_df = df1.copy()
print("Original Dataframe:", df.shape)
print("Cleaned Dataframe (Highly correlated columns removed):", cleaned_df.shape)
print("\nRemoved Columns from dataset:\n", to_drop + [f23])

Original Dataframe: (937, 50)
Cleaned Dataframe (Highly correlated columns removed): (937, 34)

Removed Columns from dataset:
 ['f_6', 'f_13', 'f_16', 'f_18', 'f_20', 'f_21', 'f_24', 'f_25', 'f_34', 'f_35', 'f_36', 'f_40', 'f_43', 'f_44', 'f_49', 'f_23']

Q4) Apply various Machine Learning techniques to predict the output in the target column, make use of Bagging and Ensemble as required, and find the best model by evaluating the model using Model evaluation techniques.¶

Dependant (y) & Independent (x) Features¶

Dropping dependant feature from dataset

In [ ]:

x = cleaned_df.drop("target", axis=1)
y = cleaned_df["target"]

print(type(x))
print(type(y))
print(x.shape)
print(y.shape)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(937, 33)
(937,)

Splitting The dataset

In [ ]:

# splitting the dataset into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)
print(f"Split Check Test values : {937 * 0.3} & Train values : {937 * 0.7}")
# rows , columns
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Split Check Test values : 281.09999999999997 & Train values : 655.9
(655, 33)
(282, 33)
(655,)
(282,)

3. Display Splitted data¶

In [ ]:

X_train, X_test

Out[ ]:

(     f_1    f_2      f_3      f_4  f_5    f_7    f_8      f_9  f_10   f_11  \
 757   25     18    78.11   456.00   70  48.94   7.79   1588.0  0.16   91.8   
 693   46     18   153.39   464.39   13  70.33  15.87   1228.0  0.23  118.7   
 854  122     14   141.86   446.50   66  51.71   5.44   1064.0  0.10  106.6   
 501    7    603   299.61  1472.03  114  25.01   8.98   7654.0  0.36  110.8   
 664   17    162     7.70   546.00   64  70.65  15.70   5362.0  0.22  244.7   
 ..   ...    ...      ...      ...  ...    ...    ...      ...   ...    ...   
 106   96     73  1391.75   934.48   54  41.23   6.75   2570.0  0.16   71.0   
 270  227     63  1139.70   934.33  138  31.78   7.68   2730.0  0.24   57.7   
 860  128     15    80.87   264.07   61  54.40   7.73   1302.0  0.14   93.3   
 435    3  32389   874.99  1210.98   35  24.62   9.75  62250.0  0.40  731.7   
 102   92    121  1171.12  1388.43   66  40.15   9.26   3440.0  0.23   87.9   
 
      ...  f_33  f_37   f_38  f_39     f_41     f_42   f_45  f_46     f_47  \
 757  ...   0.0  0.00  17.30    82   853.81   180.00  12.20     0  2674.72   
 693  ...   0.0  0.00  10.34   102   742.16   270.00   6.00     0  8227.75   
 854  ...   0.0  0.00   9.98    82   484.66   180.00   3.85     0  3830.45   
 501  ...   0.0  0.01  69.09   143     0.00     0.00   0.00     0  7780.79   
 664  ...   0.0  0.00  21.91   102  1288.60  1170.00   1.47     0  3777.50   
 ..   ...   ...   ...    ...   ...      ...      ...    ...   ...      ...   
 106  ...   0.0  0.01  36.19    78   610.33   500.00   2.54     0  8744.58   
 270  ...   0.0  0.01  47.32    64   992.47   282.84  11.70     0  4915.12   
 860  ...   0.0  0.01  13.95    82   524.79   127.28   6.87     0  3235.86   
 435  ...   0.0  0.00  85.08   133  6740.41  8789.57   1.00     0  9422.57   
 102  ...   0.0  0.01  39.12    78  1411.56   335.41   8.82     0  4880.79   
 
       f_48  
 757  65.96  
 693  66.01  
 854  65.98  
 501  36.22  
 664  66.06  
 ..     ...  
 106  65.98  
 270  65.92  
 860  65.71  
 435  36.59  
 102  66.18  
 
 [655 rows x 33 columns],
      f_1  f_2      f_3      f_4  f_5    f_7    f_8      f_9  f_10   f_11  ...  \
 321   29  105   881.92  1128.79   83  38.90   8.51   2710.0  0.22   96.9  ...   
 70    60  111  1153.32  1283.44   41  41.25   5.98   1760.0  0.14  157.7  ...   
 209   17  867  1059.49   581.31   46  31.08   8.26  15780.0  0.27  137.4  ...   
 656    9   85    71.06   469.47  140  70.85  11.28   4626.0  0.16  148.8  ...   
 685   38   15    32.47   582.13  156  73.27  12.11   1080.0  0.17  112.5  ...   
 ..   ...  ...      ...      ...  ...    ...    ...      ...   ...    ...  ...   
 430  183   51  1340.16   898.61   64  42.45   7.88   1430.0  0.19   89.2  ...   
 292  317  117  1269.88   917.89  123  29.16   8.85   2440.0  0.30  119.9  ...   
 412  151   64   991.70  1018.53  175  37.52   9.27   1400.0  0.25  114.3  ...   
 557   63   59  1253.20  1192.53   76  29.51   7.32   1664.0  0.25   49.9  ...   
 133  123   72  1606.14  1110.06   99  36.50   6.89   1760.0  0.19  102.3  ...   
 
      f_33  f_37    f_38  f_39     f_41     f_42  f_45  f_46      f_47   f_48  
 321   0.0  0.00   27.98    85   955.25   353.55  4.21     0   3425.75  65.97  
 70    0.0  0.00   11.16    78   710.63   500.00  2.40     0   5915.80  66.12  
 209   0.0  0.00  114.88    64  3146.82  1131.37  4.93     0   5679.31  65.74  
 656   0.0  0.00   31.08   102  1279.14   509.12  3.95     0   6376.53  65.98  
 685   0.0  0.01    9.60   102   685.42   201.25  6.47     0   3285.95  66.11  
 ..    ...   ...     ...   ...      ...      ...   ...   ...       ...    ...  
 430   0.0  0.01   16.04    85   604.15   269.26  4.07     0   5842.59  65.93  
 292   0.0  0.01   20.35    64   610.33   721.11  1.51     0   5512.84  65.93  
 412   0.0  0.01   12.25    85   450.00   300.00  1.76     0   2914.09  65.94  
 557   0.0  0.02   33.37   143   582.16   201.94  5.85     0  26008.35  36.85  
 133   0.0  0.01   17.21    78   559.02   304.14  2.30     0   4456.55  66.10  
 
 [282 rows x 33 columns])

In [ ]:

y_train, y_test

Out[ ]:

(757    0
 693    0
 854    0
 501    0
 664    1
       ..
 106    0
 270    0
 860    0
 435    0
 102    0
 Name: target, Length: 655, dtype: int64,
 321    0
 70     0
 209    0
 656    0
 685    0
       ..
 430    0
 292    0
 412    0
 557    0
 133    0
 Name: target, Length: 282, dtype: int64)

Standarizing the dataset

In [ ]:

sc = StandardScaler()
sc.fit_transform(X_train)
sc.transform(X_test)

Out[ ]:

array([[-0.84347997, -0.12493331,  0.30561268, ..., -0.38616422,
        -0.69720058,  0.41600293],
       [-0.36292832, -0.12189254,  0.75321954, ..., -0.38616422,
        -0.30840223,  0.43029588],
       [-1.02949997,  0.26124516,  0.59847027, ..., -0.38616422,
        -0.34532796,  0.39408706],
       ...,
       [ 1.0477233 , -0.14571195,  0.48666751, ..., -0.38616422,
        -0.77709157,  0.41314433],
       [-0.31642332, -0.14824593,  0.91794678, ..., -0.38616422,
         2.82886418, -2.35873678],
       [ 0.61367665, -0.14165758,  1.50003362, ..., -0.38616422,
        -0.53625066,  0.42839016]])

Model Selection - Evaluation (testing,scoring) & graph plots¶

1. Function to evaluate the model on various parameters & store the results¶

In [ ]:

# Function to evaluate and store results in a dictionary
def calculate_scores(model, X_train, y_train, X_test, y_test):
    train_score = accuracy_score(
        y_train, model.predict(X_train)
    )  # Calculate train score
    test_score = accuracy_score(
        y_test, model.predict(X_test))  # Calculate test score

    return train_score, test_score


def evaluate_model(model, model_name, X_test, y_test):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    # Mean Squared Error and R-squared Score
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Confusion Matrix and classification report
    cm = confusion_matrix(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)
    auc_score = roc_auc_score(y_test, y_prob)
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc_value = auc(fpr, tpr)

    cls_report = classification_report(y_test, y_pred, zero_division=0)

    # Display results in tabular format
    results_table = [
        ["Model", model_name],
        ["Mean Squared Error", mse],
        ["R-squared Score", r2],
        ["Confusion Matrix", f"{cm}"],
        ["True Positive", cm[0, 0]],
        ["False Negative", cm[0, 1]],
        ["False Positive", cm[1, 0]],
        ["True Negative", cm[1, 1]],
        ["Accuracy", acc],
        ["AUC", auc_score],
        ["Train Score", train_score],
        ["Test Score", test_score],
    ]

    print(tabulate(results_table, headers=[
          "Metric", "Value"], tablefmt="heavy_grid"))

    # Display Classification Report
    print("\nClassification Report:\n")
    print(cls_report)

    # Plot Confusion Matrix
    plt.matshow(cm, cmap=plt.cm.Reds)
    plt.title(f"Confusion Matrix for {model_name}")
    plt.colorbar()
    plt.xlabel("Predicted")
    plt.ylabel("True")
    # Add annotations to matrix
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, str(cm[i, j]), ha="center",
                     va="center", color="black")
    plt.show()

    # Store results in the dictionary
    return {
        "Model": model_name,
        "Mean Squared Error": mse,
        "R-squared Score": r2,
        "True Positive": cm[0, 0],
        "False Negative": cm[0, 1],
        "False Positive": cm[1, 0],
        "True Negative": cm[1, 1],
        "Accuracy": acc,
        "AUC": auc_score,
        "ROC Curve FPR": fpr,
        "ROC Curve TPR": tpr,
        "AUC Value": auc_value,
        "Confusion Matrix": cm,
        "Train Score": {train_score},
        "Test Score": {test_score},
    }

Plot for the graph of roc curve

In [ ]:

# Function to plot ROC curve
def plot_roc_curve(model, X_test, y_test):
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc_value = auc(fpr, tpr)

    # Plot ROC curve
    plt.plot(fpr, tpr, color="orange",
             label=f"ROC Curve (AUC = {auc_value:.4f})")
    plt.plot([0, 1], [0, 1], label="TPR=FPR", linestyle="--")
    plt.title(f"ROC Curve for {model_name}")
    plt.xlabel("False Positive Rate (FPR)")
    plt.ylabel("True Positive Rate (TPR)")
    plt.grid()
    plt.legend()
    plt.show()

Model To look into

List of models to evaluate (just an example of parameters)¶

Models

models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000, C=1.0, solver='lbfgs')),
    ("k-Nearest Neighbors", KNeighborsClassifier(n_neighbors=5, weights='uniform')),
    ("Decision Tree", DecisionTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1)),
    ("AdaBoost", AdaBoostClassifier(n_estimators=50, learning_rate=1.0)),
    ("Bagging", BaggingClassifier(n_estimators=10, max_samples=1.0, max_features=1.0)),
    ("Gradient Boosting", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)),
    ("Gaussian Naive Bayes", GaussianNB()),
    ("SVM", SVC(probability=True, C=1.0, kernel='rbf')),
]

3. Store the results of models built (by calling above evaluation code)¶

In [ ]:

# Placeholder for results
evaluation_results = []

Different Models Testing,evaulation & result/graph¶

1. Logistic Regression¶

In [ ]:

# 1. Passing the model name
model_name = "Logistic Regression"

# 2. model parameters
model = LogisticRegression(max_iter=1000, C=1.0, solver="lbfgs")

# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 LogisticRegression(max_iter=1000) 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ Logistic Regression  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.031914893617021274 ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ 0.14860784971486074  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[266   5]           ┃
┃                    ┃  [  4   7]]          ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 266                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 5                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 4                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 7                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9680851063829787   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.8436766185843676   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 0.9679389312977099   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9680851063829787   ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.99      0.98      0.98       271
           1       0.58      0.64      0.61        11

    accuracy                           0.97       282
   macro avg       0.78      0.81      0.80       282
weighted avg       0.97      0.97      0.97       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

LogisticRegression(max_iter=1000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

2. KNeighbors Classifier¶

In [ ]:

# 1. Passing the model name
model_name = "k-Nearest Neighbors"

# 2. model parameters
model = KNeighborsClassifier(n_neighbors=5, weights="uniform")

# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 KNeighborsClassifier() 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ k-Nearest Neighbors  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.04609929078014184  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ -0.22978866152297894 ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[264   7]           ┃
┃                    ┃  [  6   5]]          ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 264                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 7                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 6                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 5                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9539007092198581   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.8297551157329754   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 0.9526717557251908   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9539007092198581   ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.98      0.97      0.98       271
           1       0.42      0.45      0.43        11

    accuracy                           0.95       282
   macro avg       0.70      0.71      0.71       282
weighted avg       0.96      0.95      0.95       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

KNeighborsClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

3. Decision Tree Classifier¶

In [ ]:

# 1. Passing the model name
model_name = "Decision Tree"

# 2. model parameters
model = DecisionTreeClassifier(
    max_depth=None, min_samples_split=2, min_samples_leaf=1)
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 DecisionTreeClassifier() 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ Decision Tree        ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.04609929078014184  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ -0.22978866152297894 ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[264   7]           ┃
┃                    ┃  [  6   5]]          ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 264                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 7                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 6                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 5                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9539007092198581   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.7143575981214358   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 1.0                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9539007092198581   ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.98      0.97      0.98       271
           1       0.42      0.45      0.43        11

    accuracy                           0.95       282
   macro avg       0.70      0.71      0.71       282
weighted avg       0.96      0.95      0.95       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

DecisionTreeClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

4. Random Forest Classifier¶

In [ ]:

# 1. Passing the model name
model_name = "Random Forest"

# 2. model parameters
model = RandomForestClassifier(
    n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1
)

# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 RandomForestClassifier() 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ Random Forest        ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.031914893617021274 ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ 0.14860784971486074  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[270   1]           ┃
┃                    ┃  [  8   3]]          ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 270                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 1                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 8                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 3                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9680851063829787   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.9050654142905066   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 1.0                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9680851063829787   ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       271
           1       0.75      0.27      0.40        11

    accuracy                           0.97       282
   macro avg       0.86      0.63      0.69       282
weighted avg       0.96      0.97      0.96       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

RandomForestClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

5. Ada Boost Classifier¶

In [ ]:

# 1. Passing the model name
model_name = "AdaBoost"

# 2. model parameters
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)

# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 AdaBoostClassifier() 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ AdaBoost             ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.03546099290780142  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ 0.054008721905400736 ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[268   3]           ┃
┃                    ┃  [  7   4]]          ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 268                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 3                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 7                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 4                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9645390070921985   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.7926870177792688   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 1.0                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9645390070921985   ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.97      0.99      0.98       271
           1       0.57      0.36      0.44        11

    accuracy                           0.96       282
   macro avg       0.77      0.68      0.71       282
weighted avg       0.96      0.96      0.96       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

AdaBoostClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

6. Bagging Classifier¶

In [ ]:

# 1. Passing the model name
model_name = "Bagging"

# 2. model parameters
model = BaggingClassifier(n_estimators=10, max_samples=1.0, max_features=1.0)

# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 BaggingClassifier() 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ Bagging              ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.04609929078014184  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ -0.22978866152297894 ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[267   4]           ┃
┃                    ┃  [  9   2]]          ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 267                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 4                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 9                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 2                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9539007092198581   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.8596108688359612   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 0.9969465648854962   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9539007092198581   ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.97      0.99      0.98       271
           1       0.33      0.18      0.24        11

    accuracy                           0.95       282
   macro avg       0.65      0.58      0.61       282
weighted avg       0.94      0.95      0.95       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

BaggingClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

7. Gradient Boosting Classifier¶

In [ ]:

# 1. Passing the model name
model_name = "Gradient Boosting"

# 2. model parameters
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3)

# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 GradientBoostingClassifier() 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ Gradient Boosting    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.031914893617021274 ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ 0.14860784971486074  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[268   3]           ┃
┃                    ┃  [  6   5]]          ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 268                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 3                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 6                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 5                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9680851063829787   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.858269037235827    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 1.0                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9680851063829787   ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.98      0.99      0.98       271
           1       0.62      0.45      0.53        11

    accuracy                           0.97       282
   macro avg       0.80      0.72      0.75       282
weighted avg       0.96      0.97      0.97       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

GradientBoostingClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

8. Gaussian Naive Bayes¶

In [ ]:

# 1. Passing the model name
model_name = "Gaussian Naive Bayes"

# 2. model parameters
model = GaussianNB()

# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 GaussianNB() 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ Gaussian Naive Bayes ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.0851063829787234   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ -1.2703790674270383  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[251  20]           ┃
┃                    ┃  [  4   7]]          ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 251                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 20                   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 4                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 7                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9148936170212766   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.731969137873197    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 0.9267175572519084   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9148936170212766   ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.98      0.93      0.95       271
           1       0.26      0.64      0.37        11

    accuracy                           0.91       282
   macro avg       0.62      0.78      0.66       282
weighted avg       0.96      0.91      0.93       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

GaussianNB()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

9. SVM¶

In [ ]:

# 1. Passing the model name
model_name = "Support Vector Machine"

# 2. model parameters
model = SVC(probability=True, C=1.0, kernel="rbf")

# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")

# 3.2: Score Calculation
train_score, test_score = calculate_scores(
    model, X_train, y_train, X_test, y_test)

# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)

# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)

# Model Detail
model

 SVC(probability=True) 

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric             ┃ Value                  ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Model              ┃ Support Vector Machine ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Mean Squared Error ┃ 0.03900709219858156    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ R-squared Score    ┃ -0.04059040590405916   ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Confusion Matrix   ┃ [[271   0]             ┃
┃                    ┃  [ 11   0]]            ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Positive      ┃ 271                    ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Negative     ┃ 0                      ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ False Positive     ┃ 11                     ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ True Negative      ┃ 0                      ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Accuracy           ┃ 0.9609929078014184     ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ AUC                ┃ 0.9389466621938947     ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Train Score        ┃ 0.9603053435114504     ┃
┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ Test Score         ┃ 0.9609929078014184     ┃
┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┛

Classification Report:

              precision    recall  f1-score   support

           0       0.96      1.00      0.98       271
           1       0.00      0.00      0.00        11

    accuracy                           0.96       282
   macro avg       0.48      0.50      0.49       282
weighted avg       0.92      0.96      0.94       282

***************************************************************************

---------------------------------------------------------------------------

Out[ ]:

SVC(probability=True)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Results Check¶

In [ ]:

print(f"\nTotal models used & Evaluated : {len(evaluation_results)} & \nSaved parameters in each results :{len(results)}")
for result in evaluation_results:
    print("\n", result)

Total models used & Evaluated : 9 & 
Saved parameters in each results :15

 {'Model': 'Logistic Regression', 'Mean Squared Error': 0.031914893617021274, 'R-squared Score': 0.14860784971486074, 'True Positive': 266, 'False Negative': 5, 'False Positive': 4, 'True Negative': 7, 'Accuracy': 0.9680851063829787, 'AUC': 0.8436766185843676, 'ROC Curve FPR': array([0.        , 0.        , 0.        , 0.00369004, 0.00369004,
       0.11439114, 0.11439114, 0.32472325, 0.32472325, 0.35424354,
       0.35424354, 0.91512915, 0.91512915, 1.        ]), 'ROC Curve TPR': array([0.        , 0.09090909, 0.36363636, 0.36363636, 0.63636364,
       0.63636364, 0.72727273, 0.72727273, 0.81818182, 0.81818182,
       0.90909091, 0.90909091, 1.        , 1.        ]), 'AUC Value': 0.8436766185843676, 'Confusion Matrix': array([[266,   5],
       [  4,   7]]), 'Train Score': {0.9679389312977099}, 'Test Score': {0.9680851063829787}}

 {'Model': 'k-Nearest Neighbors', 'Mean Squared Error': 0.04609929078014184, 'R-squared Score': -0.22978866152297894, 'True Positive': 264, 'False Negative': 7, 'False Positive': 6, 'True Negative': 5, 'Accuracy': 0.9539007092198581, 'AUC': 0.8297551157329754, 'ROC Curve FPR': array([0.        , 0.00369004, 0.02583026, 0.05166052, 0.12177122,
       1.        ]), 'ROC Curve TPR': array([0.        , 0.        , 0.45454545, 0.72727273, 0.72727273,
       1.        ]), 'AUC Value': 0.8297551157329754, 'Confusion Matrix': array([[264,   7],
       [  6,   5]]), 'Train Score': {0.9526717557251908}, 'Test Score': {0.9539007092198581}}

 {'Model': 'Decision Tree', 'Mean Squared Error': 0.04609929078014184, 'R-squared Score': -0.22978866152297894, 'True Positive': 264, 'False Negative': 7, 'False Positive': 6, 'True Negative': 5, 'Accuracy': 0.9539007092198581, 'AUC': 0.7143575981214358, 'ROC Curve FPR': array([0.        , 0.02583026, 1.        ]), 'ROC Curve TPR': array([0.        , 0.45454545, 1.        ]), 'AUC Value': 0.7143575981214358, 'Confusion Matrix': array([[264,   7],
       [  6,   5]]), 'Train Score': {1.0}, 'Test Score': {0.9539007092198581}}

 {'Model': 'Random Forest', 'Mean Squared Error': 0.031914893617021274, 'R-squared Score': 0.14860784971486074, 'True Positive': 270, 'False Negative': 1, 'False Positive': 8, 'True Negative': 3, 'Accuracy': 0.9680851063829787, 'AUC': 0.9050654142905066, 'ROC Curve FPR': array([0.        , 0.00369004, 0.00369004, 0.00738007, 0.01476015,
       0.01476015, 0.01476015, 0.01845018, 0.01845018, 0.03690037,
       0.03690037, 0.04428044, 0.04797048, 0.05904059, 0.08856089,
       0.099631  , 0.11808118, 0.12915129, 0.15498155, 0.18819188,
       0.21402214, 0.25092251, 0.3099631 , 0.42804428, 0.60885609,
       1.        ]), 'ROC Curve TPR': array([0.        , 0.        , 0.27272727, 0.27272727, 0.27272727,
       0.45454545, 0.54545455, 0.54545455, 0.63636364, 0.63636364,
       0.72727273, 0.72727273, 0.81818182, 0.81818182, 0.81818182,
       0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182,
       0.81818182, 0.81818182, 0.81818182, 0.90909091, 1.        ,
       1.        ]), 'AUC Value': 0.9050654142905066, 'Confusion Matrix': array([[270,   1],
       [  8,   3]]), 'Train Score': {1.0}, 'Test Score': {0.9680851063829787}}

 {'Model': 'AdaBoost', 'Mean Squared Error': 0.03546099290780142, 'R-squared Score': 0.054008721905400736, 'True Positive': 268, 'False Negative': 3, 'False Positive': 7, 'True Negative': 4, 'Accuracy': 0.9645390070921985, 'AUC': 0.7926870177792688, 'ROC Curve FPR': array([0.        , 0.00369004, 0.00369004, 0.00738007, 0.00738007,
       0.08487085, 0.08487085, 0.28413284, 0.28413284, 0.33579336,
       0.33579336, 0.35424354, 0.35424354, 0.36531365, 0.36531365,
       0.36900369, 0.36900369, 0.46494465, 0.46494465, 0.87822878,
       0.88560886, 1.        ]), 'ROC Curve TPR': array([0.        , 0.        , 0.18181818, 0.18181818, 0.36363636,
       0.36363636, 0.45454545, 0.45454545, 0.54545455, 0.54545455,
       0.63636364, 0.63636364, 0.72727273, 0.72727273, 0.81818182,
       0.81818182, 0.90909091, 0.90909091, 1.        , 1.        ,
       1.        , 1.        ]), 'AUC Value': 0.7926870177792688, 'Confusion Matrix': array([[268,   3],
       [  7,   4]]), 'Train Score': {1.0}, 'Test Score': {0.9645390070921985}}

 {'Model': 'Bagging', 'Mean Squared Error': 0.04609929078014184, 'R-squared Score': -0.22978866152297894, 'True Positive': 267, 'False Negative': 4, 'False Positive': 9, 'True Negative': 2, 'Accuracy': 0.9539007092198581, 'AUC': 0.8596108688359612, 'ROC Curve FPR': array([0.        , 0.        , 0.        , 0.01476015, 0.02214022,
       0.0295203 , 0.05535055, 0.099631  , 0.19557196, 1.        ]), 'ROC Curve TPR': array([0.        , 0.09090909, 0.18181818, 0.18181818, 0.45454545,
       0.54545455, 0.63636364, 0.72727273, 0.81818182, 1.        ]), 'AUC Value': 0.8596108688359612, 'Confusion Matrix': array([[267,   4],
       [  9,   2]]), 'Train Score': {0.9969465648854962}, 'Test Score': {0.9539007092198581}}

 {'Model': 'Gradient Boosting', 'Mean Squared Error': 0.031914893617021274, 'R-squared Score': 0.14860784971486074, 'True Positive': 268, 'False Negative': 3, 'False Positive': 6, 'True Negative': 5, 'Accuracy': 0.9680851063829787, 'AUC': 0.858269037235827, 'ROC Curve FPR': array([0.        , 0.        , 0.00369004, 0.00369004, 0.00738007,
       0.00738007, 0.01107011, 0.01107011, 0.02214022, 0.02214022,
       0.03690037, 0.03690037, 0.07749077, 0.07749077, 0.10332103,
       0.10332103, 0.39852399, 0.41328413, 0.43173432, 0.44280443,
       0.44649446, 0.45756458, 0.46494465, 0.4797048 , 0.48708487,
       0.50184502, 0.52398524, 0.5498155 , 0.58302583, 0.59409594,
       0.61254613, 0.62361624, 0.62730627, 0.65313653, 0.66420664,
       0.67158672, 0.71586716, 0.72324723, 0.73062731, 0.74169742,
       0.83763838, 0.84132841, 0.84870849, 0.8597786 , 0.86715867,
       0.87822878, 0.88929889, 0.89667897, 0.91512915, 0.95940959,
       0.96309963, 0.97416974, 0.98154982, 0.99630996, 1.        ]), 'ROC Curve TPR': array([0.        , 0.09090909, 0.09090909, 0.18181818, 0.18181818,
       0.36363636, 0.36363636, 0.45454545, 0.45454545, 0.54545455,
       0.54545455, 0.63636364, 0.63636364, 0.72727273, 0.72727273,
       0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091,
       0.90909091, 0.90909091, 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ]), 'AUC Value': 0.858269037235827, 'Confusion Matrix': array([[268,   3],
       [  6,   5]]), 'Train Score': {1.0}, 'Test Score': {0.9680851063829787}}

 {'Model': 'Gaussian Naive Bayes', 'Mean Squared Error': 0.0851063829787234, 'R-squared Score': -1.2703790674270383, 'True Positive': 251, 'False Negative': 20, 'False Positive': 4, 'True Negative': 7, 'Accuracy': 0.9148936170212766, 'AUC': 0.731969137873197, 'ROC Curve FPR': array([0.        , 0.        , 0.        , 0.00369004, 0.00369004,
       0.01107011, 0.01107011, 0.01476015, 0.01476015, 0.02214022,
       0.02214022, 0.46494465, 0.46494465, 0.46863469, 0.46863469,
       0.97785978, 0.97785978, 0.98523985, 0.98523985, 1.        ]), 'ROC Curve TPR': array([0.        , 0.09090909, 0.27272727, 0.27272727, 0.36363636,
       0.36363636, 0.45454545, 0.45454545, 0.54545455, 0.54545455,
       0.63636364, 0.63636364, 0.72727273, 0.72727273, 0.81818182,
       0.81818182, 0.90909091, 0.90909091, 1.        , 1.        ]), 'AUC Value': 0.731969137873197, 'Confusion Matrix': array([[251,  20],
       [  4,   7]]), 'Train Score': {0.9267175572519084}, 'Test Score': {0.9148936170212766}}

 {'Model': 'Support Vector Machine', 'Mean Squared Error': 0.03900709219858156, 'R-squared Score': -0.04059040590405916, 'True Positive': 271, 'False Negative': 0, 'False Positive': 11, 'True Negative': 0, 'Accuracy': 0.9609929078014184, 'AUC': 0.9389466621938947, 'ROC Curve FPR': array([0.        , 0.        , 0.00369004, 0.00369004, 0.00738007,
       0.00738007, 0.01845018, 0.01845018, 0.0295203 , 0.0295203 ,
       0.04428044, 0.04428044, 0.05166052, 0.05166052, 0.16605166,
       0.16605166, 0.30258303, 0.30258303, 1.        ]), 'ROC Curve TPR': array([0.        , 0.09090909, 0.09090909, 0.27272727, 0.27272727,
       0.36363636, 0.36363636, 0.45454545, 0.45454545, 0.54545455,
       0.54545455, 0.72727273, 0.72727273, 0.81818182, 0.81818182,
       0.90909091, 0.90909091, 1.        , 1.        ]), 'AUC Value': 0.9389466621938947, 'Confusion Matrix': array([[271,   0],
       [ 11,   0]]), 'Train Score': {0.9603053435114504}, 'Test Score': {0.9609929078014184}}

4. Model Comparison and Selection¶

Model performance table (Basic)¶

In [ ]:

model_performance = pd.DataFrame(evaluation_results)
model_performance

Out[ ]:

	Model	Mean Squared Error	R-squared Score	True Positive	False Negative	False Positive	True Negative	Accuracy	AUC	ROC Curve FPR	ROC Curve TPR	AUC Value	Confusion Matrix	Train Score	Test Score
0	Logistic Regression	0.031915	0.148608	266	5	4	7	0.968085	0.843677	[0.0, 0.0, 0.0, 0.0036900369003690036, 0.00369...	[0.0, 0.09090909090909091, 0.36363636363636365...	0.843677	[[266, 5], [4, 7]]	{0.9679389312977099}	{0.9680851063829787}
1	k-Nearest Neighbors	0.046099	-0.229789	264	7	6	5	0.953901	0.829755	[0.0, 0.0036900369003690036, 0.025830258302583...	[0.0, 0.0, 0.45454545454545453, 0.727272727272...	0.829755	[[264, 7], [6, 5]]	{0.9526717557251908}	{0.9539007092198581}
2	Decision Tree	0.046099	-0.229789	264	7	6	5	0.953901	0.714358	[0.0, 0.025830258302583026, 1.0]	[0.0, 0.45454545454545453, 1.0]	0.714358	[[264, 7], [6, 5]]	{1.0}	{0.9539007092198581}
3	Random Forest	0.031915	0.148608	270	1	8	3	0.968085	0.905065	[0.0, 0.0036900369003690036, 0.003690036900369...	[0.0, 0.0, 0.2727272727272727, 0.2727272727272...	0.905065	[[270, 1], [8, 3]]	{1.0}	{0.9680851063829787}
4	AdaBoost	0.035461	0.054009	268	3	7	4	0.964539	0.792687	[0.0, 0.0036900369003690036, 0.003690036900369...	[0.0, 0.0, 0.18181818181818182, 0.181818181818...	0.792687	[[268, 3], [7, 4]]	{1.0}	{0.9645390070921985}
5	Bagging	0.046099	-0.229789	267	4	9	2	0.953901	0.859611	[0.0, 0.0, 0.0, 0.014760147601476014, 0.022140...	[0.0, 0.09090909090909091, 0.18181818181818182...	0.859611	[[267, 4], [9, 2]]	{0.9969465648854962}	{0.9539007092198581}
6	Gradient Boosting	0.031915	0.148608	268	3	6	5	0.968085	0.858269	[0.0, 0.0, 0.0036900369003690036, 0.0036900369...	[0.0, 0.09090909090909091, 0.09090909090909091...	0.858269	[[268, 3], [6, 5]]	{1.0}	{0.9680851063829787}
7	Gaussian Naive Bayes	0.085106	-1.270379	251	20	4	7	0.914894	0.731969	[0.0, 0.0, 0.0, 0.0036900369003690036, 0.00369...	[0.0, 0.09090909090909091, 0.2727272727272727,...	0.731969	[[251, 20], [4, 7]]	{0.9267175572519084}	{0.9148936170212766}
8	Support Vector Machine	0.039007	-0.040590	271	0	11	0	0.960993	0.938947	[0.0, 0.0, 0.0036900369003690036, 0.0036900369...	[0.0, 0.09090909090909091, 0.09090909090909091...	0.938947	[[271, 0], [11, 0]]	{0.9603053435114504}	{0.9609929078014184}

Model Comparision & Best fit selection¶

In [ ]:

# Change the below metrics as needed say we only want to get the model on accuracy but then change the sorting parameter accordingly
selected_metrics_for_comparison = [
    "Mean Squared Error",
    "R-squared Score",
    "Accuracy",
    "AUC",
    "Precision",
    "Recall",
    "F1-score",
    "True Positive",
    "True Negative",
    "False Positive",
    "False Negative",
]
# Create a DataFrame to compare the models
df_results = pd.DataFrame(evaluation_results)

# Calculate additional metrics: Precision, Recall, F1-score, True Positive, True Negative, False Positive, False Negative
df_results["Precision"] = df_results["True Positive"] / \
    (df_results["True Positive"] + df_results["False Positive"])

df_results["Recall"] = df_results["True Positive"] / \
    (df_results["True Positive"] + df_results["False Negative"])

df_results["F1-score"] = 2 * (df_results["Precision"] * df_results["Recall"]) / \
    (df_results["Precision"] + df_results["Recall"])

# Sort the DataFrame based on the chosen metrics (lower is better for MSE, higher for others)
df_results_sorted = df_results.sort_values(
    by=selected_metrics_for_comparison, ascending=[ True, False, False, False, False, False, False, False, True, True, False])

# Display the comparison table
print("\nModel Comparison:")
print(tabulate(df_results_sorted[["Model"]+selected_metrics_for_comparison + ["Train Score", "Test Score"]], headers="keys", tablefmt="heavy_grid"))
# Select the best model based on the chosen metrics
best_models = {}
best_models_table = []  # Table to store the best models in tabular format

for metric in selected_metrics_for_comparison:
    best_model_idx = df_results_sorted[metric].idxmin(
    ) if "Error" in metric else df_results_sorted[metric].idxmax()
    best_models[metric] = df_results_sorted.loc[best_model_idx, "Model"]
    best_model_name = best_models[metric]
    best_model_value = df_results_sorted.loc[best_model_idx, metric]
    best_models_table.append([f"Best in ({metric})", best_model_name, best_model_value])

# Display the best models in tabular format
print("\nBest Models:")
print(tabulate(best_models_table, headers=["Metric", "Model Name", "Value"], tablefmt="heavy_grid"))

# Overall Best Model based on a consensus of multiple metrics
consensus_metrics = set(selected_metrics_for_comparison)
overall_best_model_idx = df_results_sorted[selected_metrics_for_comparison].mean(axis=1).idxmax()
overall_best_model = df_results_sorted.loc[overall_best_model_idx, "Model"]
print(f"\nOverall Best Model based on {', '.join(consensus_metrics)}: '{overall_best_model}'")

Model Comparison:
┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Model                  ┃   Mean Squared Error ┃   R-squared Score ┃   Accuracy ┃      AUC ┃   Precision ┃   Recall ┃   F1-score ┃   True Positive ┃   True Negative ┃   False Positive ┃   False Negative ┃ Train Score          ┃ Test Score           ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  3 ┃ Random Forest          ┃            0.0319149 ┃         0.148608  ┃   0.968085 ┃ 0.905065 ┃    0.971223 ┃ 0.99631  ┃   0.983607 ┃             270 ┃               3 ┃                8 ┃                1 ┃ {1.0}                ┃ {0.9680851063829787} ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  6 ┃ Gradient Boosting      ┃            0.0319149 ┃         0.148608  ┃   0.968085 ┃ 0.858269 ┃    0.978102 ┃ 0.98893  ┃   0.983486 ┃             268 ┃               5 ┃                6 ┃                3 ┃ {1.0}                ┃ {0.9680851063829787} ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  0 ┃ Logistic Regression    ┃            0.0319149 ┃         0.148608  ┃   0.968085 ┃ 0.843677 ┃    0.985185 ┃ 0.98155  ┃   0.983364 ┃             266 ┃               7 ┃                4 ┃                5 ┃ {0.9679389312977099} ┃ {0.9680851063829787} ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  4 ┃ AdaBoost               ┃            0.035461  ┃         0.0540087 ┃   0.964539 ┃ 0.792687 ┃    0.974545 ┃ 0.98893  ┃   0.981685 ┃             268 ┃               4 ┃                7 ┃                3 ┃ {1.0}                ┃ {0.9645390070921985} ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  8 ┃ Support Vector Machine ┃            0.0390071 ┃        -0.0405904 ┃   0.960993 ┃ 0.938947 ┃    0.960993 ┃ 1        ┃   0.980108 ┃             271 ┃               0 ┃               11 ┃                0 ┃ {0.9603053435114504} ┃ {0.9609929078014184} ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  5 ┃ Bagging                ┃            0.0460993 ┃        -0.229789  ┃   0.953901 ┃ 0.859611 ┃    0.967391 ┃ 0.98524  ┃   0.976234 ┃             267 ┃               2 ┃                9 ┃                4 ┃ {0.9969465648854962} ┃ {0.9539007092198581} ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  1 ┃ k-Nearest Neighbors    ┃            0.0460993 ┃        -0.229789  ┃   0.953901 ┃ 0.829755 ┃    0.977778 ┃ 0.97417  ┃   0.97597  ┃             264 ┃               5 ┃                6 ┃                7 ┃ {0.9526717557251908} ┃ {0.9539007092198581} ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  2 ┃ Decision Tree          ┃            0.0460993 ┃        -0.229789  ┃   0.953901 ┃ 0.714358 ┃    0.977778 ┃ 0.97417  ┃   0.97597  ┃             264 ┃               5 ┃                6 ┃                7 ┃ {1.0}                ┃ {0.9539007092198581} ┃
┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃  7 ┃ Gaussian Naive Bayes   ┃            0.0851064 ┃        -1.27038   ┃   0.914894 ┃ 0.731969 ┃    0.984314 ┃ 0.926199 ┃   0.954373 ┃             251 ┃               7 ┃                4 ┃               20 ┃ {0.9267175572519084} ┃ {0.9148936170212766} ┃
┗━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

Best Models:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Metric                       ┃ Model Name             ┃       Value ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (Mean Squared Error) ┃ Random Forest          ┃   0.0319149 ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (R-squared Score)    ┃ Random Forest          ┃   0.148608  ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (Accuracy)           ┃ Random Forest          ┃   0.968085  ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (AUC)                ┃ Support Vector Machine ┃   0.938947  ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (Precision)          ┃ Logistic Regression    ┃   0.985185  ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (Recall)             ┃ Support Vector Machine ┃   1         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (F1-score)           ┃ Random Forest          ┃   0.983607  ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (True Positive)      ┃ Support Vector Machine ┃ 271         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (True Negative)      ┃ Logistic Regression    ┃   7         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (False Positive)     ┃ Support Vector Machine ┃  11         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫
┃ Best in (False Negative)     ┃ Gaussian Naive Bayes   ┃  20         ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━┛

Overall Best Model based on False Positive, Precision, F1-score, Mean Squared Error, True Positive, AUC, Recall, False Negative, Accuracy, True Negative, R-squared Score: 'Random Forest'

Code to automate the best model config by using the name from above calculation and below model dictionary¶

In [ ]:

# Models to evaulate the name and relevant parameter to take for best fit model fitting
models_dict = {
    "Logistic Regression": LogisticRegression(max_iter=1000, C=1.0, solver='lbfgs'),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5, weights='uniform'),
    "Decision Tree": DecisionTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, learning_rate=1.0),
    "Bagging": BaggingClassifier(n_estimators=10, max_samples=1.0, max_features=1.0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVM": SVC(probability=True, C=1.0, kernel='rbf')
}

# Retrieve the value using the key
retrieved_value = models_dict.get(overall_best_model)

if retrieved_value is not None:
    selected_model_name = overall_best_model
    selected_model = retrieved_value
    # selected_model_params = retrieved_value.get_params()
    print(f"Best Model Name: {selected_model_name}")
    print(f"\nRetrieved Model Instance: {selected_model}")

Best Model Name: Random Forest

Retrieved Model Instance: RandomForestClassifier()

Best Model Selection & saving as final model¶

In [ ]:

# Manually coding the best model name and changing the below parameter
# f_modelname = "Logistic Regression"
# final_model = LogisticRegression(max_iter=10000, C=1.0, solver="lbfgs")
# final_model.fit(x, y)

In [ ]:

# Automating the above hardcoded values by using the above dictonary and for loop
f_modelname = selected_model_name
f_model = selected_model
print(f"Best Selected Model name : '{f_modelname}' & \nits parameters :\n{f_model.get_params()}")
final_model = f_model
final_model.fit(x, y)

Best Selected Model name : 'Random Forest' & 
its parameters :
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

Out[ ]:

RandomForestClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Q5) Save the best model and Load the model¶

Saving Model as pickle file and dumping it to use later on¶

In [ ]:

# wb - write binary file
pickle.dump(final_model, open(f"{f_modelname}.pkl", "wb"))

Loading the saved model¶

In [ ]:

load_model = pickle.load(open(f"{f_modelname}.pkl", "rb"))  # rb = read binary
print(f"Name of loaded Model : {f_modelname}")
load_model

Name of loaded Model : Random Forest

Out[ ]:

RandomForestClassifier()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [ ]:

# Testing the imported model
print("Length of test data: ", len(load_model.predict(X_test)))
load_model.predict(X_test)

Length of test data:  282

Out[ ]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Q6) Take the original data set and make another dataset by randomly picking 20 data points from the oil spill dataset and applying the saved model to the same.¶

Generating sample data from cleaned df to test on the trained model.¶

In [ ]:

random_datasample = cleaned_df.sample(20)
random_datasample_df = random_datasample.drop("target", axis=1)
print(random_datasample_df.shape)
random_datasample_df.head()

(20, 33)

Out[ ]:

	f_1	f_2	f_3	f_4	f_5	f_7	f_8	f_9	f_10	f_11	...	f_37	f_38	f_39	f_41	f_42	f_45	f_46	f_47	f_48
855	123	12	116.33	377.75	59	52.42	4.44	1011.0	0.09	96.1	...	0.01	10.52	82	524.79	127.28	20.62	0	2475.04	65.88
148	139	56	1646.05	1534.18	55	31.73	5.42	1840.0	0.17	76.1	...	0.01	24.18	78	721.11	223.61	6.12	0	3352.35	66.31
728	81	10	47.20	651.80	37	71.50	8.11	704.0	0.11	115.1	...	0.01	6.12	102	402.49	0.00	0.00	1	4515.09	66.21
902	170	14	26.50	642.79	58	46.79	9.16	1048.0	0.20	108.2	...	0.00	9.69	82	402.49	127.28	4.22	0	4548.47	66.19
434	2	6099	673.25	1730.74	13	25.60	8.10	61516.5	0.32	139.4	...	0.01	441.23	133	0.00	0.00	0.00	0	13101.35	36.49

5 rows × 33 columns

Resetting the index as the randomly generated data has no continuos index (wil delete later,just for understanding)¶

In [ ]:

random_datasample_df.reset_index()

Out[ ]:

	index	f_1	f_2	f_3	f_4	f_5	f_7	f_8	f_9	f_10	...	f_37	f_38	f_39	f_41	f_42	f_45	f_46	f_47	f_48
0	855	123	12	116.33	377.75	59	52.42	4.44	1011.0	0.09	...	0.01	10.52	82	524.79	127.28	20.62	0	2475.04	65.88
1	148	139	56	1646.05	1534.18	55	31.73	5.42	1840.0	0.17	...	0.01	24.18	78	721.11	223.61	6.12	0	3352.35	66.31
2	728	81	10	47.20	651.80	37	71.50	8.11	704.0	0.11	...	0.01	6.12	102	402.49	0.00	0.00	1	4515.09	66.21
3	902	170	14	26.50	642.79	58	46.79	9.16	1048.0	0.20	...	0.00	9.69	82	402.49	127.28	4.22	0	4548.47	66.19
4	434	2	6099	673.25	1730.74	13	25.60	8.10	61516.5	0.32	...	0.01	441.23	133	0.00	0.00	0.00	0	13101.35	36.49
5	473	41	134	1260.22	1237.23	70	27.52	11.30	3374.5	0.41	...	0.02	60.43	133	877.85	391.51	4.42	0	8095.91	36.86
6	409	146	111	827.05	1260.37	118	40.58	6.66	2980.0	0.16	...	0.01	32.00	85	894.43	471.70	3.94	0	6277.01	66.03
7	96	86	86	769.73	1761.26	55	37.55	6.27	3090.0	0.17	...	0.01	44.41	78	1400.89	180.28	14.93	1	15720.91	66.30
8	235	103	214	1186.12	969.47	145	31.31	6.94	6440.0	0.22	...	0.01	77.52	64	1081.67	970.82	1.76	0	5037.66	65.94
9	362	82	71	104.75	1357.72	96	42.37	4.83	1710.0	0.11	...	0.01	16.47	85	608.28	300.00	2.70	0	32773.88	65.97
10	808	76	16	19.00	584.00	62	50.12	7.80	1154.0	0.16	...	0.01	10.28	82	649.00	127.28	10.20	1	3862.06	66.11
11	705	58	15	44.27	631.53	100	71.33	11.22	1191.0	0.16	...	0.01	11.67	102	270.00	270.00	1.50	0	3452.52	66.18
12	558	64	75	1054.00	2724.57	131	26.72	5.83	1845.0	0.22	...	0.02	32.28	143	0.00	0.00	0.00	0	24897.96	36.82
13	760	28	23	57.00	619.65	68	53.61	9.15	2075.0	0.17	...	0.00	23.11	82	1053.42	180.00	12.88	1	4371.27	66.17
14	425	177	59	844.44	1106.71	101	40.80	7.73	1910.0	0.19	...	0.01	24.73	85	509.90	304.14	2.45	0	4065.36	65.96
15	312	19	80	1001.17	876.95	7	39.35	9.33	1560.0	0.24	...	0.00	12.17	85	608.28	350.00	2.35	0	3169.32	65.87
16	874	142	16	6.19	509.25	73	55.81	9.17	1228.0	0.16	...	0.00	11.64	82	569.21	201.25	4.61	0	3807.97	66.01
17	748	16	22	31.23	412.77	62	49.77	8.49	1985.0	0.17	...	0.01	22.11	82	1049.57	127.28	41.23	1	2312.51	65.89
18	812	80	10	73.30	231.60	74	48.00	7.24	884.0	0.15	...	0.01	9.65	82	484.66	90.00	8.98	1	3644.29	65.67
19	159	150	119	1531.03	1772.45	158	37.89	8.56	3620.0	0.23	...	0.01	44.05	78	474.34	696.42	1.20	0	4759.79	66.41

20 rows × 34 columns

Saving the random sample dataset and removing the index¶

In [ ]:

random_datasample_df.to_csv("20_random_sample.csv", index=False)

Loading the sample data and checking basics¶

In [ ]:

testsample_df = pd.read_csv("20_random_sample.csv")
print(
    "Shape of loaded sample dataframe:",
    testsample_df.shape,
    "\n\nSample Dataframe contents",
)
testsample_df

Shape of loaded sample dataframe: (20, 33) 

Sample Dataframe contents

Out[ ]:

	f_1	f_2	f_3	f_4	f_5	f_7	f_8	f_9	f_10	f_11	...	f_37	f_38	f_39	f_41	f_42	f_45	f_46	f_47	f_48
0	123	12	116.33	377.75	59	52.42	4.44	1011.0	0.09	96.1	...	0.01	10.52	82	524.79	127.28	20.62	0	2475.04	65.88
1	139	56	1646.05	1534.18	55	31.73	5.42	1840.0	0.17	76.1	...	0.01	24.18	78	721.11	223.61	6.12	0	3352.35	66.31
2	81	10	47.20	651.80	37	71.50	8.11	704.0	0.11	115.1	...	0.01	6.12	102	402.49	0.00	0.00	1	4515.09	66.21
3	170	14	26.50	642.79	58	46.79	9.16	1048.0	0.20	108.2	...	0.00	9.69	82	402.49	127.28	4.22	0	4548.47	66.19
4	2	6099	673.25	1730.74	13	25.60	8.10	61516.5	0.32	139.4	...	0.01	441.23	133	0.00	0.00	0.00	0	13101.35	36.49
5	41	134	1260.22	1237.23	70	27.52	11.30	3374.5	0.41	55.8	...	0.02	60.43	133	877.85	391.51	4.42	0	8095.91	36.86
6	146	111	827.05	1260.37	118	40.58	6.66	2980.0	0.16	93.1	...	0.01	32.00	85	894.43	471.70	3.94	0	6277.01	66.03
7	86	86	769.73	1761.26	55	37.55	6.27	3090.0	0.17	69.6	...	0.01	44.41	78	1400.89	180.28	14.93	1	15720.91	66.30
8	103	214	1186.12	969.47	145	31.31	6.94	6440.0	0.22	83.1	...	0.01	77.52	64	1081.67	970.82	1.76	0	5037.66	65.94
9	82	71	104.75	1357.72	96	42.37	4.83	1710.0	0.11	103.8	...	0.01	16.47	85	608.28	300.00	2.70	0	32773.88	65.97
10	76	16	19.00	584.00	62	50.12	7.80	1154.0	0.16	112.3	...	0.01	10.28	82	649.00	127.28	10.20	1	3862.06	66.11
11	58	15	44.27	631.53	100	71.33	11.22	1191.0	0.16	102.0	...	0.01	11.67	102	270.00	270.00	1.50	0	3452.52	66.18
12	64	75	1054.00	2724.57	131	26.72	5.83	1845.0	0.22	57.2	...	0.02	32.28	143	0.00	0.00	0.00	0	24897.96	36.82
13	28	23	57.00	619.65	68	53.61	9.15	2075.0	0.17	89.8	...	0.00	23.11	82	1053.42	180.00	12.88	1	4371.27	66.17
14	177	59	844.44	1106.71	101	40.80	7.73	1910.0	0.19	77.2	...	0.01	24.73	85	509.90	304.14	2.45	0	4065.36	65.96
15	19	80	1001.17	876.95	7	39.35	9.33	1560.0	0.24	128.2	...	0.00	12.17	85	608.28	350.00	2.35	0	3169.32	65.87
16	142	16	6.19	509.25	73	55.81	9.17	1228.0	0.16	105.5	...	0.00	11.64	82	569.21	201.25	4.61	0	3807.97	66.01
17	16	22	31.23	412.77	62	49.77	8.49	1985.0	0.17	89.8	...	0.01	22.11	82	1049.57	127.28	41.23	1	2312.51	65.89
18	80	10	73.30	231.60	74	48.00	7.24	884.0	0.15	91.6	...	0.01	9.65	82	484.66	90.00	8.98	1	3644.29	65.67
19	150	119	1531.03	1772.45	158	37.89	8.56	3620.0	0.23	82.2	...	0.01	44.05	78	474.34	696.42	1.20	0	4759.79	66.41

20 rows × 33 columns

In [ ]:

# making prediction on random data
predicted_data = load_model.predict(testsample_df)
print(f"The predicted data from {f_modelname} model:\n", predicted_data)

The predicted data from Random Forest model:
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]

Comparision of Actual and Predicted values by the model¶

In [ ]:

#  Compare the actual data and predicted data
prediction_data = random_datasample.copy()
prediction_data["predicted_target"] = predicted_data

# Print the actual and predicted data
print(f"Actual Data and Predicted Data Comparision based on {f_modelname} model:\n")
# print(prediction_data[["target", "predicted_target"]])
comparision = {
    "Actual Target": random_datasample["target"], "Predicted Target": predicted_data}
final_results = pd.DataFrame(comparision)
final_results

Actual Data and Predicted Data Comparision based on Random Forest model:

Out[ ]:

	Actual Target	Predicted Target
855	0	0
148	0	0
728	0	0
902	0	0
434	0	0
473	0	0
409	0	0
96	0	0
235	0	0
362	1	1
808	0	0
705	0	0
558	0	0
760	0	0
425	0	0
312	0	0
874	0	0
748	0	0
812	0	0
159	0	0

Calculating the correctness of model¶

In [ ]:

# Calculate the number of correct predictions
correct_predictions = (
    prediction_data["predicted_target"] == prediction_data["target"]).sum()

# Calculate the percentage of correct predictions
percentage_correct_predictions = (
    correct_predictions / len(prediction_data)) * 100

# Print the result
print(f"\nPercentage of Correct Predictions: {percentage_correct_predictions:.2f}%")
if (percentage_correct_predictions >= 90):
    print(f"\nOur model based on '{f_modelname}' is well trained having prediction accuracy of {percentage_correct_predictions:.2f}%")
else:
    print(f"Our model based on '{f_modelname}' needs to be trained more to achieve atleast 95% prediction accuracy from our current results : {percentage_correct_predictions:.2f}%")

Percentage of Correct Predictions: 100.00%

Our model based on 'Random Forest' is well trained having prediction accuracy of 100.00%

Saving the results in a output file¶

In [ ]:

# Saving the final results in a output file

final_results.to_csv('final_results.csv', index=False)
with open('final_results.txt', 'w') as f:
    f.write(final_results.to_string())
with open('final_results.txt', 'a') as f:
    f.write(f"\n\n---------------------------------------\nPrinting the results of our {f_modelname} prediction on random 20 data samples.")
    f.write('\nNumber of correct predictions: {}\n'.format(sum(final_results['Actual Target'] == final_results['Predicted Target'])))
    f.write('Percentage of correct predictions: {}%'.format(100 * sum(final_results['Actual Target'] == final_results['Predicted Target']) / len(final_results)))
    # Print the result in output file
    if (percentage_correct_predictions >= 90):
      f.write(f"\nOur model based on '{f_modelname}' is well trained having prediction accuracy of {percentage_correct_predictions:.2f}%")
    else:
      f.write(f"Our model based on '{f_modelname}' needs to be trained more to achieve atleast 95% prediction accuracy from our current results : {percentage_correct_predictions:.2f}%")

	Actual Target	Predicted Target
855	0	0
148	0	0
728	0	0
902	0	0
434	0	0
473	0	0
409	0	0
96	0	0
235	0	0
362	1	1
808	0	0
705	0	0
558	0	0
760	0	0
425	0	0
312	0	0
874	0	0
748	0	0
812	0	0
159	0	0

	Actual Target	Predicted Target
855	0	0
148	0	0
728	0	0
902	0	0
434	0	0
473	0	0
409	0	0
96	0	0
235	0	0
362	1	1
808	0	0
705	0	0
558	0	0
760	0	0
425	0	0
312	0	0
874	0	0
748	0	0
812	0	0
159	0	0

	Actual Target	Predicted Target
855	0	0
148	0	0
728	0	0
902	0	0
434	0	0
473	0	0
409	0	0
96	0	0
235	0	0
362	1	1
808	0	0
705	0	0
558	0	0
760	0	0
425	0	0
312	0	0
874	0	0
748	0	0
812	0	0
159	0	0