Name : Sumit Shamlal Chaure
Batch : 10
Program : Data Science with Python By SkillAcademy
Assignment : Machine Learning Major Assignment
Topics : ML, Model Building, Test & Training
File Downloads :
My Reports & Files :
Note : Certain markdown linkings like page links wont work on google colab/Jupyter but the same on github or vs-code would take you to respective breakpoints as they have advanced markdown support for inline tagging and MD linkings.
Understanding the Problem Statement.
Data Collection (From Sources/API/Files).
Data Checking for analysis.
Exploratory Data Analysis (To get insights of dataset & problem)
Data Pre-Processing.
Model Selection & evaluation.
Model Training.
Choosing the Best Model for Best results.
Testing with new data & checking the factors such as recall, accuracy & precision.
Model Deployment
User testing & benchmarking etc.
Reiterating the steps with new data and building more accurate models.
Use the Oil Spill Dataset and solve the following question by using the dataset.
The dataset was developed by starting with satellite images of the ocean, some of which contain an oil spill and some that do not.
Images were split into sections and processed using computer vision algorithms to provide a vector of features to describe the contents of the image section or patch.
The task is, given a vector that describes the contents of a patch of a satellite image, then predicts whether the patch contains an oil spill or not, e.g. from the illegal or accidental dumping of oil in the ocean.
There are two classes and the goal is to distinguish between spill and non-spill using the features of a given ocean patch.
● Non-Spill: negative case, or majority class.
● Oil Spill: positive case, or minority class.
There are a total of 50 Columns in the Dataset, the output column is named as a target.
Download the Oil Spill Dataset and perform Data cleaning and Data Pre-Processing if Necessary.
Use various methods such as Handling null values, One-Hot Encoding, Imputation, and Scaling of Data Pre-Processing where necessary.
Derive some insights from the dataset.
Apply various Machine Learning techniques to predict the output in the target column, make use of Bagging and Ensemble as required, and find the best model by evaluating the model using Model evaluation techniques.
Save the best model and Load the model.
Take the original dataset and make another dataset by randomly picking 20 data points from the oil spill dataset and applying the saved model to the same.
import pandas as pd # for data cleaning and data pre-processing, CSV file I/O,etc
import numpy as np # linear algebra & for mathematical computation
import matplotlib.pyplot as plt # for visualization
%matplotlib inline
import seaborn as sns # for visualization
from collections import Counter # to count occurrences
from tabulate import tabulate # to make tables for results
import warnings # for warning removals in code output
warnings.filterwarnings('ignore')
# Scalers & Encoders
from sklearn.preprocessing import StandardScaler, LabelEncoder
#train-test split
from sklearn.model_selection import train_test_split
# Metrics
from sklearn.metrics import (mean_squared_error, r2_score,confusion_matrix, classification_report, accuracy_score,roc_auc_score, roc_curve, auc)
# Model Libraries
from sklearn.linear_model import (LinearRegression, LogisticRegression)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (RandomForestClassifier,BaggingClassifier,AdaBoostClassifier,GradientBoostingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import pickle #to save and load model files as pkl file
# 2.1) Importing the dataset (With error handling)
# If you want to upload the dataset directly (Since on Google Colab it will be lost on re-run) - uncomment the below 2 line code and run
from google.colab import files
uploaded = files.upload()
file_path = "oil_spill.csv"
file_name = file_path.split("/")[-1]
try:
# Reading the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)
# Store the filename as an attribute in the DataFrame
df.file_name = file_name
print(f"\n '{df.file_name}' loaded successfully.")
# Exception to check if the file has some error like no file at the path, etc.
except FileNotFoundError:
print(f"Error: '{file_name}' not found at the specified location {file_path}.")
except Exception as e:
print(f"An unexpected error occurred: {e}")
Saving oil_spill.csv to oil_spill.csv 'oil_spill.csv' loaded successfully.
df
f_1 | f_2 | f_3 | f_4 | f_5 | f_6 | f_7 | f_8 | f_9 | f_10 | ... | f_41 | f_42 | f_43 | f_44 | f_45 | f_46 | f_47 | f_48 | f_49 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2558 | 1506.09 | 456.63 | 90 | 6395000 | 40.88 | 7.89 | 29780.0 | 0.19 | ... | 2850.00 | 1000.00 | 763.16 | 135.46 | 3.73 | 0 | 33243.19 | 65.74 | 7.95 | 1 |
1 | 2 | 22325 | 79.11 | 841.03 | 180 | 55812500 | 51.11 | 1.21 | 61900.0 | 0.02 | ... | 5750.00 | 11500.00 | 9593.48 | 1648.80 | 0.60 | 0 | 51572.04 | 65.73 | 6.26 | 0 |
2 | 3 | 115 | 1449.85 | 608.43 | 88 | 287500 | 40.42 | 7.34 | 3340.0 | 0.18 | ... | 1400.00 | 250.00 | 150.00 | 45.13 | 9.33 | 1 | 31692.84 | 65.81 | 7.84 | 1 |
3 | 4 | 1201 | 1562.53 | 295.65 | 66 | 3002500 | 42.40 | 7.97 | 18030.0 | 0.19 | ... | 6041.52 | 761.58 | 453.21 | 144.97 | 13.33 | 1 | 37696.21 | 65.67 | 8.07 | 1 |
4 | 5 | 312 | 950.27 | 440.86 | 37 | 780000 | 41.43 | 7.03 | 3350.0 | 0.17 | ... | 1320.04 | 710.63 | 512.54 | 109.16 | 2.58 | 0 | 29038.17 | 65.66 | 7.35 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
932 | 200 | 12 | 92.42 | 364.42 | 135 | 97200 | 59.42 | 10.34 | 884.0 | 0.17 | ... | 381.84 | 254.56 | 84.85 | 146.97 | 4.50 | 0 | 2593.50 | 65.85 | 6.39 | 0 |
933 | 201 | 11 | 98.82 | 248.64 | 159 | 89100 | 59.64 | 10.18 | 831.0 | 0.17 | ... | 284.60 | 180.00 | 150.00 | 51.96 | 1.90 | 0 | 4361.25 | 65.70 | 6.53 | 0 |
934 | 202 | 14 | 25.14 | 428.86 | 24 | 113400 | 60.14 | 17.94 | 847.0 | 0.30 | ... | 402.49 | 180.00 | 180.00 | 0.00 | 2.24 | 0 | 2153.05 | 65.91 | 6.12 | 0 |
935 | 203 | 10 | 96.00 | 451.30 | 68 | 81000 | 59.90 | 15.01 | 831.0 | 0.25 | ... | 402.49 | 180.00 | 90.00 | 73.48 | 4.47 | 0 | 2421.43 | 65.97 | 6.32 | 0 |
936 | 204 | 11 | 7.73 | 235.73 | 135 | 89100 | 61.82 | 12.24 | 831.0 | 0.20 | ... | 254.56 | 254.56 | 127.28 | 180.00 | 2.00 | 0 | 3782.68 | 65.65 | 6.26 | 0 |
937 rows × 50 columns
Insights: As the assignment is part of major submission i tried to add few exception handling steps to verify things like file not found error. (If we try to import any empty csv in the dataframe it will prompt an exception suggesting no file found at the specified location.)
- I have added an extra commented code for google colab imports directly to upload the csv file from local drive and then start the further process.
- To use on google colab uncomment the start line of code to import the file on google colab drive instance.
- I have not added extraneous code to get the filename from uploaded files and save them directly instead hardcoded/kept it so that even if code is run offline it will work perfectly fine.
Note: As Q.2 needs to do data pre-processing i have added the data cleaning steps in that section directly.
# Check the shape of the DataFrame
print("\nShape of the DataFrame:")
print(df.shape)
print(df.size)
num_rows, num_columns = df.shape
print(f"Rows: {num_rows}, Columns: {num_columns}")
# Display information about the dataset
print(f"\nDataset information for {df.file_name}:")
df.head(3)
df.tail(3)
print("\nDataset information:")
print(df.info())
Shape of the DataFrame: (937, 50) 46850 Rows: 937, Columns: 50 Dataset information for oil_spill.csv: Dataset information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 937 entries, 0 to 936 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 f_1 937 non-null int64 1 f_2 937 non-null int64 2 f_3 937 non-null float64 3 f_4 937 non-null float64 4 f_5 937 non-null int64 5 f_6 937 non-null int64 6 f_7 937 non-null float64 7 f_8 937 non-null float64 8 f_9 937 non-null float64 9 f_10 937 non-null float64 10 f_11 937 non-null float64 11 f_12 937 non-null float64 12 f_13 937 non-null float64 13 f_14 937 non-null float64 14 f_15 937 non-null float64 15 f_16 937 non-null float64 16 f_17 937 non-null float64 17 f_18 937 non-null float64 18 f_19 937 non-null float64 19 f_20 937 non-null float64 20 f_21 937 non-null float64 21 f_22 937 non-null float64 22 f_23 937 non-null int64 23 f_24 937 non-null float64 24 f_25 937 non-null float64 25 f_26 937 non-null float64 26 f_27 937 non-null float64 27 f_28 937 non-null float64 28 f_29 937 non-null float64 29 f_30 937 non-null float64 30 f_31 937 non-null float64 31 f_32 937 non-null float64 32 f_33 937 non-null float64 33 f_34 937 non-null float64 34 f_35 937 non-null int64 35 f_36 937 non-null int64 36 f_37 937 non-null float64 37 f_38 937 non-null float64 38 f_39 937 non-null int64 39 f_40 937 non-null int64 40 f_41 937 non-null float64 41 f_42 937 non-null float64 42 f_43 937 non-null float64 43 f_44 937 non-null float64 44 f_45 937 non-null float64 45 f_46 937 non-null int64 46 f_47 937 non-null float64 47 f_48 937 non-null float64 48 f_49 937 non-null float64 49 target 937 non-null int64 dtypes: float64(39), int64(11) memory usage: 366.1 KB None
# Display the columns & rows of dataset
print(f"The columns of our {file_name} dataframe\n")
print(df.columns)
The columns of our oil_spill.csv dataframe Index(['f_1', 'f_2', 'f_3', 'f_4', 'f_5', 'f_6', 'f_7', 'f_8', 'f_9', 'f_10', 'f_11', 'f_12', 'f_13', 'f_14', 'f_15', 'f_16', 'f_17', 'f_18', 'f_19', 'f_20', 'f_21', 'f_22', 'f_23', 'f_24', 'f_25', 'f_26', 'f_27', 'f_28', 'f_29', 'f_30', 'f_31', 'f_32', 'f_33', 'f_34', 'f_35', 'f_36', 'f_37', 'f_38', 'f_39', 'f_40', 'f_41', 'f_42', 'f_43', 'f_44', 'f_45', 'f_46', 'f_47', 'f_48', 'f_49', 'target'], dtype='object')
print("Missing values in the dataset:\n")
print(df.isnull().sum())
# NA value calculation
nullval = df.isna().sum()
nullval = nullval[nullval > 0]
print("\nSum of Missing values:\n", nullval)
Missing values in the dataset: f_1 0 f_2 0 f_3 0 f_4 0 f_5 0 f_6 0 f_7 0 f_8 0 f_9 0 f_10 0 f_11 0 f_12 0 f_13 0 f_14 0 f_15 0 f_16 0 f_17 0 f_18 0 f_19 0 f_20 0 f_21 0 f_22 0 f_23 0 f_24 0 f_25 0 f_26 0 f_27 0 f_28 0 f_29 0 f_30 0 f_31 0 f_32 0 f_33 0 f_34 0 f_35 0 f_36 0 f_37 0 f_38 0 f_39 0 f_40 0 f_41 0 f_42 0 f_43 0 f_44 0 f_45 0 f_46 0 f_47 0 f_48 0 f_49 0 target 0 dtype: int64 Sum of Missing values: Series([], dtype: int64)
No Null values or NA are present in our dataset.
print("\nChecking for duplicated values:\n")
print(df.duplicated())
print("\nSum of Duplicated Values in Dataframe :", df.duplicated().sum())
Checking for duplicated values: 0 False 1 False 2 False 3 False 4 False ... 932 False 933 False 934 False 935 False 936 False Length: 937, dtype: bool Sum of Duplicated Values in Dataframe : 0
The sum of duplicated value comes to be zero indicating that no duplicates are present in the dataset.
# Calculating the number of missing values or null values in df
total_missing_values = df.isnull().sum().sum()
print("The number of missing values/NA in dataframe :", total_missing_values)
# Calculate the total number of values in df (excluding the missing values)
total_values = df.size
print("Total number of values in dataframe :", total_values)
# percentage of missing values or null values in df
percentage_missing_values = (total_missing_values / total_values) * 100
print("Percentage of missing values or null values in df :",
percentage_missing_values)
The number of missing values/NA in dataframe : 0 Total number of values in dataframe : 46850 Percentage of missing values or null values in df : 0.0
No Missing values are present in the dataset.
print("Value counts of the dataset by datatypes")
df.dtypes.value_counts()
Value counts of the dataset by datatypes
float64 39 int64 11 dtype: int64
print("Unique Value counts inside each columns")
df.nunique()
Unique Value counts inside each columns
f_1 238 f_2 297 f_3 927 f_4 933 f_5 179 f_6 375 f_7 820 f_8 618 f_9 561 f_10 57 f_11 577 f_12 59 f_13 73 f_14 107 f_15 53 f_16 91 f_17 893 f_18 810 f_19 170 f_20 53 f_21 68 f_22 9 f_23 1 f_24 92 f_25 9 f_26 8 f_27 9 f_28 308 f_29 447 f_30 392 f_31 107 f_32 42 f_33 4 f_34 45 f_35 141 f_36 110 f_37 3 f_38 758 f_39 9 f_40 9 f_41 388 f_42 220 f_43 644 f_44 649 f_45 499 f_46 2 f_47 937 f_48 169 f_49 286 target 2 dtype: int64
From the above unique value check many things can be observed which we will need to dive deep in further analysis.
# Check data types of each column
print("Data types of each column:")
df.dtypes
Data types of each column:
f_1 int64 f_2 int64 f_3 float64 f_4 float64 f_5 int64 f_6 int64 f_7 float64 f_8 float64 f_9 float64 f_10 float64 f_11 float64 f_12 float64 f_13 float64 f_14 float64 f_15 float64 f_16 float64 f_17 float64 f_18 float64 f_19 float64 f_20 float64 f_21 float64 f_22 float64 f_23 int64 f_24 float64 f_25 float64 f_26 float64 f_27 float64 f_28 float64 f_29 float64 f_30 float64 f_31 float64 f_32 float64 f_33 float64 f_34 float64 f_35 int64 f_36 int64 f_37 float64 f_38 float64 f_39 int64 f_40 int64 f_41 float64 f_42 float64 f_43 float64 f_44 float64 f_45 float64 f_46 int64 f_47 float64 f_48 float64 f_49 float64 target int64 dtype: object
print("\nSummary statistics:\n")
print(df.describe())
Summary statistics: f_1 f_2 f_3 f_4 f_5 \ count 937.000000 937.000000 937.000000 937.000000 937.000000 mean 81.588047 332.842049 698.707086 870.992209 84.121665 std 64.976730 1931.938570 599.965577 522.799325 45.361771 min 1.000000 10.000000 1.920000 1.000000 0.000000 25% 31.000000 20.000000 85.270000 444.200000 54.000000 50% 64.000000 65.000000 704.370000 761.280000 73.000000 75% 124.000000 132.000000 1223.480000 1260.370000 117.000000 max 352.000000 32389.000000 1893.080000 2724.570000 180.000000 f_6 f_7 f_8 f_9 f_10 ... \ count 9.370000e+02 937.000000 937.000000 937.000000 937.000000 ... mean 7.696964e+05 43.242721 9.127887 3940.712914 0.221003 ... std 3.831151e+06 12.718404 3.588878 8167.427625 0.090316 ... min 7.031200e+04 21.240000 0.830000 667.000000 0.020000 ... 25% 1.250000e+05 33.650000 6.750000 1371.000000 0.160000 ... 50% 1.863000e+05 39.970000 8.200000 2090.000000 0.200000 ... 75% 3.304680e+05 52.420000 10.760000 3435.000000 0.260000 ... max 7.131500e+07 82.640000 24.690000 160740.000000 0.740000 ... f_41 f_42 f_43 f_44 f_45 \ count 937.000000 937.000000 937.000000 937.000000 937.000000 mean 933.928677 427.565582 255.435902 106.112519 5.014002 std 1001.681331 715.391648 534.306194 135.617708 5.029151 min 0.000000 0.000000 0.000000 0.000000 0.000000 25% 450.000000 180.000000 90.800000 50.120000 2.370000 50% 685.420000 270.000000 161.650000 73.850000 3.850000 75% 1053.420000 460.980000 265.510000 125.810000 6.320000 max 11949.330000 11500.000000 9593.480000 1748.130000 76.630000 f_46 f_47 f_48 f_49 target count 937.000000 937.000000 937.000000 937.000000 937.000000 mean 0.128068 7985.718004 61.694386 8.119723 0.043757 std 0.334344 6854.504915 10.412807 2.908895 0.204662 min 0.000000 2051.500000 35.950000 5.810000 0.000000 25% 0.000000 3760.570000 65.720000 6.340000 0.000000 50% 0.000000 5509.430000 65.930000 7.220000 0.000000 75% 0.000000 9521.930000 66.130000 7.840000 0.000000 max 1.000000 55128.460000 66.450000 15.440000 1.000000 [8 rows x 50 columns]
df.select_dtypes(include="category")
categorical_columns = df.select_dtypes(include=["object"]).columns
print("\nCategorical columns:")
print(categorical_columns)
Categorical columns: Index([], dtype='object')
print(f"The descriptive Stats for the {file_name} dataset:")
df.describe()
The descriptive Stats for the oil_spill.csv dataset:
f_1 | f_2 | f_3 | f_4 | f_5 | f_6 | f_7 | f_8 | f_9 | f_10 | ... | f_41 | f_42 | f_43 | f_44 | f_45 | f_46 | f_47 | f_48 | f_49 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 937.000000 | 937.000000 | 937.000000 | 937.000000 | 937.000000 | 9.370000e+02 | 937.000000 | 937.000000 | 937.000000 | 937.000000 | ... | 937.000000 | 937.000000 | 937.000000 | 937.000000 | 937.000000 | 937.000000 | 937.000000 | 937.000000 | 937.000000 | 937.000000 |
mean | 81.588047 | 332.842049 | 698.707086 | 870.992209 | 84.121665 | 7.696964e+05 | 43.242721 | 9.127887 | 3940.712914 | 0.221003 | ... | 933.928677 | 427.565582 | 255.435902 | 106.112519 | 5.014002 | 0.128068 | 7985.718004 | 61.694386 | 8.119723 | 0.043757 |
std | 64.976730 | 1931.938570 | 599.965577 | 522.799325 | 45.361771 | 3.831151e+06 | 12.718404 | 3.588878 | 8167.427625 | 0.090316 | ... | 1001.681331 | 715.391648 | 534.306194 | 135.617708 | 5.029151 | 0.334344 | 6854.504915 | 10.412807 | 2.908895 | 0.204662 |
min | 1.000000 | 10.000000 | 1.920000 | 1.000000 | 0.000000 | 7.031200e+04 | 21.240000 | 0.830000 | 667.000000 | 0.020000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2051.500000 | 35.950000 | 5.810000 | 0.000000 |
25% | 31.000000 | 20.000000 | 85.270000 | 444.200000 | 54.000000 | 1.250000e+05 | 33.650000 | 6.750000 | 1371.000000 | 0.160000 | ... | 450.000000 | 180.000000 | 90.800000 | 50.120000 | 2.370000 | 0.000000 | 3760.570000 | 65.720000 | 6.340000 | 0.000000 |
50% | 64.000000 | 65.000000 | 704.370000 | 761.280000 | 73.000000 | 1.863000e+05 | 39.970000 | 8.200000 | 2090.000000 | 0.200000 | ... | 685.420000 | 270.000000 | 161.650000 | 73.850000 | 3.850000 | 0.000000 | 5509.430000 | 65.930000 | 7.220000 | 0.000000 |
75% | 124.000000 | 132.000000 | 1223.480000 | 1260.370000 | 117.000000 | 3.304680e+05 | 52.420000 | 10.760000 | 3435.000000 | 0.260000 | ... | 1053.420000 | 460.980000 | 265.510000 | 125.810000 | 6.320000 | 0.000000 | 9521.930000 | 66.130000 | 7.840000 | 0.000000 |
max | 352.000000 | 32389.000000 | 1893.080000 | 2724.570000 | 180.000000 | 7.131500e+07 | 82.640000 | 24.690000 | 160740.000000 | 0.740000 | ... | 11949.330000 | 11500.000000 | 9593.480000 | 1748.130000 | 76.630000 | 1.000000 | 55128.460000 | 66.450000 | 15.440000 | 1.000000 |
8 rows × 50 columns
print("Complete Stats of every column")
print(df.describe())
Complete Stats of every column f_1 f_2 f_3 f_4 f_5 \ count 937.000000 937.000000 937.000000 937.000000 937.000000 mean 81.588047 332.842049 698.707086 870.992209 84.121665 std 64.976730 1931.938570 599.965577 522.799325 45.361771 min 1.000000 10.000000 1.920000 1.000000 0.000000 25% 31.000000 20.000000 85.270000 444.200000 54.000000 50% 64.000000 65.000000 704.370000 761.280000 73.000000 75% 124.000000 132.000000 1223.480000 1260.370000 117.000000 max 352.000000 32389.000000 1893.080000 2724.570000 180.000000 f_6 f_7 f_8 f_9 f_10 ... \ count 9.370000e+02 937.000000 937.000000 937.000000 937.000000 ... mean 7.696964e+05 43.242721 9.127887 3940.712914 0.221003 ... std 3.831151e+06 12.718404 3.588878 8167.427625 0.090316 ... min 7.031200e+04 21.240000 0.830000 667.000000 0.020000 ... 25% 1.250000e+05 33.650000 6.750000 1371.000000 0.160000 ... 50% 1.863000e+05 39.970000 8.200000 2090.000000 0.200000 ... 75% 3.304680e+05 52.420000 10.760000 3435.000000 0.260000 ... max 7.131500e+07 82.640000 24.690000 160740.000000 0.740000 ... f_41 f_42 f_43 f_44 f_45 \ count 937.000000 937.000000 937.000000 937.000000 937.000000 mean 933.928677 427.565582 255.435902 106.112519 5.014002 std 1001.681331 715.391648 534.306194 135.617708 5.029151 min 0.000000 0.000000 0.000000 0.000000 0.000000 25% 450.000000 180.000000 90.800000 50.120000 2.370000 50% 685.420000 270.000000 161.650000 73.850000 3.850000 75% 1053.420000 460.980000 265.510000 125.810000 6.320000 max 11949.330000 11500.000000 9593.480000 1748.130000 76.630000 f_46 f_47 f_48 f_49 target count 937.000000 937.000000 937.000000 937.000000 937.000000 mean 0.128068 7985.718004 61.694386 8.119723 0.043757 std 0.334344 6854.504915 10.412807 2.908895 0.204662 min 0.000000 2051.500000 35.950000 5.810000 0.000000 25% 0.000000 3760.570000 65.720000 6.340000 0.000000 50% 0.000000 5509.430000 65.930000 7.220000 0.000000 75% 0.000000 9521.930000 66.130000 7.840000 0.000000 max 1.000000 55128.460000 66.450000 15.440000 1.000000 [8 rows x 50 columns]
# Basic Class summary
print("\nClass distribution:\n")
print(df["target"].value_counts())
# summarize the class distribution
target = df.values[:, -1]
counter = Counter(target)
print("\nClass Distribution Summary:\n")
for k, v in counter.items():
per = v / len(target) * 100
print("Class=%d, Count=%d, Percentage=%.3f%%" % (k, v, per))
Class distribution: 0 896 1 41 Name: target, dtype: int64 Class Distribution Summary: Class=1, Count=41, Percentage=4.376% Class=0, Count=896, Percentage=95.624%
# Countplot for Target Variable
ax = sns.countplot(x=df["target"], palette="husl", alpha=0.7)
plt.title("Countplot for Target Column")
plt.xlabel("Target Variable")
plt.ylabel("Count")
# Loop for annotation
for p in ax.patches:
ax.text(
p.get_x() + p.get_width() / 2.0,
p.get_height(),
f"{p.get_height()}",
ha="center",
va="bottom",
fontsize=8,
color="black",
)
plt.show()
The target column on which we need to work for our classification shows that the dataset indicates:
More info on color pallete
fig = plt.figure(figsize=(25, 35))
ax = fig.gca()
_ = df.hist(ax=ax, color="green", edgecolor="black")
# Add a title at the top of the subplots
plt.suptitle("Histograms of DataFrame Columns", y=0.90, fontsize=24)
plt.show()
piechart = df["target"].value_counts()
# Create a pie chart with labels, numbers, and percentages
plt.pie(piechart, labels=["No-Spill", "Oil-Spill"],
autopct="%0.1f%%", radius=1)
plt.title("Distribution of Spill Classes")
plt.show()
f_23_distribution = df["f_23"].value_counts()
labels = f_23_distribution.index.map(str)
values = f_23_distribution.values
plt.pie(values, labels=labels, autopct="%0.1f%%", radius=1)
plt.title(f'Distribution of column "F_23"')
plt.show()
# Set the number of subplots per row
subplots_per_row = 5
# Calculate the number of rows needed based on the number of columns and subplots per row
num_rows = (len(df.columns) - 1) // subplots_per_row + 1
# Set up the subplots
fig, axes = plt.subplots(
nrows=num_rows, ncols=subplots_per_row, figsize=(25, 25))
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Iterate over each column and create boxplots
for ax, column in zip(axes, df.columns):
sns.boxplot(x=df[column], ax=ax, color="orange", width=0.5)
ax.set_title(column, fontsize=14)
ax.set_xlabel("Count")
ax.set_ylabel("Values")
# Adjust layout for better spacing between subplots
plt.tight_layout()
# Add a common title at the top of the subplots
fig.suptitle(
"Boxplots of DataFrame Columns (To Display Outliers in Dataset)",
y=1.02,
fontsize=24,
)
# Show the plots
plt.show()
corr_matrix = df.corr()
df.corr()
f_1 | f_2 | f_3 | f_4 | f_5 | f_6 | f_7 | f_8 | f_9 | f_10 | ... | f_41 | f_42 | f_43 | f_44 | f_45 | f_46 | f_47 | f_48 | f_49 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
f_1 | 1.000000 | -0.155581 | 0.172017 | -0.104116 | -0.017025 | -0.169533 | -0.037412 | -0.204983 | -0.244551 | -0.214447 | ... | -0.286190 | -0.167466 | -0.156916 | -0.141792 | -0.139478 | -0.163693 | -0.202983 | 0.294422 | -0.253698 | -0.180531 |
f_2 | -0.155581 | 1.000000 | 0.058390 | 0.052638 | -0.036870 | 0.953947 | -0.136761 | -0.016822 | 0.829978 | 0.128465 | ... | 0.555154 | 0.777807 | 0.800939 | 0.716496 | -0.080879 | -0.048315 | 0.118792 | -0.128222 | 0.139417 | 0.034128 |
f_3 | 0.172017 | 0.058390 | 1.000000 | 0.549510 | -0.082764 | 0.050795 | -0.627934 | -0.349541 | 0.158686 | 0.073794 | ... | 0.186920 | 0.178287 | 0.129653 | 0.176883 | -0.088310 | -0.182458 | -0.022098 | 0.048291 | 0.162600 | -0.035221 |
f_4 | -0.104116 | 0.052638 | 0.549510 | 1.000000 | 0.048847 | 0.024693 | -0.546205 | -0.222063 | 0.097683 | 0.202167 | ... | -0.046934 | 0.032402 | 0.022234 | 0.000664 | -0.220461 | -0.204776 | 0.106758 | -0.394081 | 0.476127 | -0.050489 |
f_5 | -0.017025 | -0.036870 | -0.082764 | 0.048847 | 1.000000 | -0.028431 | 0.059128 | 0.123814 | -0.047879 | 0.098573 | ... | -0.066930 | -0.014877 | -0.013742 | -0.012346 | -0.076695 | -0.080136 | 0.070070 | -0.135294 | 0.116896 | -0.078598 |
f_6 | -0.169533 | 0.953947 | 0.050795 | 0.024693 | -0.028431 | 1.000000 | -0.093589 | -0.001395 | 0.894150 | 0.097449 | ... | 0.594273 | 0.844597 | 0.868353 | 0.770044 | -0.077783 | -0.046834 | 0.126850 | -0.058752 | 0.069731 | 0.049318 |
f_7 | -0.037412 | -0.136761 | -0.627934 | -0.546205 | 0.059128 | -0.093589 | 1.000000 | 0.381206 | -0.188076 | -0.380340 | ... | -0.115014 | -0.100003 | -0.074308 | -0.073751 | 0.077207 | 0.088633 | -0.157243 | 0.483034 | -0.612819 | -0.026183 |
f_8 | -0.204983 | -0.016822 | -0.349541 | -0.222063 | 0.123814 | -0.001395 | 0.381206 | 1.000000 | 0.001073 | 0.670628 | ... | 0.013476 | -0.015712 | -0.013193 | 0.002439 | -0.061639 | -0.051879 | -0.028117 | -0.101155 | 0.033731 | -0.014434 |
f_9 | -0.244551 | 0.829978 | 0.158686 | 0.097683 | -0.047879 | 0.894150 | -0.188076 | 0.001073 | 1.000000 | 0.164098 | ... | 0.675610 | 0.784833 | 0.770129 | 0.736075 | -0.073312 | -0.048994 | 0.102540 | -0.080203 | 0.113389 | 0.076679 |
f_10 | -0.214447 | 0.128465 | 0.073794 | 0.202167 | 0.098573 | 0.097449 | -0.380340 | 0.670628 | 0.164098 | 1.000000 | ... | 0.082449 | 0.052518 | 0.043116 | 0.042269 | -0.113481 | -0.095896 | 0.112275 | -0.587156 | 0.603358 | -0.013359 |
f_11 | -0.261624 | 0.745590 | -0.064076 | -0.082742 | -0.075843 | 0.765628 | 0.093376 | 0.167904 | 0.671358 | 0.102331 | ... | 0.630674 | 0.782581 | 0.790649 | 0.710990 | -0.160260 | -0.114133 | 0.127889 | 0.056237 | -0.067659 | 0.157588 |
f_12 | -0.209190 | 0.004035 | -0.081738 | 0.106767 | 0.009470 | -0.029363 | -0.363593 | 0.406409 | -0.008391 | 0.747509 | ... | -0.088211 | -0.135129 | -0.121701 | -0.147694 | -0.018188 | 0.045217 | 0.073414 | -0.610604 | 0.594751 | 0.018417 |
f_13 | -0.222342 | 0.020195 | 0.042723 | 0.224342 | 0.013574 | -0.017706 | -0.481003 | 0.289904 | 0.018342 | 0.730810 | ... | -0.084692 | -0.120182 | -0.109534 | -0.140570 | -0.067821 | 0.008266 | 0.128967 | -0.665751 | 0.674792 | 0.036129 |
f_14 | -0.220721 | 0.176080 | 0.299324 | 0.335270 | -0.016254 | 0.155767 | -0.574566 | 0.178362 | 0.261617 | 0.652360 | ... | 0.177034 | 0.141294 | 0.117372 | 0.130096 | -0.145173 | -0.104025 | 0.104333 | -0.539941 | 0.600364 | 0.044022 |
f_15 | -0.137901 | -0.118317 | -0.301641 | -0.039329 | 0.028305 | -0.147712 | -0.115334 | 0.335692 | -0.215468 | 0.502049 | ... | -0.292963 | -0.293204 | -0.250771 | -0.308273 | 0.035983 | 0.128789 | 0.071589 | -0.501766 | 0.443858 | -0.008092 |
f_16 | -0.178220 | 0.235500 | 0.439603 | 0.372116 | -0.029425 | 0.226015 | -0.563544 | 0.051995 | 0.365164 | 0.487945 | ... | 0.305778 | 0.269345 | 0.227190 | 0.262997 | -0.169739 | -0.162068 | 0.082550 | -0.369724 | 0.457185 | 0.050515 |
f_17 | 0.056430 | 0.237388 | -0.003753 | -0.000815 | 0.045836 | 0.302462 | -0.008360 | -0.245330 | 0.160027 | -0.231361 | ... | 0.119187 | 0.361130 | 0.392898 | 0.287938 | -0.055731 | -0.054833 | 0.368569 | 0.078798 | -0.081241 | 0.014977 |
f_18 | 0.027526 | 0.321276 | -0.046857 | -0.020119 | 0.065762 | 0.406917 | 0.027642 | -0.188000 | 0.207135 | -0.196430 | ... | 0.144958 | 0.459463 | 0.509775 | 0.353361 | -0.048525 | -0.040779 | 0.329074 | 0.058242 | -0.070072 | -0.006263 |
f_19 | 0.038746 | 0.022253 | 0.599107 | 0.494286 | -0.065304 | 0.046484 | -0.134812 | -0.254853 | 0.087192 | -0.190016 | ... | 0.181168 | 0.197681 | 0.167883 | 0.178385 | -0.073405 | -0.156083 | 0.074076 | 0.263886 | -0.134275 | 0.022329 |
f_20 | -0.159138 | -0.053111 | -0.193047 | -0.011078 | 0.048283 | -0.075633 | -0.222835 | 0.505829 | -0.080864 | 0.725196 | ... | -0.155566 | -0.193686 | -0.172891 | -0.201335 | 0.050078 | 0.103484 | 0.011602 | -0.519132 | 0.481139 | -0.049940 |
f_21 | -0.170053 | -0.057095 | -0.033185 | 0.132370 | 0.032183 | -0.080839 | -0.386666 | 0.361435 | -0.069741 | 0.722926 | ... | -0.147508 | -0.181691 | -0.165905 | -0.196407 | 0.002484 | 0.059500 | 0.086494 | -0.592765 | 0.587612 | -0.017439 |
f_22 | -0.240241 | 0.140805 | 0.299426 | 0.615556 | 0.122604 | 0.076952 | -0.606984 | -0.086842 | 0.120763 | 0.474004 | ... | -0.050153 | 0.002071 | 0.010254 | -0.045236 | -0.188274 | -0.148988 | 0.317913 | -0.870454 | 0.916544 | 0.035323 |
f_23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
f_24 | 0.026409 | -0.142899 | -0.669548 | -0.667478 | 0.001713 | -0.097027 | 0.906524 | 0.360343 | -0.186528 | -0.362216 | ... | -0.096460 | -0.111014 | -0.088658 | -0.074335 | 0.133210 | 0.150626 | -0.229779 | 0.568317 | -0.701463 | -0.040364 |
f_25 | -0.260500 | 0.131439 | 0.059858 | 0.466752 | 0.130162 | 0.060381 | -0.565818 | 0.041689 | 0.095497 | 0.584004 | ... | -0.101886 | -0.070467 | -0.052167 | -0.111762 | -0.147284 | -0.076648 | 0.279861 | -0.990288 | 0.986720 | -0.013202 |
f_26 | 0.397330 | -0.088328 | -0.013191 | -0.503382 | -0.109169 | -0.042262 | 0.138941 | -0.167225 | -0.039744 | -0.377968 | ... | 0.083950 | 0.044717 | 0.018422 | 0.096065 | 0.126362 | 0.064918 | -0.260992 | 0.700420 | -0.690917 | -0.054643 |
f_27 | 0.404138 | -0.013225 | 0.627137 | 0.115749 | -0.110526 | 0.010893 | -0.393415 | -0.455887 | 0.084778 | -0.252999 | ... | 0.186856 | 0.171926 | 0.118393 | 0.200911 | -0.012176 | -0.122450 | -0.147258 | 0.467314 | -0.328293 | -0.068181 |
f_28 | -0.173291 | 0.091962 | -0.119572 | 0.113152 | 0.138419 | 0.047533 | -0.145690 | 0.349434 | 0.059639 | 0.548929 | ... | -0.009764 | -0.027413 | -0.019933 | -0.042534 | -0.094926 | -0.043885 | 0.259758 | -0.515659 | 0.489585 | 0.061178 |
f_29 | -0.158883 | 0.170798 | 0.012991 | 0.167272 | 0.045215 | 0.118327 | -0.283833 | 0.175192 | 0.145497 | 0.507455 | ... | 0.089752 | 0.039657 | 0.037170 | 0.017672 | -0.036557 | -0.011833 | 0.243373 | -0.481778 | 0.484850 | 0.021424 |
f_30 | 0.237770 | -0.163065 | -0.368946 | -0.551173 | -0.094680 | -0.100486 | 0.725108 | 0.082133 | -0.167507 | -0.536527 | ... | -0.018595 | -0.031007 | -0.027820 | 0.007076 | 0.158981 | 0.111419 | -0.299345 | 0.831027 | -0.907202 | -0.050517 |
f_31 | 0.035824 | 0.005472 | -0.097925 | -0.358789 | -0.175452 | 0.058062 | 0.228706 | -0.013229 | 0.070491 | -0.233577 | ... | 0.266927 | 0.149927 | 0.132506 | 0.153446 | 0.268631 | 0.208135 | -0.115635 | 0.487006 | -0.499680 | 0.041730 |
f_32 | -0.094846 | 0.118776 | 0.585351 | 0.686419 | 0.062919 | 0.060508 | -0.818055 | -0.235055 | 0.172914 | 0.432378 | ... | 0.068535 | 0.063352 | 0.036559 | 0.041763 | -0.184435 | -0.199718 | 0.231370 | -0.673397 | 0.785353 | 0.013173 |
f_33 | -0.036654 | -0.009433 | -0.061054 | -0.064612 | 0.044074 | -0.009910 | 0.101277 | 0.082125 | -0.020690 | 0.003666 | ... | -0.017361 | -0.026800 | -0.020555 | -0.025864 | 0.063069 | 0.091043 | 0.051232 | 0.022010 | -0.034338 | -0.012170 |
f_34 | -0.091356 | 0.118634 | 0.585760 | 0.686369 | 0.059118 | 0.060826 | -0.819826 | -0.239614 | 0.173239 | 0.428950 | ... | 0.069370 | 0.064939 | 0.037864 | 0.043431 | -0.187893 | -0.205192 | 0.225773 | -0.670186 | 0.782269 | 0.014008 |
f_35 | -0.225343 | 0.869227 | 0.178255 | 0.151832 | -0.044723 | 0.884713 | -0.246512 | -0.021463 | 0.979517 | 0.201878 | ... | 0.624000 | 0.749043 | 0.739734 | 0.698957 | -0.093407 | -0.065521 | 0.118231 | -0.169488 | 0.203509 | 0.046540 |
f_36 | -0.216387 | 0.873996 | 0.177423 | 0.147977 | -0.039225 | 0.892963 | -0.239744 | -0.019355 | 0.980876 | 0.198350 | ... | 0.625060 | 0.758187 | 0.748835 | 0.708918 | -0.095036 | -0.068246 | 0.116058 | -0.163067 | 0.196819 | 0.040756 |
f_37 | 0.281274 | -0.148739 | 0.246582 | 0.284814 | 0.100720 | -0.179259 | -0.386837 | -0.250518 | -0.209671 | 0.085596 | ... | -0.307490 | -0.242312 | -0.232705 | -0.244889 | -0.007764 | -0.028332 | 0.080429 | -0.354068 | 0.398088 | -0.100417 |
f_38 | -0.260929 | 0.443913 | 0.332342 | 0.262117 | -0.006470 | 0.480804 | -0.383085 | -0.086931 | 0.778100 | 0.221490 | ... | 0.579851 | 0.496987 | 0.437209 | 0.485860 | 0.008159 | -0.006834 | 0.108795 | -0.180859 | 0.247958 | 0.041885 |
f_39 | -0.452966 | 0.080779 | -0.279094 | 0.282325 | 0.204627 | 0.021929 | -0.130835 | 0.297139 | 0.003008 | 0.517838 | ... | -0.180546 | -0.134773 | -0.097368 | -0.173564 | -0.154701 | -0.052587 | 0.305194 | -0.884484 | 0.811961 | 0.033768 |
f_40 | -0.499695 | 0.071089 | -0.165125 | 0.344152 | 0.232303 | 0.021595 | -0.051147 | 0.281511 | 0.001339 | 0.424904 | ... | -0.152639 | -0.087815 | -0.057056 | -0.125037 | -0.194900 | -0.117922 | 0.333216 | -0.740305 | 0.691901 | 0.066220 |
f_41 | -0.286190 | 0.555154 | 0.186920 | -0.046934 | -0.066930 | 0.594273 | -0.115014 | 0.013476 | 0.675610 | 0.082449 | ... | 1.000000 | 0.703587 | 0.632130 | 0.714021 | 0.179308 | 0.106656 | 0.083573 | 0.120024 | -0.068486 | 0.148987 |
f_42 | -0.167466 | 0.777807 | 0.178287 | 0.032402 | -0.014877 | 0.844597 | -0.100003 | -0.015712 | 0.784833 | 0.052518 | ... | 0.703587 | 1.000000 | 0.979836 | 0.932383 | -0.120047 | -0.137111 | 0.119719 | 0.090195 | -0.047456 | 0.050657 |
f_43 | -0.156916 | 0.800939 | 0.129653 | 0.022234 | -0.013742 | 0.868353 | -0.074308 | -0.013193 | 0.770129 | 0.043116 | ... | 0.632130 | 0.979836 | 1.000000 | 0.860925 | -0.131742 | -0.114537 | 0.134836 | 0.064703 | -0.034036 | 0.046533 |
f_44 | -0.141792 | 0.716496 | 0.176883 | 0.000664 | -0.012346 | 0.770044 | -0.073751 | 0.002439 | 0.736075 | 0.042269 | ... | 0.714021 | 0.932383 | 0.860925 | 1.000000 | -0.098196 | -0.156458 | 0.072550 | 0.133416 | -0.089327 | 0.031244 |
f_45 | -0.139478 | -0.080879 | -0.088310 | -0.220461 | -0.076695 | -0.077783 | 0.077207 | -0.061639 | -0.073312 | -0.113481 | ... | 0.179308 | -0.120047 | -0.131742 | -0.098196 | 1.000000 | 0.545285 | -0.061429 | 0.130842 | -0.141206 | 0.016261 |
f_46 | -0.163693 | -0.048315 | -0.182458 | -0.204776 | -0.080136 | -0.046834 | 0.088633 | -0.051879 | -0.048994 | -0.095896 | ... | 0.106656 | -0.137111 | -0.114537 | -0.156458 | 0.545285 | 1.000000 | -0.011024 | 0.047073 | -0.079484 | 0.058537 |
f_47 | -0.202983 | 0.118792 | -0.022098 | 0.106758 | 0.070070 | 0.126850 | -0.157243 | -0.028117 | 0.102540 | 0.112275 | ... | 0.083573 | 0.119719 | 0.134836 | 0.072550 | -0.061429 | -0.011024 | 1.000000 | -0.292330 | 0.299541 | 0.436890 |
f_48 | 0.294422 | -0.128222 | 0.048291 | -0.394081 | -0.135294 | -0.058752 | 0.483034 | -0.101155 | -0.080203 | -0.587156 | ... | 0.120024 | 0.090195 | 0.064703 | 0.133416 | 0.130842 | 0.047073 | -0.292330 | 1.000000 | -0.974548 | -0.003163 |
f_49 | -0.253698 | 0.139417 | 0.162600 | 0.476127 | 0.116896 | 0.069731 | -0.612819 | 0.033731 | 0.113389 | 0.603358 | ... | -0.068486 | -0.047456 | -0.034036 | -0.089327 | -0.141206 | -0.079484 | 0.299541 | -0.974548 | 1.000000 | 0.008365 |
target | -0.180531 | 0.034128 | -0.035221 | -0.050489 | -0.078598 | 0.049318 | -0.026183 | -0.014434 | 0.076679 | -0.013359 | ... | 0.148987 | 0.050657 | 0.046533 | 0.031244 | 0.016261 | 0.058537 | 0.436890 | -0.003163 | 0.008365 | 1.000000 |
50 rows × 50 columns
plt.figure(figsize=(50, 50))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm",
fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix - Oil Spill Dataset", fontsize=16)
plt.show()
Insights:
Multicollinearity: Highly correlated features can lead to multicollinearity, which can make it difficult for models to accurately estimate the individual effects of each feature.This can result in:
Increased variance in model coefficients, making them less reliable.
Reduced model interpretability, as it's unclear which features are truly driving predictions.
Overfitting: Highly correlated variables can lead to overfitting in some models, especially if the dataset is not large enough.
Model Stability: Unnecessary redundancy in features may lead to less stable model performance.
Removing Highly Correlated Columns
# Correlation matrix
corr_matrix = df.corr()
# Selecting upper triangle of correlation matrix
upper = corr_matrix.where(
np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features/columns with correlation greater than 0.90
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]
print(f"Columns with high correlation values : \n{to_drop} \n\nTotal Correlated columns : {len(to_drop)}")
Columns with high correlation values : ['f_6', 'f_13', 'f_16', 'f_18', 'f_20', 'f_21', 'f_24', 'f_25', 'f_34', 'f_35', 'f_36', 'f_40', 'f_43', 'f_44', 'f_49'] Total Correlated columns : 15
# Drop features/columns
df1 = df.copy()
df1.drop(to_drop, axis=1, inplace=True)
# dropping F_23 since it only has single value
f23 = "f_23"
df1.drop(f23, axis=1, inplace=True)
cleaned_df = df1.copy()
print("Original Dataframe:", df.shape)
print("Cleaned Dataframe (Highly correlated columns removed):", cleaned_df.shape)
print("\nRemoved Columns from dataset:\n", to_drop + [f23])
Original Dataframe: (937, 50) Cleaned Dataframe (Highly correlated columns removed): (937, 34) Removed Columns from dataset: ['f_6', 'f_13', 'f_16', 'f_18', 'f_20', 'f_21', 'f_24', 'f_25', 'f_34', 'f_35', 'f_36', 'f_40', 'f_43', 'f_44', 'f_49', 'f_23']
x = cleaned_df.drop("target", axis=1)
y = cleaned_df["target"]
print(type(x))
print(type(y))
print(x.shape)
print(y.shape)
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'> (937, 33) (937,)
# splitting the dataset into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(
x, y, test_size=0.3, random_state=42
)
print(f"Split Check Test values : {937 * 0.3} & Train values : {937 * 0.7}")
# rows , columns
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Split Check Test values : 281.09999999999997 & Train values : 655.9 (655, 33) (282, 33) (655,) (282,)
X_train, X_test
( f_1 f_2 f_3 f_4 f_5 f_7 f_8 f_9 f_10 f_11 \ 757 25 18 78.11 456.00 70 48.94 7.79 1588.0 0.16 91.8 693 46 18 153.39 464.39 13 70.33 15.87 1228.0 0.23 118.7 854 122 14 141.86 446.50 66 51.71 5.44 1064.0 0.10 106.6 501 7 603 299.61 1472.03 114 25.01 8.98 7654.0 0.36 110.8 664 17 162 7.70 546.00 64 70.65 15.70 5362.0 0.22 244.7 .. ... ... ... ... ... ... ... ... ... ... 106 96 73 1391.75 934.48 54 41.23 6.75 2570.0 0.16 71.0 270 227 63 1139.70 934.33 138 31.78 7.68 2730.0 0.24 57.7 860 128 15 80.87 264.07 61 54.40 7.73 1302.0 0.14 93.3 435 3 32389 874.99 1210.98 35 24.62 9.75 62250.0 0.40 731.7 102 92 121 1171.12 1388.43 66 40.15 9.26 3440.0 0.23 87.9 ... f_33 f_37 f_38 f_39 f_41 f_42 f_45 f_46 f_47 \ 757 ... 0.0 0.00 17.30 82 853.81 180.00 12.20 0 2674.72 693 ... 0.0 0.00 10.34 102 742.16 270.00 6.00 0 8227.75 854 ... 0.0 0.00 9.98 82 484.66 180.00 3.85 0 3830.45 501 ... 0.0 0.01 69.09 143 0.00 0.00 0.00 0 7780.79 664 ... 0.0 0.00 21.91 102 1288.60 1170.00 1.47 0 3777.50 .. ... ... ... ... ... ... ... ... ... ... 106 ... 0.0 0.01 36.19 78 610.33 500.00 2.54 0 8744.58 270 ... 0.0 0.01 47.32 64 992.47 282.84 11.70 0 4915.12 860 ... 0.0 0.01 13.95 82 524.79 127.28 6.87 0 3235.86 435 ... 0.0 0.00 85.08 133 6740.41 8789.57 1.00 0 9422.57 102 ... 0.0 0.01 39.12 78 1411.56 335.41 8.82 0 4880.79 f_48 757 65.96 693 66.01 854 65.98 501 36.22 664 66.06 .. ... 106 65.98 270 65.92 860 65.71 435 36.59 102 66.18 [655 rows x 33 columns], f_1 f_2 f_3 f_4 f_5 f_7 f_8 f_9 f_10 f_11 ... \ 321 29 105 881.92 1128.79 83 38.90 8.51 2710.0 0.22 96.9 ... 70 60 111 1153.32 1283.44 41 41.25 5.98 1760.0 0.14 157.7 ... 209 17 867 1059.49 581.31 46 31.08 8.26 15780.0 0.27 137.4 ... 656 9 85 71.06 469.47 140 70.85 11.28 4626.0 0.16 148.8 ... 685 38 15 32.47 582.13 156 73.27 12.11 1080.0 0.17 112.5 ... .. ... ... ... ... ... ... ... ... ... ... ... 430 183 51 1340.16 898.61 64 42.45 7.88 1430.0 0.19 89.2 ... 292 317 117 1269.88 917.89 123 29.16 8.85 2440.0 0.30 119.9 ... 412 151 64 991.70 1018.53 175 37.52 9.27 1400.0 0.25 114.3 ... 557 63 59 1253.20 1192.53 76 29.51 7.32 1664.0 0.25 49.9 ... 133 123 72 1606.14 1110.06 99 36.50 6.89 1760.0 0.19 102.3 ... f_33 f_37 f_38 f_39 f_41 f_42 f_45 f_46 f_47 f_48 321 0.0 0.00 27.98 85 955.25 353.55 4.21 0 3425.75 65.97 70 0.0 0.00 11.16 78 710.63 500.00 2.40 0 5915.80 66.12 209 0.0 0.00 114.88 64 3146.82 1131.37 4.93 0 5679.31 65.74 656 0.0 0.00 31.08 102 1279.14 509.12 3.95 0 6376.53 65.98 685 0.0 0.01 9.60 102 685.42 201.25 6.47 0 3285.95 66.11 .. ... ... ... ... ... ... ... ... ... ... 430 0.0 0.01 16.04 85 604.15 269.26 4.07 0 5842.59 65.93 292 0.0 0.01 20.35 64 610.33 721.11 1.51 0 5512.84 65.93 412 0.0 0.01 12.25 85 450.00 300.00 1.76 0 2914.09 65.94 557 0.0 0.02 33.37 143 582.16 201.94 5.85 0 26008.35 36.85 133 0.0 0.01 17.21 78 559.02 304.14 2.30 0 4456.55 66.10 [282 rows x 33 columns])
y_train, y_test
(757 0 693 0 854 0 501 0 664 1 .. 106 0 270 0 860 0 435 0 102 0 Name: target, Length: 655, dtype: int64, 321 0 70 0 209 0 656 0 685 0 .. 430 0 292 0 412 0 557 0 133 0 Name: target, Length: 282, dtype: int64)
sc = StandardScaler()
sc.fit_transform(X_train)
sc.transform(X_test)
array([[-0.84347997, -0.12493331, 0.30561268, ..., -0.38616422, -0.69720058, 0.41600293], [-0.36292832, -0.12189254, 0.75321954, ..., -0.38616422, -0.30840223, 0.43029588], [-1.02949997, 0.26124516, 0.59847027, ..., -0.38616422, -0.34532796, 0.39408706], ..., [ 1.0477233 , -0.14571195, 0.48666751, ..., -0.38616422, -0.77709157, 0.41314433], [-0.31642332, -0.14824593, 0.91794678, ..., -0.38616422, 2.82886418, -2.35873678], [ 0.61367665, -0.14165758, 1.50003362, ..., -0.38616422, -0.53625066, 0.42839016]])
# Function to evaluate and store results in a dictionary
def calculate_scores(model, X_train, y_train, X_test, y_test):
train_score = accuracy_score(
y_train, model.predict(X_train)
) # Calculate train score
test_score = accuracy_score(
y_test, model.predict(X_test)) # Calculate test score
return train_score, test_score
def evaluate_model(model, model_name, X_test, y_test):
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# Mean Squared Error and R-squared Score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Confusion Matrix and classification report
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
auc_score = roc_auc_score(y_test, y_prob)
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc_value = auc(fpr, tpr)
cls_report = classification_report(y_test, y_pred, zero_division=0)
# Display results in tabular format
results_table = [
["Model", model_name],
["Mean Squared Error", mse],
["R-squared Score", r2],
["Confusion Matrix", f"{cm}"],
["True Positive", cm[0, 0]],
["False Negative", cm[0, 1]],
["False Positive", cm[1, 0]],
["True Negative", cm[1, 1]],
["Accuracy", acc],
["AUC", auc_score],
["Train Score", train_score],
["Test Score", test_score],
]
print(tabulate(results_table, headers=[
"Metric", "Value"], tablefmt="heavy_grid"))
# Display Classification Report
print("\nClassification Report:\n")
print(cls_report)
# Plot Confusion Matrix
plt.matshow(cm, cmap=plt.cm.Reds)
plt.title(f"Confusion Matrix for {model_name}")
plt.colorbar()
plt.xlabel("Predicted")
plt.ylabel("True")
# Add annotations to matrix
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
plt.text(j, i, str(cm[i, j]), ha="center",
va="center", color="black")
plt.show()
# Store results in the dictionary
return {
"Model": model_name,
"Mean Squared Error": mse,
"R-squared Score": r2,
"True Positive": cm[0, 0],
"False Negative": cm[0, 1],
"False Positive": cm[1, 0],
"True Negative": cm[1, 1],
"Accuracy": acc,
"AUC": auc_score,
"ROC Curve FPR": fpr,
"ROC Curve TPR": tpr,
"AUC Value": auc_value,
"Confusion Matrix": cm,
"Train Score": {train_score},
"Test Score": {test_score},
}
# Function to plot ROC curve
def plot_roc_curve(model, X_test, y_test):
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc_value = auc(fpr, tpr)
# Plot ROC curve
plt.plot(fpr, tpr, color="orange",
label=f"ROC Curve (AUC = {auc_value:.4f})")
plt.plot([0, 1], [0, 1], label="TPR=FPR", linestyle="--")
plt.title(f"ROC Curve for {model_name}")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.grid()
plt.legend()
plt.show()
models = [ ("Logistic Regression", LogisticRegression(max_iter=1000, C=1.0, solver='lbfgs')), ("k-Nearest Neighbors", KNeighborsClassifier(n_neighbors=5, weights='uniform')), ("Decision Tree", DecisionTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1)), ("Random Forest", RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1)), ("AdaBoost", AdaBoostClassifier(n_estimators=50, learning_rate=1.0)), ("Bagging", BaggingClassifier(n_estimators=10, max_samples=1.0, max_features=1.0)), ("Gradient Boosting", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)), ("Gaussian Naive Bayes", GaussianNB()), ("SVM", SVC(probability=True, C=1.0, kernel='rbf')), ]
# Placeholder for results
evaluation_results = []
# 1. Passing the model name
model_name = "Logistic Regression"
# 2. model parameters
model = LogisticRegression(max_iter=1000, C=1.0, solver="lbfgs")
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
LogisticRegression(max_iter=1000) ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ Logistic Regression ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.031914893617021274 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ 0.14860784971486074 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[266 5] ┃ ┃ ┃ [ 4 7]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 266 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 5 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 4 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 7 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9680851063829787 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.8436766185843676 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 0.9679389312977099 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9680851063829787 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.99 0.98 0.98 271 1 0.58 0.64 0.61 11 accuracy 0.97 282 macro avg 0.78 0.81 0.80 282 weighted avg 0.97 0.97 0.97 282
***************************************************************************
---------------------------------------------------------------------------
LogisticRegression(max_iter=1000)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LogisticRegression(max_iter=1000)
# 1. Passing the model name
model_name = "k-Nearest Neighbors"
# 2. model parameters
model = KNeighborsClassifier(n_neighbors=5, weights="uniform")
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
KNeighborsClassifier() ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ k-Nearest Neighbors ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.04609929078014184 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ -0.22978866152297894 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[264 7] ┃ ┃ ┃ [ 6 5]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 264 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 7 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 6 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 5 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9539007092198581 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.8297551157329754 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 0.9526717557251908 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9539007092198581 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.98 0.97 0.98 271 1 0.42 0.45 0.43 11 accuracy 0.95 282 macro avg 0.70 0.71 0.71 282 weighted avg 0.96 0.95 0.95 282
***************************************************************************
---------------------------------------------------------------------------
KNeighborsClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KNeighborsClassifier()
# 1. Passing the model name
model_name = "Decision Tree"
# 2. model parameters
model = DecisionTreeClassifier(
max_depth=None, min_samples_split=2, min_samples_leaf=1)
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
DecisionTreeClassifier() ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ Decision Tree ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.04609929078014184 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ -0.22978866152297894 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[264 7] ┃ ┃ ┃ [ 6 5]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 264 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 7 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 6 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 5 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9539007092198581 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.7143575981214358 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 1.0 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9539007092198581 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.98 0.97 0.98 271 1 0.42 0.45 0.43 11 accuracy 0.95 282 macro avg 0.70 0.71 0.71 282 weighted avg 0.96 0.95 0.95 282
***************************************************************************
---------------------------------------------------------------------------
DecisionTreeClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
DecisionTreeClassifier()
# 1. Passing the model name
model_name = "Random Forest"
# 2. model parameters
model = RandomForestClassifier(
n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1
)
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
RandomForestClassifier() ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ Random Forest ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.031914893617021274 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ 0.14860784971486074 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[270 1] ┃ ┃ ┃ [ 8 3]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 270 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 1 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 8 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 3 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9680851063829787 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.9050654142905066 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 1.0 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9680851063829787 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.97 1.00 0.98 271 1 0.75 0.27 0.40 11 accuracy 0.97 282 macro avg 0.86 0.63 0.69 282 weighted avg 0.96 0.97 0.96 282
***************************************************************************
---------------------------------------------------------------------------
RandomForestClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier()
# 1. Passing the model name
model_name = "AdaBoost"
# 2. model parameters
model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
AdaBoostClassifier() ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ AdaBoost ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.03546099290780142 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ 0.054008721905400736 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[268 3] ┃ ┃ ┃ [ 7 4]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 268 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 3 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 7 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 4 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9645390070921985 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.7926870177792688 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 1.0 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9645390070921985 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.97 0.99 0.98 271 1 0.57 0.36 0.44 11 accuracy 0.96 282 macro avg 0.77 0.68 0.71 282 weighted avg 0.96 0.96 0.96 282
***************************************************************************
---------------------------------------------------------------------------
AdaBoostClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AdaBoostClassifier()
# 1. Passing the model name
model_name = "Bagging"
# 2. model parameters
model = BaggingClassifier(n_estimators=10, max_samples=1.0, max_features=1.0)
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
BaggingClassifier() ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ Bagging ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.04609929078014184 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ -0.22978866152297894 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[267 4] ┃ ┃ ┃ [ 9 2]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 267 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 4 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 9 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 2 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9539007092198581 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.8596108688359612 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 0.9969465648854962 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9539007092198581 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.97 0.99 0.98 271 1 0.33 0.18 0.24 11 accuracy 0.95 282 macro avg 0.65 0.58 0.61 282 weighted avg 0.94 0.95 0.95 282
***************************************************************************
---------------------------------------------------------------------------
BaggingClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
BaggingClassifier()
# 1. Passing the model name
model_name = "Gradient Boosting"
# 2. model parameters
model = GradientBoostingClassifier(
n_estimators=100, learning_rate=0.1, max_depth=3)
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
GradientBoostingClassifier() ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ Gradient Boosting ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.031914893617021274 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ 0.14860784971486074 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[268 3] ┃ ┃ ┃ [ 6 5]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 268 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 3 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 6 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 5 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9680851063829787 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.858269037235827 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 1.0 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9680851063829787 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.98 0.99 0.98 271 1 0.62 0.45 0.53 11 accuracy 0.97 282 macro avg 0.80 0.72 0.75 282 weighted avg 0.96 0.97 0.97 282
***************************************************************************
---------------------------------------------------------------------------
GradientBoostingClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
GradientBoostingClassifier()
# 1. Passing the model name
model_name = "Gaussian Naive Bayes"
# 2. model parameters
model = GaussianNB()
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
GaussianNB() ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ Gaussian Naive Bayes ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.0851063829787234 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ -1.2703790674270383 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[251 20] ┃ ┃ ┃ [ 4 7]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 251 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 20 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 4 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 7 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9148936170212766 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.731969137873197 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 0.9267175572519084 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9148936170212766 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.98 0.93 0.95 271 1 0.26 0.64 0.37 11 accuracy 0.91 282 macro avg 0.62 0.78 0.66 282 weighted avg 0.96 0.91 0.93 282
***************************************************************************
---------------------------------------------------------------------------
GaussianNB()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
GaussianNB()
# 1. Passing the model name
model_name = "Support Vector Machine"
# 2. model parameters
model = SVC(probability=True, C=1.0, kernel="rbf")
# 3.1: Fit the Model
model.fit(X_train, y_train)
print("\n", model, "\n")
# 3.2: Score Calculation
train_score, test_score = calculate_scores(
model, X_train, y_train, X_test, y_test)
# 3.3: Evaluate and Store Results
results = evaluate_model(model, model_name, X_test, y_test)
evaluation_results.append(results)
# Section 3.3: Plot ROC curve
print("*" * 75)
plot_roc_curve(model, X_test, y_test)
print("-" * 75)
# Model Detail
model
SVC(probability=True) ┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Metric ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Model ┃ Support Vector Machine ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Mean Squared Error ┃ 0.03900709219858156 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ R-squared Score ┃ -0.04059040590405916 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Confusion Matrix ┃ [[271 0] ┃ ┃ ┃ [ 11 0]] ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Positive ┃ 271 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Negative ┃ 0 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ False Positive ┃ 11 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ True Negative ┃ 0 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Accuracy ┃ 0.9609929078014184 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ AUC ┃ 0.9389466621938947 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Train Score ┃ 0.9603053435114504 ┃ ┣━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ Test Score ┃ 0.9609929078014184 ┃ ┗━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┛ Classification Report: precision recall f1-score support 0 0.96 1.00 0.98 271 1 0.00 0.00 0.00 11 accuracy 0.96 282 macro avg 0.48 0.50 0.49 282 weighted avg 0.92 0.96 0.94 282
***************************************************************************
---------------------------------------------------------------------------
SVC(probability=True)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
SVC(probability=True)
print(f"\nTotal models used & Evaluated : {len(evaluation_results)} & \nSaved parameters in each results :{len(results)}")
for result in evaluation_results:
print("\n", result)
Total models used & Evaluated : 9 & Saved parameters in each results :15 {'Model': 'Logistic Regression', 'Mean Squared Error': 0.031914893617021274, 'R-squared Score': 0.14860784971486074, 'True Positive': 266, 'False Negative': 5, 'False Positive': 4, 'True Negative': 7, 'Accuracy': 0.9680851063829787, 'AUC': 0.8436766185843676, 'ROC Curve FPR': array([0. , 0. , 0. , 0.00369004, 0.00369004, 0.11439114, 0.11439114, 0.32472325, 0.32472325, 0.35424354, 0.35424354, 0.91512915, 0.91512915, 1. ]), 'ROC Curve TPR': array([0. , 0.09090909, 0.36363636, 0.36363636, 0.63636364, 0.63636364, 0.72727273, 0.72727273, 0.81818182, 0.81818182, 0.90909091, 0.90909091, 1. , 1. ]), 'AUC Value': 0.8436766185843676, 'Confusion Matrix': array([[266, 5], [ 4, 7]]), 'Train Score': {0.9679389312977099}, 'Test Score': {0.9680851063829787}} {'Model': 'k-Nearest Neighbors', 'Mean Squared Error': 0.04609929078014184, 'R-squared Score': -0.22978866152297894, 'True Positive': 264, 'False Negative': 7, 'False Positive': 6, 'True Negative': 5, 'Accuracy': 0.9539007092198581, 'AUC': 0.8297551157329754, 'ROC Curve FPR': array([0. , 0.00369004, 0.02583026, 0.05166052, 0.12177122, 1. ]), 'ROC Curve TPR': array([0. , 0. , 0.45454545, 0.72727273, 0.72727273, 1. ]), 'AUC Value': 0.8297551157329754, 'Confusion Matrix': array([[264, 7], [ 6, 5]]), 'Train Score': {0.9526717557251908}, 'Test Score': {0.9539007092198581}} {'Model': 'Decision Tree', 'Mean Squared Error': 0.04609929078014184, 'R-squared Score': -0.22978866152297894, 'True Positive': 264, 'False Negative': 7, 'False Positive': 6, 'True Negative': 5, 'Accuracy': 0.9539007092198581, 'AUC': 0.7143575981214358, 'ROC Curve FPR': array([0. , 0.02583026, 1. ]), 'ROC Curve TPR': array([0. , 0.45454545, 1. ]), 'AUC Value': 0.7143575981214358, 'Confusion Matrix': array([[264, 7], [ 6, 5]]), 'Train Score': {1.0}, 'Test Score': {0.9539007092198581}} {'Model': 'Random Forest', 'Mean Squared Error': 0.031914893617021274, 'R-squared Score': 0.14860784971486074, 'True Positive': 270, 'False Negative': 1, 'False Positive': 8, 'True Negative': 3, 'Accuracy': 0.9680851063829787, 'AUC': 0.9050654142905066, 'ROC Curve FPR': array([0. , 0.00369004, 0.00369004, 0.00738007, 0.01476015, 0.01476015, 0.01476015, 0.01845018, 0.01845018, 0.03690037, 0.03690037, 0.04428044, 0.04797048, 0.05904059, 0.08856089, 0.099631 , 0.11808118, 0.12915129, 0.15498155, 0.18819188, 0.21402214, 0.25092251, 0.3099631 , 0.42804428, 0.60885609, 1. ]), 'ROC Curve TPR': array([0. , 0. , 0.27272727, 0.27272727, 0.27272727, 0.45454545, 0.54545455, 0.54545455, 0.63636364, 0.63636364, 0.72727273, 0.72727273, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.90909091, 1. , 1. ]), 'AUC Value': 0.9050654142905066, 'Confusion Matrix': array([[270, 1], [ 8, 3]]), 'Train Score': {1.0}, 'Test Score': {0.9680851063829787}} {'Model': 'AdaBoost', 'Mean Squared Error': 0.03546099290780142, 'R-squared Score': 0.054008721905400736, 'True Positive': 268, 'False Negative': 3, 'False Positive': 7, 'True Negative': 4, 'Accuracy': 0.9645390070921985, 'AUC': 0.7926870177792688, 'ROC Curve FPR': array([0. , 0.00369004, 0.00369004, 0.00738007, 0.00738007, 0.08487085, 0.08487085, 0.28413284, 0.28413284, 0.33579336, 0.33579336, 0.35424354, 0.35424354, 0.36531365, 0.36531365, 0.36900369, 0.36900369, 0.46494465, 0.46494465, 0.87822878, 0.88560886, 1. ]), 'ROC Curve TPR': array([0. , 0. , 0.18181818, 0.18181818, 0.36363636, 0.36363636, 0.45454545, 0.45454545, 0.54545455, 0.54545455, 0.63636364, 0.63636364, 0.72727273, 0.72727273, 0.81818182, 0.81818182, 0.90909091, 0.90909091, 1. , 1. , 1. , 1. ]), 'AUC Value': 0.7926870177792688, 'Confusion Matrix': array([[268, 3], [ 7, 4]]), 'Train Score': {1.0}, 'Test Score': {0.9645390070921985}} {'Model': 'Bagging', 'Mean Squared Error': 0.04609929078014184, 'R-squared Score': -0.22978866152297894, 'True Positive': 267, 'False Negative': 4, 'False Positive': 9, 'True Negative': 2, 'Accuracy': 0.9539007092198581, 'AUC': 0.8596108688359612, 'ROC Curve FPR': array([0. , 0. , 0. , 0.01476015, 0.02214022, 0.0295203 , 0.05535055, 0.099631 , 0.19557196, 1. ]), 'ROC Curve TPR': array([0. , 0.09090909, 0.18181818, 0.18181818, 0.45454545, 0.54545455, 0.63636364, 0.72727273, 0.81818182, 1. ]), 'AUC Value': 0.8596108688359612, 'Confusion Matrix': array([[267, 4], [ 9, 2]]), 'Train Score': {0.9969465648854962}, 'Test Score': {0.9539007092198581}} {'Model': 'Gradient Boosting', 'Mean Squared Error': 0.031914893617021274, 'R-squared Score': 0.14860784971486074, 'True Positive': 268, 'False Negative': 3, 'False Positive': 6, 'True Negative': 5, 'Accuracy': 0.9680851063829787, 'AUC': 0.858269037235827, 'ROC Curve FPR': array([0. , 0. , 0.00369004, 0.00369004, 0.00738007, 0.00738007, 0.01107011, 0.01107011, 0.02214022, 0.02214022, 0.03690037, 0.03690037, 0.07749077, 0.07749077, 0.10332103, 0.10332103, 0.39852399, 0.41328413, 0.43173432, 0.44280443, 0.44649446, 0.45756458, 0.46494465, 0.4797048 , 0.48708487, 0.50184502, 0.52398524, 0.5498155 , 0.58302583, 0.59409594, 0.61254613, 0.62361624, 0.62730627, 0.65313653, 0.66420664, 0.67158672, 0.71586716, 0.72324723, 0.73062731, 0.74169742, 0.83763838, 0.84132841, 0.84870849, 0.8597786 , 0.86715867, 0.87822878, 0.88929889, 0.89667897, 0.91512915, 0.95940959, 0.96309963, 0.97416974, 0.98154982, 0.99630996, 1. ]), 'ROC Curve TPR': array([0. , 0.09090909, 0.09090909, 0.18181818, 0.18181818, 0.36363636, 0.36363636, 0.45454545, 0.45454545, 0.54545455, 0.54545455, 0.63636364, 0.63636364, 0.72727273, 0.72727273, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.81818182, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 0.90909091, 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. ]), 'AUC Value': 0.858269037235827, 'Confusion Matrix': array([[268, 3], [ 6, 5]]), 'Train Score': {1.0}, 'Test Score': {0.9680851063829787}} {'Model': 'Gaussian Naive Bayes', 'Mean Squared Error': 0.0851063829787234, 'R-squared Score': -1.2703790674270383, 'True Positive': 251, 'False Negative': 20, 'False Positive': 4, 'True Negative': 7, 'Accuracy': 0.9148936170212766, 'AUC': 0.731969137873197, 'ROC Curve FPR': array([0. , 0. , 0. , 0.00369004, 0.00369004, 0.01107011, 0.01107011, 0.01476015, 0.01476015, 0.02214022, 0.02214022, 0.46494465, 0.46494465, 0.46863469, 0.46863469, 0.97785978, 0.97785978, 0.98523985, 0.98523985, 1. ]), 'ROC Curve TPR': array([0. , 0.09090909, 0.27272727, 0.27272727, 0.36363636, 0.36363636, 0.45454545, 0.45454545, 0.54545455, 0.54545455, 0.63636364, 0.63636364, 0.72727273, 0.72727273, 0.81818182, 0.81818182, 0.90909091, 0.90909091, 1. , 1. ]), 'AUC Value': 0.731969137873197, 'Confusion Matrix': array([[251, 20], [ 4, 7]]), 'Train Score': {0.9267175572519084}, 'Test Score': {0.9148936170212766}} {'Model': 'Support Vector Machine', 'Mean Squared Error': 0.03900709219858156, 'R-squared Score': -0.04059040590405916, 'True Positive': 271, 'False Negative': 0, 'False Positive': 11, 'True Negative': 0, 'Accuracy': 0.9609929078014184, 'AUC': 0.9389466621938947, 'ROC Curve FPR': array([0. , 0. , 0.00369004, 0.00369004, 0.00738007, 0.00738007, 0.01845018, 0.01845018, 0.0295203 , 0.0295203 , 0.04428044, 0.04428044, 0.05166052, 0.05166052, 0.16605166, 0.16605166, 0.30258303, 0.30258303, 1. ]), 'ROC Curve TPR': array([0. , 0.09090909, 0.09090909, 0.27272727, 0.27272727, 0.36363636, 0.36363636, 0.45454545, 0.45454545, 0.54545455, 0.54545455, 0.72727273, 0.72727273, 0.81818182, 0.81818182, 0.90909091, 0.90909091, 1. , 1. ]), 'AUC Value': 0.9389466621938947, 'Confusion Matrix': array([[271, 0], [ 11, 0]]), 'Train Score': {0.9603053435114504}, 'Test Score': {0.9609929078014184}}
model_performance = pd.DataFrame(evaluation_results)
model_performance
Model | Mean Squared Error | R-squared Score | True Positive | False Negative | False Positive | True Negative | Accuracy | AUC | ROC Curve FPR | ROC Curve TPR | AUC Value | Confusion Matrix | Train Score | Test Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Logistic Regression | 0.031915 | 0.148608 | 266 | 5 | 4 | 7 | 0.968085 | 0.843677 | [0.0, 0.0, 0.0, 0.0036900369003690036, 0.00369... | [0.0, 0.09090909090909091, 0.36363636363636365... | 0.843677 | [[266, 5], [4, 7]] | {0.9679389312977099} | {0.9680851063829787} |
1 | k-Nearest Neighbors | 0.046099 | -0.229789 | 264 | 7 | 6 | 5 | 0.953901 | 0.829755 | [0.0, 0.0036900369003690036, 0.025830258302583... | [0.0, 0.0, 0.45454545454545453, 0.727272727272... | 0.829755 | [[264, 7], [6, 5]] | {0.9526717557251908} | {0.9539007092198581} |
2 | Decision Tree | 0.046099 | -0.229789 | 264 | 7 | 6 | 5 | 0.953901 | 0.714358 | [0.0, 0.025830258302583026, 1.0] | [0.0, 0.45454545454545453, 1.0] | 0.714358 | [[264, 7], [6, 5]] | {1.0} | {0.9539007092198581} |
3 | Random Forest | 0.031915 | 0.148608 | 270 | 1 | 8 | 3 | 0.968085 | 0.905065 | [0.0, 0.0036900369003690036, 0.003690036900369... | [0.0, 0.0, 0.2727272727272727, 0.2727272727272... | 0.905065 | [[270, 1], [8, 3]] | {1.0} | {0.9680851063829787} |
4 | AdaBoost | 0.035461 | 0.054009 | 268 | 3 | 7 | 4 | 0.964539 | 0.792687 | [0.0, 0.0036900369003690036, 0.003690036900369... | [0.0, 0.0, 0.18181818181818182, 0.181818181818... | 0.792687 | [[268, 3], [7, 4]] | {1.0} | {0.9645390070921985} |
5 | Bagging | 0.046099 | -0.229789 | 267 | 4 | 9 | 2 | 0.953901 | 0.859611 | [0.0, 0.0, 0.0, 0.014760147601476014, 0.022140... | [0.0, 0.09090909090909091, 0.18181818181818182... | 0.859611 | [[267, 4], [9, 2]] | {0.9969465648854962} | {0.9539007092198581} |
6 | Gradient Boosting | 0.031915 | 0.148608 | 268 | 3 | 6 | 5 | 0.968085 | 0.858269 | [0.0, 0.0, 0.0036900369003690036, 0.0036900369... | [0.0, 0.09090909090909091, 0.09090909090909091... | 0.858269 | [[268, 3], [6, 5]] | {1.0} | {0.9680851063829787} |
7 | Gaussian Naive Bayes | 0.085106 | -1.270379 | 251 | 20 | 4 | 7 | 0.914894 | 0.731969 | [0.0, 0.0, 0.0, 0.0036900369003690036, 0.00369... | [0.0, 0.09090909090909091, 0.2727272727272727,... | 0.731969 | [[251, 20], [4, 7]] | {0.9267175572519084} | {0.9148936170212766} |
8 | Support Vector Machine | 0.039007 | -0.040590 | 271 | 0 | 11 | 0 | 0.960993 | 0.938947 | [0.0, 0.0, 0.0036900369003690036, 0.0036900369... | [0.0, 0.09090909090909091, 0.09090909090909091... | 0.938947 | [[271, 0], [11, 0]] | {0.9603053435114504} | {0.9609929078014184} |
# Change the below metrics as needed say we only want to get the model on accuracy but then change the sorting parameter accordingly
selected_metrics_for_comparison = [
"Mean Squared Error",
"R-squared Score",
"Accuracy",
"AUC",
"Precision",
"Recall",
"F1-score",
"True Positive",
"True Negative",
"False Positive",
"False Negative",
]
# Create a DataFrame to compare the models
df_results = pd.DataFrame(evaluation_results)
# Calculate additional metrics: Precision, Recall, F1-score, True Positive, True Negative, False Positive, False Negative
df_results["Precision"] = df_results["True Positive"] / \
(df_results["True Positive"] + df_results["False Positive"])
df_results["Recall"] = df_results["True Positive"] / \
(df_results["True Positive"] + df_results["False Negative"])
df_results["F1-score"] = 2 * (df_results["Precision"] * df_results["Recall"]) / \
(df_results["Precision"] + df_results["Recall"])
# Sort the DataFrame based on the chosen metrics (lower is better for MSE, higher for others)
df_results_sorted = df_results.sort_values(
by=selected_metrics_for_comparison, ascending=[ True, False, False, False, False, False, False, False, True, True, False])
# Display the comparison table
print("\nModel Comparison:")
print(tabulate(df_results_sorted[["Model"]+selected_metrics_for_comparison + ["Train Score", "Test Score"]], headers="keys", tablefmt="heavy_grid"))
# Select the best model based on the chosen metrics
best_models = {}
best_models_table = [] # Table to store the best models in tabular format
for metric in selected_metrics_for_comparison:
best_model_idx = df_results_sorted[metric].idxmin(
) if "Error" in metric else df_results_sorted[metric].idxmax()
best_models[metric] = df_results_sorted.loc[best_model_idx, "Model"]
best_model_name = best_models[metric]
best_model_value = df_results_sorted.loc[best_model_idx, metric]
best_models_table.append([f"Best in ({metric})", best_model_name, best_model_value])
# Display the best models in tabular format
print("\nBest Models:")
print(tabulate(best_models_table, headers=["Metric", "Model Name", "Value"], tablefmt="heavy_grid"))
# Overall Best Model based on a consensus of multiple metrics
consensus_metrics = set(selected_metrics_for_comparison)
overall_best_model_idx = df_results_sorted[selected_metrics_for_comparison].mean(axis=1).idxmax()
overall_best_model = df_results_sorted.loc[overall_best_model_idx, "Model"]
print(f"\nOverall Best Model based on {', '.join(consensus_metrics)}: '{overall_best_model}'")
Model Comparison: ┏━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ ┃ Model ┃ Mean Squared Error ┃ R-squared Score ┃ Accuracy ┃ AUC ┃ Precision ┃ Recall ┃ F1-score ┃ True Positive ┃ True Negative ┃ False Positive ┃ False Negative ┃ Train Score ┃ Test Score ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 3 ┃ Random Forest ┃ 0.0319149 ┃ 0.148608 ┃ 0.968085 ┃ 0.905065 ┃ 0.971223 ┃ 0.99631 ┃ 0.983607 ┃ 270 ┃ 3 ┃ 8 ┃ 1 ┃ {1.0} ┃ {0.9680851063829787} ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 6 ┃ Gradient Boosting ┃ 0.0319149 ┃ 0.148608 ┃ 0.968085 ┃ 0.858269 ┃ 0.978102 ┃ 0.98893 ┃ 0.983486 ┃ 268 ┃ 5 ┃ 6 ┃ 3 ┃ {1.0} ┃ {0.9680851063829787} ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 0 ┃ Logistic Regression ┃ 0.0319149 ┃ 0.148608 ┃ 0.968085 ┃ 0.843677 ┃ 0.985185 ┃ 0.98155 ┃ 0.983364 ┃ 266 ┃ 7 ┃ 4 ┃ 5 ┃ {0.9679389312977099} ┃ {0.9680851063829787} ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 4 ┃ AdaBoost ┃ 0.035461 ┃ 0.0540087 ┃ 0.964539 ┃ 0.792687 ┃ 0.974545 ┃ 0.98893 ┃ 0.981685 ┃ 268 ┃ 4 ┃ 7 ┃ 3 ┃ {1.0} ┃ {0.9645390070921985} ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 8 ┃ Support Vector Machine ┃ 0.0390071 ┃ -0.0405904 ┃ 0.960993 ┃ 0.938947 ┃ 0.960993 ┃ 1 ┃ 0.980108 ┃ 271 ┃ 0 ┃ 11 ┃ 0 ┃ {0.9603053435114504} ┃ {0.9609929078014184} ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 5 ┃ Bagging ┃ 0.0460993 ┃ -0.229789 ┃ 0.953901 ┃ 0.859611 ┃ 0.967391 ┃ 0.98524 ┃ 0.976234 ┃ 267 ┃ 2 ┃ 9 ┃ 4 ┃ {0.9969465648854962} ┃ {0.9539007092198581} ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 1 ┃ k-Nearest Neighbors ┃ 0.0460993 ┃ -0.229789 ┃ 0.953901 ┃ 0.829755 ┃ 0.977778 ┃ 0.97417 ┃ 0.97597 ┃ 264 ┃ 5 ┃ 6 ┃ 7 ┃ {0.9526717557251908} ┃ {0.9539007092198581} ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 2 ┃ Decision Tree ┃ 0.0460993 ┃ -0.229789 ┃ 0.953901 ┃ 0.714358 ┃ 0.977778 ┃ 0.97417 ┃ 0.97597 ┃ 264 ┃ 5 ┃ 6 ┃ 7 ┃ {1.0} ┃ {0.9539007092198581} ┃ ┣━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 7 ┃ Gaussian Naive Bayes ┃ 0.0851064 ┃ -1.27038 ┃ 0.914894 ┃ 0.731969 ┃ 0.984314 ┃ 0.926199 ┃ 0.954373 ┃ 251 ┃ 7 ┃ 4 ┃ 20 ┃ {0.9267175572519084} ┃ {0.9148936170212766} ┃ ┗━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛ Best Models: ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓ ┃ Metric ┃ Model Name ┃ Value ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (Mean Squared Error) ┃ Random Forest ┃ 0.0319149 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (R-squared Score) ┃ Random Forest ┃ 0.148608 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (Accuracy) ┃ Random Forest ┃ 0.968085 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (AUC) ┃ Support Vector Machine ┃ 0.938947 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (Precision) ┃ Logistic Regression ┃ 0.985185 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (Recall) ┃ Support Vector Machine ┃ 1 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (F1-score) ┃ Random Forest ┃ 0.983607 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (True Positive) ┃ Support Vector Machine ┃ 271 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (True Negative) ┃ Logistic Regression ┃ 7 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (False Positive) ┃ Support Vector Machine ┃ 11 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━┫ ┃ Best in (False Negative) ┃ Gaussian Naive Bayes ┃ 20 ┃ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━┛ Overall Best Model based on False Positive, Precision, F1-score, Mean Squared Error, True Positive, AUC, Recall, False Negative, Accuracy, True Negative, R-squared Score: 'Random Forest'
# Models to evaulate the name and relevant parameter to take for best fit model fitting
models_dict = {
"Logistic Regression": LogisticRegression(max_iter=1000, C=1.0, solver='lbfgs'),
"k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5, weights='uniform'),
"Decision Tree": DecisionTreeClassifier(max_depth=None, min_samples_split=2, min_samples_leaf=1),
"Random Forest": RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1),
"AdaBoost": AdaBoostClassifier(n_estimators=50, learning_rate=1.0),
"Bagging": BaggingClassifier(n_estimators=10, max_samples=1.0, max_features=1.0),
"Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3),
"Gaussian Naive Bayes": GaussianNB(),
"SVM": SVC(probability=True, C=1.0, kernel='rbf')
}
# Retrieve the value using the key
retrieved_value = models_dict.get(overall_best_model)
if retrieved_value is not None:
selected_model_name = overall_best_model
selected_model = retrieved_value
# selected_model_params = retrieved_value.get_params()
print(f"Best Model Name: {selected_model_name}")
print(f"\nRetrieved Model Instance: {selected_model}")
Best Model Name: Random Forest Retrieved Model Instance: RandomForestClassifier()
# Manually coding the best model name and changing the below parameter
# f_modelname = "Logistic Regression"
# final_model = LogisticRegression(max_iter=10000, C=1.0, solver="lbfgs")
# final_model.fit(x, y)
# Automating the above hardcoded values by using the above dictonary and for loop
f_modelname = selected_model_name
f_model = selected_model
print(f"Best Selected Model name : '{f_modelname}' & \nits parameters :\n{f_model.get_params()}")
final_model = f_model
final_model.fit(x, y)
Best Selected Model name : 'Random Forest' & its parameters : {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
RandomForestClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier()
# wb - write binary file
pickle.dump(final_model, open(f"{f_modelname}.pkl", "wb"))
load_model = pickle.load(open(f"{f_modelname}.pkl", "rb")) # rb = read binary
print(f"Name of loaded Model : {f_modelname}")
load_model
Name of loaded Model : Random Forest
RandomForestClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier()
# Testing the imported model
print("Length of test data: ", len(load_model.predict(X_test)))
load_model.predict(X_test)
Length of test data: 282
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
random_datasample = cleaned_df.sample(20)
random_datasample_df = random_datasample.drop("target", axis=1)
print(random_datasample_df.shape)
random_datasample_df.head()
(20, 33)
f_1 | f_2 | f_3 | f_4 | f_5 | f_7 | f_8 | f_9 | f_10 | f_11 | ... | f_33 | f_37 | f_38 | f_39 | f_41 | f_42 | f_45 | f_46 | f_47 | f_48 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
855 | 123 | 12 | 116.33 | 377.75 | 59 | 52.42 | 4.44 | 1011.0 | 0.09 | 96.1 | ... | 0.0 | 0.01 | 10.52 | 82 | 524.79 | 127.28 | 20.62 | 0 | 2475.04 | 65.88 |
148 | 139 | 56 | 1646.05 | 1534.18 | 55 | 31.73 | 5.42 | 1840.0 | 0.17 | 76.1 | ... | 0.0 | 0.01 | 24.18 | 78 | 721.11 | 223.61 | 6.12 | 0 | 3352.35 | 66.31 |
728 | 81 | 10 | 47.20 | 651.80 | 37 | 71.50 | 8.11 | 704.0 | 0.11 | 115.1 | ... | 0.0 | 0.01 | 6.12 | 102 | 402.49 | 0.00 | 0.00 | 1 | 4515.09 | 66.21 |
902 | 170 | 14 | 26.50 | 642.79 | 58 | 46.79 | 9.16 | 1048.0 | 0.20 | 108.2 | ... | 0.0 | 0.00 | 9.69 | 82 | 402.49 | 127.28 | 4.22 | 0 | 4548.47 | 66.19 |
434 | 2 | 6099 | 673.25 | 1730.74 | 13 | 25.60 | 8.10 | 61516.5 | 0.32 | 139.4 | ... | 0.0 | 0.01 | 441.23 | 133 | 0.00 | 0.00 | 0.00 | 0 | 13101.35 | 36.49 |
5 rows × 33 columns
random_datasample_df.reset_index()
index | f_1 | f_2 | f_3 | f_4 | f_5 | f_7 | f_8 | f_9 | f_10 | ... | f_33 | f_37 | f_38 | f_39 | f_41 | f_42 | f_45 | f_46 | f_47 | f_48 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 855 | 123 | 12 | 116.33 | 377.75 | 59 | 52.42 | 4.44 | 1011.0 | 0.09 | ... | 0.0 | 0.01 | 10.52 | 82 | 524.79 | 127.28 | 20.62 | 0 | 2475.04 | 65.88 |
1 | 148 | 139 | 56 | 1646.05 | 1534.18 | 55 | 31.73 | 5.42 | 1840.0 | 0.17 | ... | 0.0 | 0.01 | 24.18 | 78 | 721.11 | 223.61 | 6.12 | 0 | 3352.35 | 66.31 |
2 | 728 | 81 | 10 | 47.20 | 651.80 | 37 | 71.50 | 8.11 | 704.0 | 0.11 | ... | 0.0 | 0.01 | 6.12 | 102 | 402.49 | 0.00 | 0.00 | 1 | 4515.09 | 66.21 |
3 | 902 | 170 | 14 | 26.50 | 642.79 | 58 | 46.79 | 9.16 | 1048.0 | 0.20 | ... | 0.0 | 0.00 | 9.69 | 82 | 402.49 | 127.28 | 4.22 | 0 | 4548.47 | 66.19 |
4 | 434 | 2 | 6099 | 673.25 | 1730.74 | 13 | 25.60 | 8.10 | 61516.5 | 0.32 | ... | 0.0 | 0.01 | 441.23 | 133 | 0.00 | 0.00 | 0.00 | 0 | 13101.35 | 36.49 |
5 | 473 | 41 | 134 | 1260.22 | 1237.23 | 70 | 27.52 | 11.30 | 3374.5 | 0.41 | ... | 0.0 | 0.02 | 60.43 | 133 | 877.85 | 391.51 | 4.42 | 0 | 8095.91 | 36.86 |
6 | 409 | 146 | 111 | 827.05 | 1260.37 | 118 | 40.58 | 6.66 | 2980.0 | 0.16 | ... | 0.0 | 0.01 | 32.00 | 85 | 894.43 | 471.70 | 3.94 | 0 | 6277.01 | 66.03 |
7 | 96 | 86 | 86 | 769.73 | 1761.26 | 55 | 37.55 | 6.27 | 3090.0 | 0.17 | ... | 0.0 | 0.01 | 44.41 | 78 | 1400.89 | 180.28 | 14.93 | 1 | 15720.91 | 66.30 |
8 | 235 | 103 | 214 | 1186.12 | 969.47 | 145 | 31.31 | 6.94 | 6440.0 | 0.22 | ... | 0.0 | 0.01 | 77.52 | 64 | 1081.67 | 970.82 | 1.76 | 0 | 5037.66 | 65.94 |
9 | 362 | 82 | 71 | 104.75 | 1357.72 | 96 | 42.37 | 4.83 | 1710.0 | 0.11 | ... | 0.0 | 0.01 | 16.47 | 85 | 608.28 | 300.00 | 2.70 | 0 | 32773.88 | 65.97 |
10 | 808 | 76 | 16 | 19.00 | 584.00 | 62 | 50.12 | 7.80 | 1154.0 | 0.16 | ... | 0.0 | 0.01 | 10.28 | 82 | 649.00 | 127.28 | 10.20 | 1 | 3862.06 | 66.11 |
11 | 705 | 58 | 15 | 44.27 | 631.53 | 100 | 71.33 | 11.22 | 1191.0 | 0.16 | ... | 0.0 | 0.01 | 11.67 | 102 | 270.00 | 270.00 | 1.50 | 0 | 3452.52 | 66.18 |
12 | 558 | 64 | 75 | 1054.00 | 2724.57 | 131 | 26.72 | 5.83 | 1845.0 | 0.22 | ... | 0.0 | 0.02 | 32.28 | 143 | 0.00 | 0.00 | 0.00 | 0 | 24897.96 | 36.82 |
13 | 760 | 28 | 23 | 57.00 | 619.65 | 68 | 53.61 | 9.15 | 2075.0 | 0.17 | ... | 0.0 | 0.00 | 23.11 | 82 | 1053.42 | 180.00 | 12.88 | 1 | 4371.27 | 66.17 |
14 | 425 | 177 | 59 | 844.44 | 1106.71 | 101 | 40.80 | 7.73 | 1910.0 | 0.19 | ... | 0.0 | 0.01 | 24.73 | 85 | 509.90 | 304.14 | 2.45 | 0 | 4065.36 | 65.96 |
15 | 312 | 19 | 80 | 1001.17 | 876.95 | 7 | 39.35 | 9.33 | 1560.0 | 0.24 | ... | 0.0 | 0.00 | 12.17 | 85 | 608.28 | 350.00 | 2.35 | 0 | 3169.32 | 65.87 |
16 | 874 | 142 | 16 | 6.19 | 509.25 | 73 | 55.81 | 9.17 | 1228.0 | 0.16 | ... | 0.0 | 0.00 | 11.64 | 82 | 569.21 | 201.25 | 4.61 | 0 | 3807.97 | 66.01 |
17 | 748 | 16 | 22 | 31.23 | 412.77 | 62 | 49.77 | 8.49 | 1985.0 | 0.17 | ... | 0.0 | 0.01 | 22.11 | 82 | 1049.57 | 127.28 | 41.23 | 1 | 2312.51 | 65.89 |
18 | 812 | 80 | 10 | 73.30 | 231.60 | 74 | 48.00 | 7.24 | 884.0 | 0.15 | ... | 0.0 | 0.01 | 9.65 | 82 | 484.66 | 90.00 | 8.98 | 1 | 3644.29 | 65.67 |
19 | 159 | 150 | 119 | 1531.03 | 1772.45 | 158 | 37.89 | 8.56 | 3620.0 | 0.23 | ... | 0.0 | 0.01 | 44.05 | 78 | 474.34 | 696.42 | 1.20 | 0 | 4759.79 | 66.41 |
20 rows × 34 columns
random_datasample_df.to_csv("20_random_sample.csv", index=False)
testsample_df = pd.read_csv("20_random_sample.csv")
print(
"Shape of loaded sample dataframe:",
testsample_df.shape,
"\n\nSample Dataframe contents",
)
testsample_df
Shape of loaded sample dataframe: (20, 33) Sample Dataframe contents
f_1 | f_2 | f_3 | f_4 | f_5 | f_7 | f_8 | f_9 | f_10 | f_11 | ... | f_33 | f_37 | f_38 | f_39 | f_41 | f_42 | f_45 | f_46 | f_47 | f_48 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 123 | 12 | 116.33 | 377.75 | 59 | 52.42 | 4.44 | 1011.0 | 0.09 | 96.1 | ... | 0.0 | 0.01 | 10.52 | 82 | 524.79 | 127.28 | 20.62 | 0 | 2475.04 | 65.88 |
1 | 139 | 56 | 1646.05 | 1534.18 | 55 | 31.73 | 5.42 | 1840.0 | 0.17 | 76.1 | ... | 0.0 | 0.01 | 24.18 | 78 | 721.11 | 223.61 | 6.12 | 0 | 3352.35 | 66.31 |
2 | 81 | 10 | 47.20 | 651.80 | 37 | 71.50 | 8.11 | 704.0 | 0.11 | 115.1 | ... | 0.0 | 0.01 | 6.12 | 102 | 402.49 | 0.00 | 0.00 | 1 | 4515.09 | 66.21 |
3 | 170 | 14 | 26.50 | 642.79 | 58 | 46.79 | 9.16 | 1048.0 | 0.20 | 108.2 | ... | 0.0 | 0.00 | 9.69 | 82 | 402.49 | 127.28 | 4.22 | 0 | 4548.47 | 66.19 |
4 | 2 | 6099 | 673.25 | 1730.74 | 13 | 25.60 | 8.10 | 61516.5 | 0.32 | 139.4 | ... | 0.0 | 0.01 | 441.23 | 133 | 0.00 | 0.00 | 0.00 | 0 | 13101.35 | 36.49 |
5 | 41 | 134 | 1260.22 | 1237.23 | 70 | 27.52 | 11.30 | 3374.5 | 0.41 | 55.8 | ... | 0.0 | 0.02 | 60.43 | 133 | 877.85 | 391.51 | 4.42 | 0 | 8095.91 | 36.86 |
6 | 146 | 111 | 827.05 | 1260.37 | 118 | 40.58 | 6.66 | 2980.0 | 0.16 | 93.1 | ... | 0.0 | 0.01 | 32.00 | 85 | 894.43 | 471.70 | 3.94 | 0 | 6277.01 | 66.03 |
7 | 86 | 86 | 769.73 | 1761.26 | 55 | 37.55 | 6.27 | 3090.0 | 0.17 | 69.6 | ... | 0.0 | 0.01 | 44.41 | 78 | 1400.89 | 180.28 | 14.93 | 1 | 15720.91 | 66.30 |
8 | 103 | 214 | 1186.12 | 969.47 | 145 | 31.31 | 6.94 | 6440.0 | 0.22 | 83.1 | ... | 0.0 | 0.01 | 77.52 | 64 | 1081.67 | 970.82 | 1.76 | 0 | 5037.66 | 65.94 |
9 | 82 | 71 | 104.75 | 1357.72 | 96 | 42.37 | 4.83 | 1710.0 | 0.11 | 103.8 | ... | 0.0 | 0.01 | 16.47 | 85 | 608.28 | 300.00 | 2.70 | 0 | 32773.88 | 65.97 |
10 | 76 | 16 | 19.00 | 584.00 | 62 | 50.12 | 7.80 | 1154.0 | 0.16 | 112.3 | ... | 0.0 | 0.01 | 10.28 | 82 | 649.00 | 127.28 | 10.20 | 1 | 3862.06 | 66.11 |
11 | 58 | 15 | 44.27 | 631.53 | 100 | 71.33 | 11.22 | 1191.0 | 0.16 | 102.0 | ... | 0.0 | 0.01 | 11.67 | 102 | 270.00 | 270.00 | 1.50 | 0 | 3452.52 | 66.18 |
12 | 64 | 75 | 1054.00 | 2724.57 | 131 | 26.72 | 5.83 | 1845.0 | 0.22 | 57.2 | ... | 0.0 | 0.02 | 32.28 | 143 | 0.00 | 0.00 | 0.00 | 0 | 24897.96 | 36.82 |
13 | 28 | 23 | 57.00 | 619.65 | 68 | 53.61 | 9.15 | 2075.0 | 0.17 | 89.8 | ... | 0.0 | 0.00 | 23.11 | 82 | 1053.42 | 180.00 | 12.88 | 1 | 4371.27 | 66.17 |
14 | 177 | 59 | 844.44 | 1106.71 | 101 | 40.80 | 7.73 | 1910.0 | 0.19 | 77.2 | ... | 0.0 | 0.01 | 24.73 | 85 | 509.90 | 304.14 | 2.45 | 0 | 4065.36 | 65.96 |
15 | 19 | 80 | 1001.17 | 876.95 | 7 | 39.35 | 9.33 | 1560.0 | 0.24 | 128.2 | ... | 0.0 | 0.00 | 12.17 | 85 | 608.28 | 350.00 | 2.35 | 0 | 3169.32 | 65.87 |
16 | 142 | 16 | 6.19 | 509.25 | 73 | 55.81 | 9.17 | 1228.0 | 0.16 | 105.5 | ... | 0.0 | 0.00 | 11.64 | 82 | 569.21 | 201.25 | 4.61 | 0 | 3807.97 | 66.01 |
17 | 16 | 22 | 31.23 | 412.77 | 62 | 49.77 | 8.49 | 1985.0 | 0.17 | 89.8 | ... | 0.0 | 0.01 | 22.11 | 82 | 1049.57 | 127.28 | 41.23 | 1 | 2312.51 | 65.89 |
18 | 80 | 10 | 73.30 | 231.60 | 74 | 48.00 | 7.24 | 884.0 | 0.15 | 91.6 | ... | 0.0 | 0.01 | 9.65 | 82 | 484.66 | 90.00 | 8.98 | 1 | 3644.29 | 65.67 |
19 | 150 | 119 | 1531.03 | 1772.45 | 158 | 37.89 | 8.56 | 3620.0 | 0.23 | 82.2 | ... | 0.0 | 0.01 | 44.05 | 78 | 474.34 | 696.42 | 1.20 | 0 | 4759.79 | 66.41 |
20 rows × 33 columns
# making prediction on random data
predicted_data = load_model.predict(testsample_df)
print(f"The predicted data from {f_modelname} model:\n", predicted_data)
The predicted data from Random Forest model: [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
# Compare the actual data and predicted data
prediction_data = random_datasample.copy()
prediction_data["predicted_target"] = predicted_data
# Print the actual and predicted data
print(f"Actual Data and Predicted Data Comparision based on {f_modelname} model:\n")
# print(prediction_data[["target", "predicted_target"]])
comparision = {
"Actual Target": random_datasample["target"], "Predicted Target": predicted_data}
final_results = pd.DataFrame(comparision)
final_results
Actual Data and Predicted Data Comparision based on Random Forest model:
Actual Target | Predicted Target | |
---|---|---|
855 | 0 | 0 |
148 | 0 | 0 |
728 | 0 | 0 |
902 | 0 | 0 |
434 | 0 | 0 |
473 | 0 | 0 |
409 | 0 | 0 |
96 | 0 | 0 |
235 | 0 | 0 |
362 | 1 | 1 |
808 | 0 | 0 |
705 | 0 | 0 |
558 | 0 | 0 |
760 | 0 | 0 |
425 | 0 | 0 |
312 | 0 | 0 |
874 | 0 | 0 |
748 | 0 | 0 |
812 | 0 | 0 |
159 | 0 | 0 |
# Calculate the number of correct predictions
correct_predictions = (
prediction_data["predicted_target"] == prediction_data["target"]).sum()
# Calculate the percentage of correct predictions
percentage_correct_predictions = (
correct_predictions / len(prediction_data)) * 100
# Print the result
print(f"\nPercentage of Correct Predictions: {percentage_correct_predictions:.2f}%")
if (percentage_correct_predictions >= 90):
print(f"\nOur model based on '{f_modelname}' is well trained having prediction accuracy of {percentage_correct_predictions:.2f}%")
else:
print(f"Our model based on '{f_modelname}' needs to be trained more to achieve atleast 95% prediction accuracy from our current results : {percentage_correct_predictions:.2f}%")
Percentage of Correct Predictions: 100.00% Our model based on 'Random Forest' is well trained having prediction accuracy of 100.00%
# Saving the final results in a output file
final_results.to_csv('final_results.csv', index=False)
with open('final_results.txt', 'w') as f:
f.write(final_results.to_string())
with open('final_results.txt', 'a') as f:
f.write(f"\n\n---------------------------------------\nPrinting the results of our {f_modelname} prediction on random 20 data samples.")
f.write('\nNumber of correct predictions: {}\n'.format(sum(final_results['Actual Target'] == final_results['Predicted Target'])))
f.write('Percentage of correct predictions: {}%'.format(100 * sum(final_results['Actual Target'] == final_results['Predicted Target']) / len(final_results)))
# Print the result in output file
if (percentage_correct_predictions >= 90):
f.write(f"\nOur model based on '{f_modelname}' is well trained having prediction accuracy of {percentage_correct_predictions:.2f}%")
else:
f.write(f"Our model based on '{f_modelname}' needs to be trained more to achieve atleast 95% prediction accuracy from our current results : {percentage_correct_predictions:.2f}%")