numpy
pandas
scikit-learn
seaborn
catboost
Run the following cell to install the packages.
#
# Required Packages
# Run this cell to install required packages.
#
%pip install "catboost>=1.0" "matplotlib>=2.0" "numpy>=1.19" "pandas>=1.1" "scikit-learn>=0.22.2" "seaborn>=0.11"
import random as rnd
import warnings
import matplotlib.pyplot as plt
import numpy as np
# data analysis and wrangling
import pandas as pd
# visualization
import seaborn as sns
# load dataset
from catboost.datasets import titanic
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# Model traning & selection
from sklearn.model_selection import (
KFold,
cross_val_predict,
cross_val_score,
train_test_split,
)
# ML models
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
warnings.filterwarnings("ignore")
We start by acquiring datasets using catboost library.
We also combine these datasets into dataset
to run certain operations on both datasets together.
train_df, test_df = titanic()
dataset = {
"train": train_df,
"test": test_df,
}
# We need `PassengerId` feature to make submit file.
passenger_id = test_df["PassengerId"]
Survived
, Sex
, and Embarked
.Pclass
.Age
, Fare
.SibSp
, Parch
.Ticket
, Cabin
.Cabin
> Age
> Embarked
Cabin
> Age
>> Fare
Ticket
: It contains high ratio of duplicates.Cabin
: It has many null values(77%, 38% for train and test set).PassengerId
: It's not contribute to survival.Name
: Title(like 'Mr' or 'Mrs' ...) may contribute to survival.print("Preview the data")
display(train_df.head())
print("\n" + "=" * 40)
print("Check blank, null or empty values\n")
print("\tTrain")
display(train_df.isnull().sum() / len(train_df))
print("_" * 40)
print("\tTest")
display(test_df.isnull().sum() / len(train_df))
print("\n" + "=" * 40)
print("Check the distribution of numerical features")
display(train_df.describe())
print("\n" + "=" * 40)
print("Check the distribution of categorical features")
display(train_df.describe(include=["O"]))
print("\n" + "=" * 40)
print("Check the duplication rate of `Ticket` feature")
train_df.Ticket.duplicated().sum() / len(train_df)
Preview the data
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
======================================== Check blank, null or empty values Train
PassengerId 0.000000 Survived 0.000000 Pclass 0.000000 Name 0.000000 Sex 0.000000 Age 0.198653 SibSp 0.000000 Parch 0.000000 Ticket 0.000000 Fare 0.000000 Cabin 0.771044 Embarked 0.002245 dtype: float64
________________________________________ Test
PassengerId 0.000000 Pclass 0.000000 Name 0.000000 Sex 0.000000 Age 0.096521 SibSp 0.000000 Parch 0.000000 Ticket 0.000000 Fare 0.001122 Cabin 0.367003 Embarked 0.000000 dtype: float64
======================================== Check the distribution of numerical features
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
======================================== Check the distribution of categorical features
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
count | 891 | 891 | 891 | 204 | 889 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
freq | 1 | 577 | 7 | 4 | 644 |
======================================== Check the duplication rate of `Ticket` feature
0.2356902356902357
Pclass
: Passengers with Pclass
1 has a higher rate of surviving than other Pclasses.Sex
: Female
passengers are more likely to survive than male passengersSibSp
and Parch
: Survival rate differs depending on whether you are alone or when you have one other family member.pivot_features = ["Pclass", "Sex", "SibSp", "Parch"]
f, ax = plt.subplots(2, 4, figsize=(16, 10))
for i, pivot_feature in enumerate(pivot_features):
train_df[pivot_feature].value_counts().sort_index().plot.bar(x=pivot_feature, y="Survived", ax=ax[0, i])
sns.countplot(pivot_feature, hue="Survived", data=train_df, ax=ax[1, i])
ax[0, i].set_title(pivot_feature)
plt.show()
Age
by Survived
¶Age
by Pclass
x Survived
¶Pclass
Age
10 looks to be good irrespective of the Pclass
.Survived
rate by Embarked
x Pclass
x Sex
¶Pclass
1 & 2 irrespecitive of the Pclass
.Pclass
3 Passenegers as the survival rate for both men and women is very low.Fare
by Embarked
x Survived
¶Age
, Pclass
, Embarked
, Fare
feature to model training.Age
, Fare
feature.Age
, Embarked
, Fare
feature for missing valuesfig12, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.histplot(data=train_df, y="Age", hue="Survived", ax=ax[0])
ax[0].set_title("Fig 1 : Histogram of Age by Survived", size=20)
sns.violinplot(data=train_df, x="Pclass", y="Age", hue="Survived", split=True, ax=ax[1])
ax[1].set_title("Fig 2 : Distribution of Age by Pclass x Survived", size=20)
ax[1].set_yticks(range(0, 110, 10))
plt.show()
fig3 = sns.factorplot("Pclass", "Survived", data=train_df, hue="Sex", col="Embarked", palette="deep")
fig3.fig.subplots_adjust(top=0.8)
fig3.fig.suptitle("Fig 3 : Survived rate by Embarked x Pclass x Sex", size=20)
plt.show()
fig4 = sns.FacetGrid(train_df, col="Embarked", size=4, aspect=1.6)
fig4.map(sns.barplot, "Survived", "Fare", alpha=0.5, ci=None)
fig4.add_legend()
fig4.fig.subplots_adjust(top=0.8)
fig4.fig.suptitle("Fig 4 : Fare by Embarked x Survived", size=20)
plt.show()
Ticket
, Cabin
, PassengerId
for ds_name, ds in dataset.items():
ds.drop(["Ticket", "Cabin", "PassengerId"], axis=1, inplace=True)
dataset[ds_name] = ds
Title
¶Title
and test correlation between Title
and Survived
..
) within Name
feature.Master
or Mr
titles were female
.Miss
, Mrs
) or didTitle
feature for model training.for ds_name, ds in dataset.items():
ds["Title"] = ds.Name.str.extract(
" ([A-Za-z]+)\.",
expand=False,
)
ds["Title"] = ds["Title"].replace(
["Lady", "Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer", "Dona"],
"Rare",
)
ds["Title"] = ds["Title"].replace("Mlle", "Miss")
ds["Title"] = ds["Title"].replace("Ms", "Miss")
ds["Title"] = ds["Title"].replace("Mme", "Mrs")
ds["Title"] = ds["Title"].fillna(0)
ds = ds.drop(["Name"], axis=1)
dataset[ds_name] = ds
pd.crosstab(dataset["train"]["Title"], dataset["train"]["Sex"]).plot.bar()
plt.title("the number of samples by Title x Sex", size=15)
plt.show()
dataset["train"][["Title", "Survived"]].groupby(["Title"], as_index=False).mean().plot.bar(x="Title")
plt.title("the survival rate by Title", size=15)
plt.show()
dataset["train"][["Title", "Age"]].groupby(["Title"], as_index=False).mean().plot.bar(x="Title")
plt.title("the mean value of Age by Title", size=15)
plt.show()
display(dataset["train"].head())
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Mr |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Mrs |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Miss |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | Mrs |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Mr |
Age
, Embarked
, Fare
# Cross tab Pclass x Sex(X ~ P(Age | Sex, Pclass))
grid = sns.FacetGrid(dataset["train"], row="Pclass", col="Sex", size=2.2, aspect=1.6)
grid.map(plt.hist, "Age", alpha=0.5, bins=20)
grid.add_legend()
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle("Histogram of P(Age | Pclass, Sex)", size=15)
plt.show()
freq_port = dataset["train"].Embarked.dropna().mode()[0]
categ_sex = dataset["train"].Sex.unique()
for ds_name, ds in dataset.items():
# Complete Age feature with the median of X ~ P(Age | Sex, Pclass)
for i in range(2):
for j in range(3):
guess_df = ds[(ds["Sex"] == categ_sex[i]) & (ds["Pclass"] == j + 1)]["Age"].dropna()
ds.loc[(ds.Age.isnull()) & (ds.Sex == categ_sex[i]) & (ds.Pclass == j + 1), "Age"] = guess_df.median()
ds["Embarked"].fillna(freq_port, inplace=True)
ds["Fare"].fillna(ds["Fare"].dropna().median(), inplace=True)
dataset[ds_name] = ds
Age
, Fare
# Banding `Age`, `Fare` feature
dataset["train"]["AgeBand"] = pd.cut(dataset["train"]["Age"], 5)
dataset["train"][["AgeBand", "Survived"]].groupby(["AgeBand"], as_index=False,).mean().sort_values(
by="AgeBand",
ascending=True,
).plot.bar(x="AgeBand")
dataset["train"] = dataset["train"].drop(["AgeBand"], axis=1)
dataset["train"]["FareBand"] = pd.qcut(dataset["train"]["Fare"], 4)
dataset["train"][["FareBand", "Survived"]].groupby(["FareBand"], as_index=False,).mean().sort_values(
by="FareBand",
ascending=True,
).plot.bar(x="FareBand")
dataset["train"] = dataset["train"].drop(["FareBand"], axis=1)
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
sex_mapping = {"female": 1, "male": 0}
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
for ds_name, ds in dataset.items():
# Convert Numeric feature to Ordinal : `Age`, `Fare`
ds.loc[ds["Age"] <= 16, "Age"] = 0
ds.loc[(ds["Age"] > 16) & (ds["Age"] <= 32), "Age"] = 1
ds.loc[(ds["Age"] > 32) & (ds["Age"] <= 48), "Age"] = 2
ds.loc[(ds["Age"] > 48) & (ds["Age"] <= 64), "Age"] = 3
ds.loc[ds["Age"] > 64, "Age"]
ds["Age"] = ds["Age"].astype(int)
ds.loc[ds["Fare"] <= 7.91, "Fare"] = 0
ds.loc[(ds["Fare"] > 7.91) & (ds["Fare"] <= 14.454), "Fare"] = 1
ds.loc[(ds["Fare"] > 14.454) & (ds["Fare"] <= 31), "Fare"] = 2
ds.loc[ds["Fare"] > 31, "Fare"] = 3
ds["Fare"] = ds["Fare"].astype(int)
# Convert Categorical feature to numeric : `Title`, `Sex`
ds["Title"] = ds["Title"].map(title_mapping)
ds["Title"] = ds["Title"].fillna(0)
ds["Sex"] = ds["Sex"].map(sex_mapping).astype(int)
ds["Embarked"] = ds["Embarked"].map(embarked_mapping).astype(int)
dataset[ds_name] = ds
display(dataset["train"].head())
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 2 | 1 | 0 | 3 | 1 | 3 |
2 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 1 | 0 | 0 | 1 | 0 | 2 |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 2 | 1 | 0 | 3 | 0 | 3 |
4 | 0 | 3 | Allen, Mr. William Henry | 0 | 2 | 0 | 0 | 1 | 0 | 1 |
IsAlone
, Age*Class
¶for ds_name, ds in dataset.items():
# We can create a new feature for FamilySize which combines Parch and SibSp. This will enable us to drop Parch and SibSp from our datasets.
ds["FamilySize"] = ds["SibSp"] + ds["Parch"] + 1
# We can create another feature called IsAlone.
ds["IsAlone"] = 0
ds.loc[ds["FamilySize"] == 1, "IsAlone"] = 1
# Let us drop Parch, SibSp, and FamilySize features in favor of IsAlone.
ds = ds.drop(["Parch", "SibSp"], axis=1)
# We can also create an artificial feature combining Age and Pclass.
ds["Age*Class"] = ds.Age * ds.Pclass
dataset[ds_name] = ds
dataset["train"][["FamilySize", "Survived"]].groupby(
["FamilySize"],
as_index=False,
).mean().plot.bar(x="FamilySize")
dataset["train"][["IsAlone", "Survived"]].groupby(
["IsAlone"],
as_index=False,
).mean().plot.bar(x="IsAlone")
dataset["train"][["Age*Class", "Survived"]].groupby(
["Age*Class"],
as_index=False,
).mean().plot.bar(x="Age*Class")
for ds_name, ds in dataset.items():
ds = ds.drop(["FamilySize"], axis=1)
dataset[ds_name] = ds
train, val = train_test_split(dataset["train"], test_size=0.3, random_state=0, stratify=dataset["train"]["Survived"])
X_train = train.iloc[:, 1:]
Y_train = train.iloc[:, :1]
X_val = val.iloc[:, 1:]
Y_val = val.iloc[:, :1]
X = dataset["train"][dataset["train"].columns[1:]]
Y = dataset["train"]["Survived"]
clf_gsnb = GaussianNB()
clf_gsnb.fit(X_train, Y_train)
pred_gsnb = clf_gsnb.predict(X_val)
acc_gsnb = metrics.accuracy_score(pred_gsnb, Y_val)
print(f"The accuracy of the Gaussian NB is {acc_gsnb}")
The accuracy of the Gaussian NB is 0.7574626865671642
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_knn.fit(X_train, Y_train)
pred_knn = clf_knn.predict(X_val)
acc_knn = metrics.accuracy_score(pred_knn, Y_val)
print(f"The accuracy of the KNN is {acc_knn}")
search_range = list(range(1, 11))
a = pd.Series()
for i in search_range:
model = KNeighborsClassifier(n_neighbors=i)
model.fit(X_train, Y_train)
prediction = model.predict(X_val)
a = a.append(pd.Series(metrics.accuracy_score(prediction, Y_val)))
plt.plot(search_range, a)
plt.xticks(search_range)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()
print("Accuracies for different values of n are:", a.values, "with the max value as ", a.values.max())
del a, i, search_range, model, prediction, fig
The accuracy of the KNN is 0.7985074626865671
Accuracies for different values of n are: [0.75 0.76119403 0.79104478 0.76492537 0.79850746 0.76865672 0.79477612 0.78358209 0.7761194 0.79477612] with the max value as 0.7985074626865671
# Logistic Regression
clf_lg = LogisticRegression()
clf_lg.fit(X_train, Y_train)
pred_lg = clf_lg.predict(X_val)
acc_lg = metrics.accuracy_score(pred_lg, Y_val)
print(f"The accuracy of the Logistic Regression is {acc_lg}")
# Check Coefficients of the features
coeff_df = pd.DataFrame(dataset["train"].columns.delete(0))
coeff_df.columns = ["Feature"]
coeff_df["Correlation"] = pd.Series(clf_lg.coef_[0])
display(coeff_df.sort_values(by="Correlation", ascending=False))
The accuracy of the Logistic Regression is 0.8246268656716418
Feature | Correlation | |
---|---|---|
1 | Sex | 2.173073 |
5 | Title | 0.373390 |
6 | IsAlone | 0.255659 |
2 | Age | 0.231659 |
4 | Embarked | 0.157717 |
3 | Fare | 0.093260 |
7 | Age*Class | -0.249048 |
0 | Pclass | -0.676265 |
clf_rf = RandomForestClassifier(n_estimators=150, random_state=42)
clf_rf.fit(X_train, Y_train)
pred_rf = clf_rf.predict(X_val)
acc_rf = metrics.accuracy_score(pred_rf, Y_val)
print(f"The accuracy of the Random Forest is {acc_rf}")
search_range = list(range(100, 300, 50))
a = pd.Series()
for i in search_range:
model = RandomForestClassifier(n_estimators=i, random_state=42)
model.fit(X_train, Y_train)
prediction = model.predict(X_val)
a = a.append(pd.Series(metrics.accuracy_score(prediction, Y_val)))
plt.plot(search_range, a)
plt.xticks(search_range)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()
print("Accuracies for different values of n are:", a.values, "with the max value as ", a.values.max())
del a, i, search_range, model, prediction, fig
The accuracy of the Random Forest is 0.8134328358208955
Accuracies for different values of n are: [0.80223881 0.81343284 0.81343284 0.81343284] with the max value as 0.8134328358208955
kfold = KFold(n_splits=10) # k=10, split the data into 10 equal parts
cv_mean = []
accuracy = []
std = []
classifiers = ["Gaussian Naive Bayes", "KNN", "Logistic Regression", "Random Forest"]
model_list = [
GaussianNB(),
KNeighborsClassifier(n_neighbors=5),
LogisticRegression(),
RandomForestClassifier(n_estimators=150, random_state=42),
]
for model in model_list:
cv_result = cross_val_score(model, X, Y, cv=kfold, scoring="accuracy")
cv_mean.append(cv_result.mean())
std.append(cv_result.std())
accuracy.append(cv_result)
df_kfold_result = pd.DataFrame(
{"CV Mean": cv_mean, "Std": std},
index=classifiers,
)
display(df_kfold_result)
plt.subplots(figsize=(12, 6))
box = pd.DataFrame(accuracy, index=classifiers)
box.T.boxplot()
plt.show()
CV Mean | Std | |
---|---|---|
Gaussian Naive Bayes | 0.711573 | 0.071717 |
KNN | 0.801411 | 0.039478 |
Logistic Regression | 0.804707 | 0.038364 |
Random Forest | 0.800287 | 0.038103 |
f, ax = plt.subplots(2, 2, figsize=(10, 10))
y_pred = cross_val_predict(GaussianNB(), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[0, 0], annot=True, fmt="2.0f")
ax[0, 0].set_title("Matrix for Gaussian NB")
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[0, 1], annot=True, fmt="2.0f")
ax[0, 1].set_title("Matrix for KNN")
y_pred = cross_val_predict(LogisticRegression(), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[1, 0], annot=True, fmt="2.0f")
ax[1, 0].set_title("Matrix for Logistic Regression")
y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100), X, Y, cv=10)
sns.heatmap(confusion_matrix(Y, y_pred), ax=ax[1, 1], annot=True, fmt="2.0f")
ax[1, 1].set_title("Matrix for Random-Forests")
plt.subplots_adjust(hspace=0.2, wspace=0.2)
plt.show()
ensemble_clf = VotingClassifier(
estimators=list(zip(classifiers, model_list)),
voting="hard",
).fit(X_train, Y_train)
print("The accuracy for ensembled model is:", ensemble_clf.score(X_val, Y_val))
cross = cross_val_score(ensemble_clf, X, Y, cv=10, scoring="accuracy")
print("The cross validated score is", cross.mean())
The accuracy for ensembled model is: 0.8283582089552238 The cross validated score is 0.8092759051186018
Y_pred = ensemble_clf.predict(dataset["test"])
Y_pred
df_submission = pd.DataFrame({"PassengerId": passenger_id, "Survived": Y_pred})
display(df_submission.head())
df_submission.to_csv("submission.csv", index=False)
PassengerId | Survived | |
---|---|---|
0 | 892 | 0 |
1 | 893 | 0 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
Our submission to the competition site Kaggle results in scoring 3,199 of 13,802 competition entries.