Predicting Survival of Passengers on the RMS Titanic¶

In this notebook various data from the passengers aboard the RMS is loaded. This data is then analyzed for significant correlation with surviving the disaster. Ultimately, models are produced to predict the survivorship of passengers based on details of their passage.

Let's import the libraries we'll need.

In [2]:

import pandas as pd
import numpy as np
import os
from scipy import stats
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn import neighbors
from sklearn import preprocessing

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline 

matplotlib.rcParams.update({'font.size': 16})

The data is imported and a several rows are printed.

In [3]:

df = pd.read_csv('train.csv')
print(df.shape)
no_passengers = df.shape[0]
print(df.dtypes)
df[0:10]

(891, 12)
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Out[3]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

Three columns have NaN. Let's address the port of Embarkation as well as age.

In [4]:

len(df[df['Cabin'].isnull()])/len(df)

Out[4]:

0.7710437710437711

MISSING DATA¶

Port of Embarkation¶

We inspect the Embarked column for missing values.

In [5]:

df[df['Embarked'].isnull()]

Out[5]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	B28	NaN
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	B28	NaN

Two values are found. Let's build a model based on everyone else on board and see if we can guess where Miss Icard and Mrs. Stone got onboard. We assume that the passenger's fare, class, and where they embarked are related. Let's define the variables.

In [6]:

df2 = df.loc[:,['Pclass','Fare','Embarked']]
df2.dropna(inplace=True) #Dropping the target of our model

# Prepare X
X = np.array(df2.loc[:,['Pclass','Fare']]) # Casting to an np.arr
scaler = preprocessing.Normalizer().fit(X) 
X = scaler.transform(X)  # Scaling values

# Prepare y
ports = ['C','Q','S']
emb_dum = df2['Embarked'].replace(ports,[0,1,2]) # Transforming categorical data into numeric
y = np.array(emb_dum) # Casting to an np.arr

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15)

A classifier clf is defined and the data fit to it.

In [7]:

fig = plt.figure(figsize=(25,20))
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train) 

Out[7]:

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

<Figure size 1800x1440 with 0 Axes>

In [8]:

f1_score(y_test,clf.predict(X_test),average='weighted')

Out[8]:

0.9106667918270284

In [9]:

plt.figure(figsize=(8,4))
c_mat = confusion_matrix(y_test,clf.predict(X_test))
disp = ConfusionMatrixDisplay(confusion_matrix = c_mat, display_labels=ports)
disp.plot(values_format='d')
plt.savefig('./embarked_dt.png',bbox_inches='tight',dpi=300)

<Figure size 576x288 with 0 Axes>

In [10]:

dot_data = tree.export_graphviz(clf, out_file='embarked_dt.dot', 
                     feature_names=['Pclass','Fare'],
                    filled=True, rounded=True,  
                     special_characters=True)
graph = pgv.AGraph('./embarked_dt.dot')  
#graph.draw()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-3b4ff03972e8> in <module>
      3                     filled=True, rounded=True,
      4                      special_characters=True)
----> 5 graph = pgv.AGraph('./embarked_dt.dot')
      6 #graph.draw()

NameError: name 'pgv' is not defined

Predicting the port of our passengers:

In [11]:

x = scaler.transform(np.array([1,80]).reshape(1,-1)) # Both passengers paid 80 pounds for a first class ticket
print('[0,1,2]')
print(ports)
print(clf.predict(x))

[0,1,2]
['C', 'Q', 'S']
[2]

The model calculates the port of embarkation for both passengers to be Southampton. Southampton is imputed.

In [12]:

df['Embarked'].fillna('S', inplace=True)

Let's check to see if we missed anyone

In [13]:

df[df['Embarked'].isnull()]

Out[13]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked

We did not miss anyone. Let's now look towards age.

Age¶

In [14]:

df[df['Age'].isnull()]

Out[14]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	0	0	244373	13.0000	NaN	S
19	20	1	3	Masselmani, Mrs. Fatima	female	NaN	0	0	2649	7.2250	NaN	C
26	27	0	3	Emir, Mr. Farred Chehab	male	NaN	0	0	2631	7.2250	NaN	C
28	29	1	3	O'Dwyer, Miss. Ellen "Nellie"	female	NaN	0	0	330959	7.8792	NaN	Q
...	...	...	...	...	...	...	...	...	...	...	...	...
859	860	0	3	Razi, Mr. Raihed	male	NaN	0	0	2629	7.2292	NaN	C
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S

177 rows × 12 columns

Oh dear. There is a significant number of passengers whose age is unknown. Again we will build a model with the objective of calculating the age of passengers we don't know. This time a k-Neighbors regressor is used and fit to the passengers we have data for. First let's note the indicies of the passengers we will be targeting.

In [15]:

nullage = df[df['Age'].isnull()].index.tolist()
nullage[0:5]

Out[15]:

[5, 17, 19, 26, 28]

In [17]:

matplotlib.rcParams.update({'font.size': 20})

df2 = df.loc[:,['Pclass','Sex','SibSp','Parch','Fare','Embarked','Age']] # These are our variables of interest

# Quick and dirty: Categorical variables are replaced with an integer
df2['Sex'].replace(['female','male'],[0,1],inplace=True) 
df2['Embarked'].replace(['C','Q','S'],[0,1,2],inplace=True)

# A copy of out model inputs is made for graphing later
df3 = df2.copy()

# Remove NaNs and cast to np.arr
df2.dropna(inplace=True)
data = np.array(df2)

# Defining model inputs
X = data[:,:6]
y = data[:,6]

list_knn = []

its=1000
neighbor_range = range(1,20)
index = 0
for i in neighbor_range:
    RMSE_LIST= []
    for j in range(0,its):
    
        X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
        # Defining a classifier. 
        n_neighbors = i
        knn = neighbors.KNeighborsRegressor(n_neighbors,weights='distance',algorithm='brute',n_jobs=-1)
        knn.fit(X_train,y_train)

        E = np.subtract(knn.predict(X_test),y_test)
        SE = np.power(E,2)
        MSE = np.sum(SE/len(y_test))
        RMSE = np.sqrt(MSE)
        RMSE_LIST.append(RMSE)
        os.sys.stdout.write(f'{index}\r')
        index = index + 1
    
    list_knn.append(np.array(RMSE_LIST).mean())
print('\nDONE')
plt.figure(figsize=(10,7))
plt.plot(neighbor_range,list_knn)
plt.title('n Selection for Age Model')
plt.ylabel('RMSE (Years)')
plt.xlabel('n Neighbors')
plt.xticks(neighbor_range)
#plt.savefig('./age_rmse.png',bbox_inches='tight',dpi=300)
plt.show()

18999
DONE

In [19]:

df2 = df.loc[:,['Pclass','Sex','SibSp','Parch','Fare','Embarked','Age']] # These are our variables of interest

# Quick and dirty: Categorical variables are replaced with an integer
df2['Sex'].replace(['female','male'],[0,1],inplace=True) 
df2['Embarked'].replace(['C','Q','S'],[0,1,2],inplace=True)

# A copy of out model inputs is made for graphing later
df3 = df2.copy()

# Remove NaNs and cast to np.arr
df2.dropna(inplace=True)
data = np.array(df2)

# Defining model inputs
X = data[:,:6]
y = data[:,6]

# The inputs have many different scales.
# They are normalized.
scaler = preprocessing.Normalizer().fit(X)
T = np.array(df3[df3['Age'].isnull()])
T = scaler.transform(T[:,:6])
X = scaler.transform(X)
model_RMSE = []
for i in range (0,1000):
    X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

    # Defining a classifier. 
    n_neighbors = 4
    knn = neighbors.KNeighborsRegressor(n_neighbors,weights='distance',algorithm='brute',n_jobs=-1)
    knn.fit(X_train,y_train)

    E = np.subtract(knn.predict(X_test),y_test)
    SE = np.power(E,2)
    MSE = np.sum(SE)/len(y_test)
    RMSE = np.sqrt(MSE)
    model_RMSE.append(RMSE)
print(f'Model RMSE: {RMSE:.1f}')

# Plot the known ages with the modeled ages for comparison of shape.
plt.figure(figsize=(8,6))
plt.hist(df['Age'],color='C0',bins=np.arange(0,80,10))
plt.hist(knn.predict(X_test),color='C1',bins=np.arange(0,80,10))
plt.legend(['Known Ages','Estimated Ages'])
plt.title('Shape Comparison')
plt.ylabel('Frequency')
plt.xlabel('Age')
#plt.savefig('./agemodeldist.png', dpi=300)
# Replace unknown ages with modeled ones.
df.iloc[nullage,5] = knn.predict(T)

plt.show()

Model RMSE: 15.6

In [20]:

bins = np.linspace(-3,3,13)
plt.figure(figsize=(8,6))
#plt.hist(stats.zscore(model_RMSE),bins=bins)
plt.hist(model_RMSE)
plt.title('Model Accuracy')
plt.ylabel('Frequency\n(n=1000)')
plt.xlabel('RMSE (years)')
stats.describe(model_RMSE)
plt.savefig('agermsefreq.png',bbox_inches='tight',dpi=300)
print(np.mean(model_RMSE))
print(np.std(model_RMSE))

14.134353617595822
0.8135773875938206

The shapes of these distributions are visually consistent. The RMSE of the fit is 13 years or about half a human generation. There is no immediate reason to reject these results.

In [21]:

df[df['Age'].isnull()]

Out[21]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked

Every passenger now has an age and a port of Embarkation. Moving on, we begin to build our model of who survived.

An empty list is defined. As I move through each variable, those that have significant correlation with the Survived variable will be appended to the list.

In [22]:

model = [] 

BY SEX¶

An exploration is made into the correlation and significance of sex on surviving the shipwreck.

In [23]:

sexes = ['male','female']
bysex = np.zeros((2,2))
for i in range(0,len(sexes)):
    bysex[i] = df[df['Sex'] == sexes[i]]['Survived'].value_counts().sort_index()
bysex

Out[23]:

array([[468., 109.],
       [ 81., 233.]])

The first row of this array are all male. The second row is all female. The first column represents the number of passengers who did not survive. The second column represents the number of passengers who did survive. Looks like males are less likely to survive than females. Let's check rates

In [24]:

males_rate = bysex[0][1]/np.sum(bysex[0])
females_rate = bysex[1][1]/np.sum(bysex[1])
print('Survival rate of:')
print('\tmales:\t\t',f'{males_rate:.3f}')
print('\tfemales:\t',f'{females_rate:.3f}')

Survival rate of:
	males:		 0.189
	females:	 0.742

Indeed, there is quite a contrast between the two rates. Let's visualize survivorship.

In [25]:

labels = ['No Survive', 'Survive']
male = bysex[0]
female = bysex[1]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(10,8))
rects1 = ax.bar(x - width/2, male, width, label='Men')
rects2 = ax.bar(x + width/2, female, width, label='Women')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Passengers')
ax.set_title('Survival by Sex')
ax.set_xticks(x)
ax.set_ylim(0,500)
ax.set_xticklabels(labels)
ax.legend()

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)

plt.show()

There are more males on board than there are females. Let's do an apples to apples comparison and visualize rate.

In [26]:

labels = ['No Survive', 'Survive']
male = np.round(bysex[0]/np.sum(bysex[0]),2)
female = np.round(bysex[1]/np.sum(bysex[1]),2)

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(10,8))
rects1 = ax.bar(x - width/2, male, width, label='Male')
rects2 = ax.bar(x + width/2, female, width, label='Female')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Rate')
ax.set_title('Survival by Sex')
ax.set_xticks(x)
ax.set_ylim(0,1)
ax.set_xticklabels(labels)
ax.legend()

autolabel(rects1)
autolabel(rects2)

plt.show()

Now we calculate Pearson's r for being a male and surviving the disaster for inclusion in the final model.

In [27]:

df['male'] = df['Sex'] == 'male' # Create a column to indicate whether or not the passenger is male
df['male'].replace([True,False],[1,0],inplace=True) # Converting the results to a numeric value

# Define the input
X = np.array(df['male'])
y = np.array(df['Survived'])

# Calculate correlation
pear = stats.pearsonr(X,y)
print('is_male vs Survived')
print(f'\tCorrelation: {pear[0]:.2f}')
print(f'\tp-val: {pear[1]:.2e}')

is_male vs Survived
	Correlation: -0.54
	p-val: 1.41e-69

Discussion¶

There is a statistically significant indirect correlation between being a male and survival. Our model should include whether or not the passenger is male.

In [28]:

model.append('male')
model

Out[28]:

['male']

BY PORT of EMBARKATION¶

An exploration is made into the correlation and significance of port of embarkation on surviving the shipwreck. We define a list of all ports

In [29]:

ports = df['Embarked'].unique()
ports

Out[29]:

array(['S', 'C', 'Q'], dtype=object)

In [30]:

# Now we calculate and print the survival rate of passengers vs where the boarded.
byport = np.zeros((len(ports),2))
byport
for i in range(0,len(ports)):
    byport[i] = df[df['Embarked'] == ports[i]]['Survived'].value_counts().sort_index()
print('Survival rate of:')
for i in range(0,len(ports)):
    print(f'\t{ports[i]}: {byport[i][1]/np.sum(byport[i]):.2f}')

Survival rate of:
	S: 0.34
	C: 0.55
	Q: 0.39

In [31]:

# Visualizing these rates
labels = ['No Survive', 'Survive']
S = byport[0]
C = byport[1]
Q = byport[2]

x = np.arange(len(labels))  # the label locations
width = 0.2  # the width of the bars

fig, ax = plt.subplots(figsize=(10,8))
rects1 = ax.bar(x - width, S, width, label='Southampton')
rects2 = ax.bar(x , C, width, label='Cherbourg')
rects3 = ax.bar(x + width, Q, width, label='Queenstown')


# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Passengers')
ax.set_title('Survival by Port')
ax.set_xticks(x)
ax.set_ylim(0,500)
ax.set_xticklabels(labels)
ax.legend()

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)

plt.show()

In [32]:

# Now we do the same for the rate of surviving at each port.
labels = ['No Survive', 'Survive']
S = np.round(byport[0]/np.sum(byport[0]),2)
C = np.round(byport[1]/np.sum(byport[1]),2)
Q = np.round(byport[2]/np.sum(byport[2]),2)

x = np.arange(len(labels))  # the label locations
width = 0.2  # the width of the bars

fig, ax = plt.subplots(figsize=(10,8))
rects1 = ax.bar(x - width, S, width, label='Southampton')
rects2 = ax.bar(x , C, width, label='Cherbourg')
rects3 = ax.bar(x + width, Q, width, label='Queenstown')


# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('rate')
ax.set_title('Survival by Port')
ax.set_xticks(x)
ax.set_ylim(0,1)
ax.set_xticklabels(labels)
ax.legend()

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)

plt.show()

These are great pictures, but do they mean anything significant? Now I calculate the Pearson R correlation for the relationship between port and surviving.

In [33]:

portdum = pd.get_dummies(df['Embarked'])
ports = np.array(portdum.columns)
X = np.array(portdum)
for i in range(0,3):
    if stats.pearsonr(X[:,i],y)[0] < 0.05:
        print(f'Embarked {ports[i]}:')
        print(f'\tcorr {stats.pearsonr(X[:,i],y)[0]:.3e}')
        print(f'\tp-value {stats.pearsonr(X[:,i],y)[1]:.3e}')
        model.append(f'port_{ports[i]}')
        print()

Embarked Q:
	corr 3.650e-03
	p-value 9.134e-01

Embarked S:
	corr -1.497e-01
	p-value 7.223e-06

Discussion¶

These values suggest that embarking at Cherbourg has a statistically significant direct correlation with surviving. Southampton has a similar indirect correlation. There is no statisically significant correlation between embarking at Queenstown and survival. Our model should include whether the passenger embarked at Cherbourg or Southampton.

In [34]:

model

Out[34]:

['male', 'port_Q', 'port_S']

BY DECK¶

Each cabin number is prefixed with a letter which indicates on which deck the passenger's cabin was. The cabin of all passengers is not known so for those without a known cabin(and deck) number we will create a group.

In [35]:

df.loc[:,['Cabin','Survived']][0:5]

Out[35]:

	Cabin	Survived
0	NaN	0
1	C85	1
2	NaN	1
3	C123	1
4	NaN	0

In [36]:

# Replace NaN with '??' to indicate that the cabin number and deck for this passenger is unknown.
df['Cabin'].replace(np.nan,'??',inplace=True)

# A function to return which deck the passenger was on.
def get_deck(cabin):
    return str(cabin)[0]

df['deck'] = df['Cabin'].apply(get_deck)
decks = df['deck'].unique()
decks.sort()
print(decks)
df.loc[:,['deck','Survived']][0:5]

['?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'T']

Out[36]:

	deck	Survived
0	?	0
1	C	1
2	?	1
3	C	1
4	?	0

Now we have a list of all decks. Let's look into how this variable effects survival.

In [37]:

# Initialize an array
bydeck = np.zeros((len(decks),2))

# Loop through all decks and calculate number of survivors on each.
for i in range(0,len(decks)):
    if len(df[df['deck'] == decks[i]]['Survived']) > 1: # Did anyone survive on the deck?
        bydeck[i] = df[df['deck'] == decks[i]]['Survived'].value_counts().sort_index()
    elif df[df['deck'] == decks[8]]['Survived'].value_counts().index[0] == 0: # if no one survived, return 0
        bydeck[i] = [df[df['deck'] == decks[8]]['Survived'].value_counts(),0]

print('Survival rate of:')
for i in range(0,len(decks)):
    print(f'{decks[i]} {np.sum(bydeck[i]):.2f}')

Survival rate of:
? 687.00
A 15.00
B 47.00
C 59.00
D 33.00
E 32.00
F 13.00
G 4.00
T 1.00

The printout is informative, but not as revealing as a graph. Let's draw some axes.

In [38]:

labels = ['No Survive', 'Survive']

x = np.arange(0,2)  # the label locations
width = 0.5  # the width of the bars

fig, ax = plt.subplots(figsize=(16,9))

for i in range(1,len(decks)):
    b = ax.bar(x + (i-5)*width/5, bydeck[i], width/5, label=decks[i])
    autolabel(b)

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Passengers')
ax.set_title('Survival by Deck')
ax.set_xticks(x)
#ax.set_ylim(0,505)
ax.set_xticklabels(labels)
ax.legend()

plt.show()

In [39]:

# A snippet to generate distinct colors for our bars.
import colorsys
N = 2*len(np.unique(decks))
print('colors: {}'.format(N))
rgb_list = []
HSV_tuples = []
for x in range(1,N+1):
    if (x%2 == 0):
        HSV_tuples.append(((x-1)*1.0/N, 1.0, 0.325))
    else:
        HSV_tuples.append((x*1.0/N, 1.0, 0.75))
RGB_tuples = map(lambda x: colorsys.hsv_to_rgb(*x), HSV_tuples)
for rgb in RGB_tuples:
        color = [int(255*rgb[0]),int(255*rgb[1]),int(255*rgb[2])]
        hxstr = '#'
        for c in color:
            hxstr = hxstr + '{0:0{1}X}'.format(c,2)
        rgb_list.append(hxstr)    

colors: 18

Let's also visualize survival rate per deck.

In [40]:

x = np.arange(-2.0,2.0)  # the label locations
width = 5.0  # the width of the bars

fig, ax = plt.subplots(figsize=(16,9))
rects = []
for i in range(0,len(decks)):
    b = plt.barh(i, np.round(bydeck[i,1]/np.sum(bydeck[i]),2), width/5, label=decks[i],color=rgb_list[2*i])
    nb = plt.barh(i, -np.round(bydeck[i,0]/np.sum(bydeck[i]),2), width/5, label=decks[i],color=rgb_list[2*i+1])
    plt.annotate(f'N = {str(int(np.sum(bydeck[i])))}',xy=[-0.20,i-0.1],c='white')
    rects.append(b)
ax.set_xlabel('\nDeath                                                        Survival\nRate')
ax.set_ylabel('Deck')
ax.set_title('Survival by Deck')
ax.set_yticks(range(0,9))
ax.set_xlim(-1.1,1.1)
ax.set_yticklabels(decks)
ax.grid(axis='x')
plt.show()

It looks like if the deck is known, seems to indicate that the passenger survived. In fact, I would say that if the deck is unknown there is a significant chance the passenger did not survive. To validate my claim, the Pearson r is calculated.

In [41]:

# Translate the categorical deck values into numeric types
deckdum = pd.get_dummies(df['deck'])
decks = np.array(deckdum.columns)

#Define input (y is still survivors)
X = np.array(deckdum)

# Loop through decks and check for correlation
print('Significance  p < 0.05')
print('==========================')
for i in range(0,len(decks)):
    pr = stats.pearsonr(X[:,i],y)
    significant = ( pr[1] < 0.05)
    if significant: # if a significant correlation is found, do stuff
        print(f'Deck {decks[i]} ')
        print(f'\tcorr {pr[0]:.3f}')
        print(f'\tp-value {pr[1]:.3e}')
        print(f'\tSignificant: {significant}')
        model.append(f'deck_{decks[i]}')
        print('+------------------------+')

Significance  p < 0.05
==========================
Deck ? 
	corr -0.317
	p-value 3.091e-22
	Significant: True
+------------------------+
Deck B 
	corr 0.175
	p-value 1.442e-07
	Significant: True
+------------------------+
Deck C 
	corr 0.115
	p-value 6.062e-04
	Significant: True
+------------------------+
Deck D 
	corr 0.151
	p-value 6.233e-06
	Significant: True
+------------------------+
Deck E 
	corr 0.145
	p-value 1.332e-05
	Significant: True
+------------------------+

Discussion¶

There is a statistically significant correlation between staying on Deck B, C, D, E and survival. Conversely, if the deck of the passenger is unknown there is a stronger correlation with not surviving. There is no significant correlation for the other decks A, F, G, & T. The significant decks should be included in our model.

In [42]:

model

Out[42]:

['male', 'port_Q', 'port_S', 'deck_?', 'deck_B', 'deck_C', 'deck_D', 'deck_E']

BY AGE¶

Since we filled in the missing age values, everyone on board now has an age value. Let's visualize the age distribution and investigate different age groups for correlation with surviving the disaster.

Let's make a histogram.

In [43]:

# Defining some parameters for our histogram
nbins = 20 # The number of bins in our histogram. More on this later.
ylim = df['Age'].max() # Maximum value for the histogram's x-axis
binwidth= ylim/nbins # The width of each bin

bins = np.arange(0,ylim,binwidth) # An array of evenly spaced bins "binwidth" apart from 0 to 80 years of age.


thedigital = np.digitize(df['Age'],bins,right=True) # assign each age to a bin
thecount = np.bincount(thedigital) # Count the members in each bin

# Create some axes
plt.figure(figsize=(8,6))
plt.bar(bins+binwidth/2,thecount[1:],width=binwidth*0.95)
plt.xticks(np.arange(0,88,8))
plt.title('Passenger Age Distibution')
plt.ylabel('Frequency')
plt.xlabel('Age (years)')
plt.savefig('./agedist.png',bbox_inches='tight',dpi=300)
plt.show()

Let's calculate some descriptive statistics for discussion.

In [44]:

desc = stats.describe(df['Age'])

stats_ = ['nobs\t','minmax\t','mean\t','variance','skewness','kurtosis']
for i in range(0,len(stats_)):
    print(f'{stats_[i]}:\t{desc[i]}')

mode = stats.mode(df['Age'])
print(f'\nmode\t:\t{mode[0][0]} ({mode[1][0]})')

median = np.median(np.array(df['Age']))
print(f'median\t:\t{median:.1f}')

stdev= np.std(df['Age'])
print(f'Std.Dev\t:\t{stdev:.1f}')

IQR = stats.iqr(df['Age'])
_25pct = desc[2] - IQR/2
_50pct = desc[2]
_75pct = desc[2] + IQR/2
print(f'\nIQR\t:\t{IQR:.1f}')
print(f'25%\t:\t{_25pct:.1f}')
print(f'50%\t:\t{_50pct:.1f}')
print(f'75%\t:\t{_75pct:.1f}')

nobs	:	891
minmax	:	(0.42, 80.0)
mean	:	29.561181143308534
variance:	193.48469069039407
skewness:	0.4117141257971412
kurtosis:	0.2923446392726867

mode	:	19.0 (30)
median	:	28.0
Std.Dev	:	13.9

IQR	:	16.6
25%	:	21.2
50%	:	29.6
75%	:	37.9

This age distribution is bi-modal and skewed right with an IQR of 16.3. The center of the distribution lies between the mode (19) and the mean (30).

Let's now visualize survival by age-group.

In [45]:

dig_sur = np.digitize(df[df['Survived'] == 1]['Age'],bins,right=True)
freq_sur = np.bincount(dig_sur,minlength=len(bins)+1)

dig_die = np.digitize(df[df['Survived'] == 0]['Age'],bins,right=True)
freq_die = np.bincount(dig_die,minlength=len(bins)+1)

xmax = freq_die.max()+5
xlim = [-xmax,xmax]

plt.figure(figsize=(16,9))
plt.barh(bins+binwidth/2,freq_sur[1:],height=binwidth*0.9)
plt.barh(bins+binwidth/2,-freq_die[1:],height=binwidth*0.9)
plt.xlim(xlim)
plt.yticks(np.arange(0,88,8))
plt.title('Age groups on the RMS Titanic')
plt.ylabel('Age')
plt.xlabel('Lost                                     Saved\nPassengers')
plt.savefig('agegroups.png',bbox_inches='tight',dpi=300)
plt.show()

There are three groups that seem to have a correlation with surviving or not surviving. First, those between the ages of 0 and 7 seem to survive more often than not. This could be the result that young people take up less room on lifeboats and they're pretty cute.

Second, is the group of people age 20-27. Their fate seems to be tragic. This age group appears to correlate with not surviving. It could be that these people are full-grown adults but may not have a family of their own yet and therefore a liability. I suspect that males in this age group have sour odds.

Finally, the age group 36-44 also seems to correlate with not surviving. Perhaps this group of people has already raised their children who are now grown and independent. Obsolescence.

Let's check the pearson scores against my interpretation of this graphic.

In [46]:

# Initialize an array to hold which age group each passenger belongs to.
age_arr = np.zeros((len(df),len(bins)+1))

# Fill the array with the age group indicator for each passenger
for i in range(0,len(df)):
    for j in range(0,len(bins)):
        if (df['Age'][i] > bins[j]) & (df['Age'][i] <= bins[j]+binwidth):
            age_arr[i,j] = 1
            break
            
# Let's have a look 
print(bins.astype(int))
print(age_arr[0:10].astype(int))

[ 0  4  8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76]
[[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

In [47]:

bins = np.arange(0,ylim,binwidth)
X = age_arr
total = 0
for i in range(0,age_arr.shape[1]):
    significant = False
    if stats.pearsonr(X[:,i],y)[1] < 0.05:
        significant = True
        print(f'i = {i}')
        print(f'Age {bins[i]:.1f} : {bins[i]+binwidth:.1f} ')
        print(f'\tcorr {stats.pearsonr(X[:,i],y)[0]:.3f}')
        print(f'\tp-value {stats.pearsonr(X[:,i],y)[1]:.3e}')
        print(f'\tSignificant: {significant}')
        print(f'\tPassengers: {np.sum(X[:,i])}')
        model.append(f'{bins[i]:.0f}_{bins[i]+binwidth:.0f}')
        total = total + np.sum(X[:,i])
        print('+------------------------+')       

print(total)

i = 0
Age 0.0 : 4.0 
	corr 0.130
	p-value 1.021e-04
	Significant: True
	Passengers: 40.0
+------------------------+
i = 6
Age 24.0 : 28.0 
	corr -0.084
	p-value 1.198e-02
	Significant: True
	Passengers: 127.0
+------------------------+
167.0

/usr/pkg/lib/python3.8/site-packages/scipy/stats/stats.py:3845: PearsonRConstantInputWarning: An input array is constant; the correlation coefficent is not defined.
  warnings.warn(PearsonRConstantInputWarning())

Discussion¶

There are two age groups that have a statistically significant correlation with survival on the RMS Titanic. Those age 0 - 9 are correlated with surviving. Those age 18-27 are correlated with not surviving. There are no other age groups that have a correlation at this confidence level.

In [48]:

model

Out[48]:

['male',
 'port_Q',
 'port_S',
 'deck_?',
 'deck_B',
 'deck_C',
 'deck_D',
 'deck_E',
 '0_4',
 '24_28']

BY TICKET PREFIX¶

Some tickets have a prefix. Maybe they mean something that could have an effect on whether or not the passenger survived. Let's peel off the prefixes and take a look.

In [49]:

def get_prefix(PD_SERIES):
    splt = PD_SERIES.split(' ')
    tmp = splt
    if len(splt) > 1:      
        tmp = splt[0]
        tmp = tmp.replace('STON/O2','SOTON/O2')
        tmp = tmp.replace('STON/O','SOTON/O2')
        tmp = tmp.replace('O2.','O2')

        tmp = tmp.upper()
    else:
        tmp = ''
    return tmp

In [50]:

df['Ticket'].apply(get_prefix).value_counts()

Out[50]:

              665
PC             60
C.A.           27
SOTON/O2       20
A/5            10
SC/PARIS        9
W./C.           9
SOTON/O.Q.      8
CA.             8
SOTON/OQ        7
A/5.            7
CA              6
C               5
S.O.C.          5
F.C.C.          5
PP              3
A/4             3
SC/AH           3
A/4.            3
S.O./P.P.       3
A.5.            2
S.C./PARIS      2
P/PP            2
WE/P            2
A./5.           2
C.A./SOTON      1
FA              1
F.C.            1
A4.             1
W/C             1
SO/C            1
A/S             1
S.O.P.          1
W.E.P.          1
S.C./A.4.       1
SC              1
S.W./PP         1
S.P.            1
SW/PP           1
SCO/W           1
Name: Ticket, dtype: int64

Discussion¶

This is a waste of time. There aren't enough passengers with any specific prefix to make any meaningful inference about the group.

BY TITLE¶

Each passenger has a title preceding his or her name. Perhaps the passenger's title is correlated with survival. We did find correlation for sex and age. This has an air of linear dependence, but let's look anyway.

In [51]:

# Peel off and count titles
df['title'] = df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].replace(' ',''))
df['title'].value_counts()

Out[51]:

Mr             517
Miss           182
Mrs            125
Master          40
Dr               7
Rev              6
Col              2
Mlle             2
Major            2
Sir              1
Mme              1
theCountess      1
Jonkheer         1
Ms               1
Capt             1
Don              1
Lady             1
Name: title, dtype: int64

In [52]:

# Group up some of the less common titles
miss = ['Ms','Mlle']
mrs = ['Mme', 'Lady']
prestigious = ['Dr','Rev','Major','Col','Sir','theCountess','Capt','Jonkheer','Don']

df['title'] = df['title'].replace(miss,'Miss')
df['title'] = df['title'].replace(mrs,'Mrs')
df['title'] = df['title'].replace(prestigious,'Prestigious')
df['title'].value_counts()

Out[52]:

Mr             517
Miss           185
Mrs            127
Master          40
Prestigious     22
Name: title, dtype: int64

In [53]:

# Creating an array to hold the titles and whether or not they surived.
titles = df['title'].value_counts().index.tolist()
bytitle = np.zeros((len(titles),2))
# Populate the array
for i in range(0,len(titles)):
    bytitle[i] = df[df['title'] == titles[i]]['Survived'].value_counts().sort_index()

### Visualizing survival vs title

#Define some parameters for the graph
x = np.arange(-2.0,2.0)  # the label locations
width = 5.0  # the width of the bars

# Create some axes
fig, ax = plt.subplots(figsize=(16,9))
rects = []
# PLOT!
for i in range(0,len(titles)):
    b = plt.barh(i, np.round(bytitle[i,1]/np.sum(bytitle[i]),2), width/5, label=titles[i],color=rgb_list[2*i])
    nb = plt.barh(i, -np.round(bytitle[i,0]/np.sum(bytitle[i]),2), width/5, label=titles[i],color=rgb_list[2*i+1])
    rects.append(b)
    plt.annotate(f'N = {str(int(np.sum(bytitle[i])))}',xy=[-0.18,i-0.1],c='white')
ax.set_xlabel('\nNo Survival                                                        Survival\nRate')
ax.set_ylabel('Title')
ax.set_title('Survival by Title')
ax.set_yticks(range(0,len(titles)))
ax.set_xlim(-1.1,1.1)
ax.set_yticklabels(titles)
ax.grid(axis='x')
plt.show()

It appears that there is no correlation with having a prestigious title and surviving or not. Also it appears that having any title except Mr. is correlated with surviving. Mr is strongly correlated with drowning/hypothermia. Let's calculate the pearson R for confirmation.

In [54]:

# Prepare an array to hold our data
tit_arr = np.zeros((len(df),len(titles)))

# Populate array with title indicators
for i in range(0,len(df)):
    for j in range(0,len(titles)):
        if df['title'][i] == titles[j]:
            tit_arr[i,j] = 1
            break
tit_arr[0:5]

Out[54]:

array([[1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.]])

In [55]:

# Here I loop through the title array.
# Each title is checked for correlation with the Survived variable.

X = tit_arr
total = 0
for i in range(0,tit_arr.shape[1]):
    significant = False
    if stats.pearsonr(X[:,i],y)[1] < 0.05:
        significant = True
        print(f'i = {i}')
        print(f'Title {titles[i]} ')
        print(f'\tcorr {stats.pearsonr(X[:,i],y)[0]:.2f}')
        print(f'\tp-value {stats.pearsonr(X[:,i],y)[1]:.1e}')
        print(f'\tSignificant: {significant}')
        print(f'\tPassengers: {np.sum(X[:,i])}')
        model.append(f'{titles[i]}')
        total = total + np.sum(X[:,i])
        print('+------------------------+')       

print(total)

i = 0
Title Mr 
	corr -0.55
	p-value 2.4e-71
	Significant: True
	Passengers: 517.0
+------------------------+
i = 1
Title Miss 
	corr 0.34
	p-value 6.7e-25
	Significant: True
	Passengers: 185.0
+------------------------+
i = 2
Title Mrs 
	corr 0.34
	p-value 2.7e-26
	Significant: True
	Passengers: 127.0
+------------------------+
i = 3
Title Master 
	corr 0.09
	p-value 1.1e-02
	Significant: True
	Passengers: 40.0
+------------------------+
869.0

In [56]:

model

Out[56]:

['male',
 'port_Q',
 'port_S',
 'deck_?',
 'deck_B',
 'deck_C',
 'deck_D',
 'deck_E',
 '0_4',
 '24_28',
 'Mr',
 'Miss',
 'Mrs',
 'Master']

BY SIBSP¶

In [57]:

df['SibSp'].value_counts().sort_index().tolist()

Out[57]:

[608, 209, 28, 16, 18, 5, 7]

In [58]:

sbsps = np.array(df['SibSp'].value_counts().sort_index().index)
bysbsp = np.zeros([len(sbsps),2])
for i in range(0,len(sbsps)):
    bysbsp[i] = df[df['SibSp'] == sbsps[i]]['Survived'].value_counts().sort_index()

In [59]:

print('Survival rate of SibSp')
for i in range(0,len(sbsps)):
    print(f'\t = {sbsps[i]}: {bysbsp[i][1]/np.sum(bysbsp[i]):.2f}')

Survival rate of SibSp
	 = 0: 0.35
	 = 1: 0.54
	 = 2: 0.46
	 = 3: 0.25
	 = 4: 0.17
	 = 5: 0.50
	 = 8: 0.50

In [60]:

ssp_arr = np.zeros((len(df),len(sbsps)))
for i in range(0,len(df)):
    for j in range(0,len(sbsps)):
        if df['SibSp'][i] == sbsps[j]:
            ssp_arr[i,j] = 1
            break
ssp_arr[0:5]

Out[60]:

array([[0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.]])

In [61]:

X = ssp_arr
total = 0
for i in range(0,ssp_arr.shape[1]):
    significant = False
    if stats.pearsonr(X[:,i],y)[1] < 0.05:
        significant = True
        print(f'i = {i}')
        print(f'SibSp: {sbsps[i]} ')
        print(f'\tcorr {stats.pearsonr(X[:,i],y)[0]:.2f}')
        print(f'\tp-value {stats.pearsonr(X[:,i],y)[1]:.1e}')
        print(f'\tSignificant: {significant}')
        print(f'\tPassengers: {np.sum(X[:,i])}')
        model.append(f'sbsp_{sbsps[i]}')
        total = total + np.sum(X[:,i])
        print('+------------------------+')       

print(total)

i = 0
SibSp: 0 
	corr -0.12
	p-value 5.3e-04
	Significant: True
	Passengers: 608.0
+------------------------+
i = 1
SibSp: 1 
	corr 0.17
	p-value 2.0e-07
	Significant: True
	Passengers: 209.0
+------------------------+
i = 6
SibSp: 8 
	corr -0.07
	p-value 3.6e-02
	Significant: True
	Passengers: 7.0
+------------------------+
824.0

In [62]:

model

Out[62]:

['male',
 'port_Q',
 'port_S',
 'deck_?',
 'deck_B',
 'deck_C',
 'deck_D',
 'deck_E',
 '0_4',
 '24_28',
 'Mr',
 'Miss',
 'Mrs',
 'Master',
 'sbsp_0',
 'sbsp_1',
 'sbsp_8']

BY PARCH¶

In [63]:

parchs = np.array(df['Parch'].value_counts().sort_index().index)
byparch = np.zeros([len(parchs),2])
for i in range(0,len(parchs)):
    byparch[i] = df[df['Parch'] == parchs[i]]['Survived'].value_counts().sort_index()

In [64]:

print('Survival rate of: Parch')
for i in range(0,len(parchs)):
    print(f'\t = {parchs[i]}: {byparch[i][1]/np.sum(byparch[i]):.2f}')

Survival rate of: Parch
	 = 0: 0.34
	 = 1: 0.55
	 = 2: 0.50
	 = 3: 0.60
	 = 4: 0.50
	 = 5: 0.20
	 = 6: 0.50

In [65]:

pch_arr = np.zeros((len(df),len(parchs)))
for i in range(0,len(df)):
    for j in range(0,len(parchs)):
        if df['Parch'][i] == parchs[j]:
            pch_arr[i,j] = 1
            break
pch_arr[0:5]

Out[65]:

array([[1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0.]])

In [66]:

X = pch_arr
total = 0
for i in range(0,pch_arr.shape[1]):
    significant = False
    if stats.pearsonr(X[:,i],y)[1] < 0.05:
        significant = True
        print(f'i = {i}')
        print(f'Parch: {parchs[i]} ')
        print(f'\tcorr {stats.pearsonr(X[:,i],y)[0]:.2f}')
        print(f'\tp-value {stats.pearsonr(X[:,i],y)[1]:.1e}')
        print(f'\tSignificant: {significant}')
        print(f'\tPassengers: {np.sum(X[:,i])}')
        model.append(f'pa_{parchs[i]}')
        total = total + np.sum(X[:,i])
        print('+------------------------+')       

print(total)

i = 0
Parch: 0 
	corr -0.15
	p-value 1.0e-05
	Significant: True
	Passengers: 678.0
+------------------------+
i = 1
Parch: 1 
	corr 0.13
	p-value 5.9e-05
	Significant: True
	Passengers: 118.0
+------------------------+
i = 2
Parch: 2 
	corr 0.08
	p-value 2.5e-02
	Significant: True
	Passengers: 80.0
+------------------------+
876.0

BY PCLASS¶

In [67]:

pcls = np.array(df['Pclass'].value_counts().sort_index().index)
pcls

Out[67]:

array([1, 2, 3])

In [68]:

pcls_arr = np.zeros((len(df),len(pcls)))
for i in range(0,len(df)):
    for j in range(0,len(pcls)):
        if df['Pclass'][i] == pcls[j]:
            pcls_arr[i,j] = 1
            break
pcls_arr[0:5]

Out[68]:

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])

In [69]:

X = pcls_arr
total = 0
for i in range(0,pcls_arr.shape[1]):
    significant = False
    if stats.pearsonr(X[:,i],y)[1] < 0.05:
        significant = True
        print(f'i = {i}')
        print(f'Pclass: {pcls[i]} ')
        print(f'\tcorr {stats.pearsonr(X[:,i],y)[0]:.2f}')
        print(f'\tp-value {stats.pearsonr(X[:,i],y)[1]:.1e}')
        print(f'\tSignificant: {significant}')
        print(f'\tPassengers: {np.sum(X[:,i])}')
        model.append(f'class_{pcls[i]}')
        total = total + np.sum(X[:,i])
        print('+------------------------+')       

print(total)

i = 0
Pclass: 1 
	corr 0.29
	p-value 3.2e-18
	Significant: True
	Passengers: 216.0
+------------------------+
i = 1
Pclass: 2 
	corr 0.09
	p-value 5.3e-03
	Significant: True
	Passengers: 184.0
+------------------------+
i = 2
Pclass: 3 
	corr -0.32
	p-value 5.5e-23
	Significant: True
	Passengers: 491.0
+------------------------+
891.0

Modeling¶

Below I use the information gleaned from exploration to build a model of who survives the disaster based upon variables considered to be most signicant. A function is defined to create an array of 0s and 1s from a dataframe which indicate whether or not the passenger has this particular feature. The array is then used to fit a logistic regression model.

In [70]:

model

Out[70]:

['male',
 'port_Q',
 'port_S',
 'deck_?',
 'deck_B',
 'deck_C',
 'deck_D',
 'deck_E',
 '0_4',
 '24_28',
 'Mr',
 'Miss',
 'Mrs',
 'Master',
 'sbsp_0',
 'sbsp_1',
 'sbsp_8',
 'pa_0',
 'pa_1',
 'pa_2',
 'class_1',
 'class_2',
 'class_3']

In [71]:

# View the format of the model array
pd.DataFrame(columns=model)

Out[71]:

	male	port_Q	port_S	deck_?	deck_B	deck_C	deck_D	deck_E	0_4	24_28	...	Master	sbsp_0	sbsp_1	sbsp_8	pa_0	pa_1	pa_2	class_1	class_2	class_3

0 rows × 23 columns

In [72]:

def initialize_model(dataframe):
    model_arr = np.zeros((len(dataframe),len(model)),dtype=int)
    for i in range(0,len(dataframe)):
        
        ### SEX
        model_arr[i,0] = (dataframe['Sex'][i] == 'male')        
        
        ### PORT
        #if dataframe['Embarked'][i] == 'Q':
        #    model_arr[i,1] = 1
        if dataframe['Embarked'][i] == 'S':
            model_arr[i,2] = 1    
        
        ### DECK
        dataframe['Cabin'].replace(np.nan,'??',inplace=True)
        dataframe['deck'] = dataframe['Cabin'].apply(get_deck)
        if dataframe['deck'][i] == '?':
            model_arr[i,3] = 1
        if dataframe['deck'][i] == 'B':
            model_arr[i,4] = 1
        if dataframe['deck'][i] == 'C':
            model_arr[i,5] = 1
        if dataframe['deck'][i] == 'D':
            model_arr[i,6] = 1
        if dataframe['deck'][i] == 'E':
            model_arr[i,7] = 1
        

        
        ### TITLE
        miss = ['Ms','Mlle']
        mrs = ['Mme', 'Lady']
        dataframe['title'] = dataframe['Name'].apply(lambda x: x.split(',')[1].split('.')[0].replace(' ',''))
        dataframe['title'] = dataframe['title'].replace(miss,'Miss')
        dataframe['title'] = dataframe['title'].replace(mrs,'Mrs')
        if dataframe['title'][i] == 'Mr':
            model_arr[i,8] = 1
        if dataframe['title'][i] == 'Miss':
            model_arr[i,9] = 1
        if dataframe['title'][i] == 'Mrs':
            model_arr[i,10] = 1
        #if dataframe['title'][i] == 'Master':
        #    model_arr[i,11] = 1
        
        ### SIBSP
        if dataframe['SibSp'][i] == 0:
            model_arr[i,12] = 1        
        if dataframe['SibSp'][i] == 1:
            model_arr[i,13] = 1   
        #if dataframe['SibSp'][i] == 8:
        #   model_arr[i,14] = 1
        
        ### PARCH
        if dataframe['Parch'][i] == 0:
            model_arr[i,15] = 1                       
        if dataframe['Parch'][i] == 1:
            model_arr[i,16] = 1         
        #if dataframe['Parch'][i] == 2:
        #    model_arr[i,17] = 1
        
        ### PCLASS
        if dataframe['Pclass'][i] == 1:
            model_arr[i,18] = 1
        #if dataframe['Pclass'][i] == 2:
        #    model_arr[i,19] = 1             
        if dataframe['Pclass'][i] == 3:
            model_arr[i,20] = 1 
            
        ### AGE
        if dataframe['Age'].isnull().any():
            nullage = dataframe[dataframe['Age'].isnull()].index.tolist() # Bookmark indices of NaNs
            df2 = dataframe.loc[:,['Pclass','Sex','SibSp','Parch','Fare','Embarked','Age']]

            df2['Sex'].replace(['female','male'],[0,1],inplace=True)
            df2['Embarked'].replace(['C','Q','S'],[0,1,2],inplace=True)

            X = np.array(df2[df2['Age'].isnull()])
            X = scaler.transform(X[:,:6])
            dataframe.iloc[nullage,5] = knn.predict(X) # Use the same model we built earlier to estimate age
      
        if (dataframe['Age'][i] > 0) & (df['Age'][i] <= 4):
            model_arr[i,21] = 1
            
        #if (dataframe['Age'][i] > 32) & (df['Age'][i] <= 36):
        #    model_arr[i,22] = 1
        
    return model_arr            

In [73]:

# Load the dataframe and initialize the model array
df_train = pd.read_csv('train.csv')
train_arr = initialize_model(df_train)
print(train_arr[0:5])

[[1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0]
 [0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0]
 [0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0]
 [1 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0]]

In [78]:

# Define the model input
X = train_arr
y = np.array(df_train['Survived'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)


# define the model
clf_log = LogisticRegression(multi_class='multinomial',class_weight=None,solver='newton-cg',max_iter=100)
clf_log.fit(X_train, y_train)

# Score the model
print('Training Set')
print(f'\tf1-score: {f1_score(y_test,clf_log.predict(X_test)):.2f}')
print(f'\tacc. score: {accuracy_score(y_test,clf_log.predict(X_test)):.2f}\n')

# Load the competition set
df_test = pd.read_csv('test.csv')
df_true = pd.read_csv('gender_submission.csv')

# Initialize the competition data
test_arr = initialize_model(df_test)

# Predict survival of competition data and score
print('Competition Set')
print(f'\tAccuracy: {accuracy_score(df_true.iloc[:,1],clf_log.predict(test_arr)):.3f}')
print(f'\tf1-score: {f1_score(df_true.iloc[:,1],clf_log.predict(test_arr)):.3f}')


# Make a pretty picture
c_mat = confusion_matrix(df_true.iloc[:,1],clf_log.predict(test_arr))
disp = ConfusionMatrixDisplay(confusion_matrix = c_mat, display_labels=['No Survive','Survive'])
disp.plot(values_format='d')
plt.savefig('confusion.png',bbox_inches='tight',dpi=300)

Training Set
	f1-score: 0.74
	acc. score: 0.80

Competition Set
	Accuracy: 0.897
	f1-score: 0.855

In [74]:

# Prepare a dataframe to submit to Kaggle.com
df_submit = pd.DataFrame({'PassengerId':df_test['PassengerId'],'Survived':clf_log.predict(test_arr).tolist()})
df_submit.set_index('PassengerId',inplace=True)
df_submit.to_csv('kagglesubmission.csv')
df_submit[0:5]

Out[74]:

	Survived
PassengerId
892	0
893	1
894	0
895	0
896	1

Rankings¶

Let's see how I did in the rankings.

In [80]:

# Load the leaderboard and print my rank.
df_leaderboard = pd.read_csv('./leaderboard/titanic-publicleaderboard.csv')
df_leaderboard[df_leaderboard['TeamName'] == 'Joey #5']

Out[80]:

	TeamId	TeamName	SubmissionDate	Score
5065	5676964	Joey #5	2021-01-13 16:48:41	0.78468

Now visualize with a histogram.

In [82]:

# Define some parameters
nbins = 65 # Number of bins
ylim = 0.855 # Maximum value for the histogram's y-axis
ymin = 0.7 # Minimum value for the histogram's y-axis
binwidth= (ylim-ymin)/nbins # The width of each bin
bins = np.arange(0.7,ylim,binwidth) # An array of evenly spaced bins "binwidth" apart

# Prepare data
thedigital = np.digitize(df_leaderboard['Score'],bins,right=True) # assign each age to a bin
thecount = np.bincount(thedigital) # Count the members in each bin

# Create some axes
plt.figure(figsize=(8,6))

# Draw the bars
plt.bar(bins[0:35]+binwidth/2,thecount[1:36],width=binwidth,color = 'C0')
plt.bar(bins[35:36]+binwidth/2,thecount[36:37],width=binwidth,color = 'C1')
plt.bar(bins[36:]+binwidth/2,thecount[37:],width=binwidth,color = 'C0')

# Label
plt.xticklabels=bins[33:66]
plt.xlim=(0.7,0.85)
plt.title('Competition Rankings')
plt.ylabel('Frequency')
plt.xlabel('Kaggle Score')

# Save and view
plt.savefig('./kaggle.png',bbox_inches='tight',dpi=300)
plt.show()

Pretty good for first attempt at this competition.