In this notebook, data on the passengers aboard the RMS Titanic is loaded and analyzed for significant correlation with surviving the disaster. Ultimately, models are produced to predict passenger survival from details of their passage.
Let's import the libraries we'll need.
import pandas as pd
import numpy as np
import os
from scipy import stats
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn import neighbors
from sklearn import preprocessing
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.rcParams.update({'font.size': 16})
The data is imported and the first ten rows are displayed.
df = pd.read_csv('train.csv')
print(df.shape)
no_passengers = df.shape[0]
print(df.dtypes)
df[0:10]
(891, 12)
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
Three columns contain NaN values: Age, Cabin, and Embarked. Let's address the port of embarkation as well as age. First, the fraction of passengers with no recorded cabin:
len(df[df['Cabin'].isnull()])/len(df)
0.7710437710437711
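The same check can be run for every column at once. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `df`, and its values are invented) to show how `isnull().mean()` gives the missing fraction per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the passenger table (values are made up for illustration)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

# Fraction of missing entries per column, keeping only columns with any NaN
missing = toy.isnull().mean()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to the real `df`, this would list Cabin, Age, and Embarked with their respective fractions.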
We inspect the Embarked column for missing values.
df[df['Embarked'].isnull()]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
Two missing values are found. Let's build a model based on everyone else on board and see if we can guess where Miss Icard and Mrs. Stone boarded. We assume that a passenger's fare, class, and port of embarkation are related. Let's define the variables.
df2 = df.loc[:,['Pclass','Fare','Embarked']]
df2 = df2.dropna() # Dropping rows with a missing value (the two passengers with no port)
# Prepare X
X = np.array(df2.loc[:,['Pclass','Fare']]) # Casting to an np.arr
scaler = preprocessing.Normalizer().fit(X)
X = scaler.transform(X) # Normalizer rescales each row (passenger) to unit L2 norm
# Prepare y
ports = ['C','Q','S']
emb_dum = df2['Embarked'].replace(ports,[0,1,2]) # Transforming categorical data into numeric
y = np.array(emb_dum) # Casting to an np.arr
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15)
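It is worth being clear about what `Normalizer` does to the inputs: it works row by row, not column by column. The sketch below contrasts it with `StandardScaler` on three invented `[Pclass, Fare]` rows. `StandardScaler` is shown only as a common alternative and is not what the notebook uses.

```python
import numpy as np
from sklearn import preprocessing

# Three hypothetical passengers as [Pclass, Fare]; values are illustrative only
X_demo = np.array([[1.0, 80.0], [1.0, 50.0], [3.0, 8.0]])

# Normalizer: each ROW is divided by its own L2 norm
row_scaled = preprocessing.Normalizer().fit_transform(X_demo)
# StandardScaler: each COLUMN is shifted to mean 0 and scaled to unit variance
col_scaled = preprocessing.StandardScaler().fit_transform(X_demo)

print(np.linalg.norm(row_scaled, axis=1))  # length of every row after Normalizer
print(col_scaled.mean(axis=0))             # mean of every column after StandardScaler
```

With row-wise normalization the model effectively sees the ratio of class to fare rather than their absolute values, which is a modelling choice to keep in mind when reading the results.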
A decision tree classifier, clf, is defined and fit to the training data.
fig = plt.figure(figsize=(25,20))
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best')
<Figure size 1800x1440 with 0 Axes>
f1_score(y_test,clf.predict(X_test),average='weighted')
0.9106667918270284
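The `average='weighted'` option computes one F1 score per class and averages them with weights proportional to how many true samples each class has. A minimal sketch on made-up labels (not the notebook's data) that reproduces the weighted score by hand:

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for three classes (0='C', 1='Q', 2='S'); not the notebook's data
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 2, 1, 1, 2, 2, 2, 0])

per_class = f1_score(y_true, y_pred, average=None)    # one F1 per class
support = np.bincount(y_true)                         # true samples per class
manual = np.sum(per_class * support) / support.sum()  # support-weighted mean

print(per_class, manual)
```

Because most passengers boarded at Southampton, the weighted score is dominated by how well that class is predicted.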
plt.figure(figsize=(8,4))
c_mat = confusion_matrix(y_test,clf.predict(X_test))
disp = ConfusionMatrixDisplay(confusion_matrix = c_mat, display_labels=ports)
disp.plot(values_format='d')
plt.savefig('./embarked_dt.png',bbox_inches='tight',dpi=300)
<Figure size 576x288 with 0 Axes>
import pygraphviz as pgv # Needed for AGraph below; assumes the pygraphviz package is installed
dot_data = tree.export_graphviz(clf, out_file='embarked_dt.dot',
                                feature_names=['Pclass','Fare'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = pgv.AGraph('./embarked_dt.dot')
#graph.draw()
Predicting the port of our passengers:
x = scaler.transform(np.array([1,80]).reshape(1,-1)) # Both passengers paid 80 pounds for a first class ticket
print('[0,1,2]')
print(ports)
print(clf.predict(x))
[0,1,2] ['C', 'Q', 'S'] [2]
The model predicts Southampton as the port of embarkation for both passengers, so Southampton is imputed.
df['Embarked'] = df['Embarked'].fillna('S') # Assign back rather than filling a column selection in place
Let's check to see if we missed anyone.
df[df['Embarked'].isnull()]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
We did not miss anyone. Let's now turn to age.
df[df['Age'].isnull()]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
859 | 860 | 0 | 3 | Razi, Mr. Raihed | male | NaN | 0 | 0 | 2629 | 7.2292 | NaN | C |
863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
177 rows × 12 columns
Oh dear. A significant number of passengers (177 of 891) have no recorded age. Again we will build a model, this time to estimate the unknown ages. A k-nearest-neighbors regressor is fit to the passengers whose age is known. First let's note the indices of the passengers we will be targeting.
nullage = df[df['Age'].isnull()].index.tolist()
nullage[0:5]
[5, 17, 19, 26, 28]
matplotlib.rcParams.update({'font.size': 20})
df2 = df.loc[:,['Pclass','Sex','SibSp','Parch','Fare','Embarked','Age']] # These are our variables of interest
# Quick and dirty: Categorical variables are replaced with an integer
df2['Sex'] = df2['Sex'].replace(['female','male'],[0,1])
df2['Embarked'] = df2['Embarked'].replace(['C','Q','S'],[0,1,2])
# A copy of our model inputs is made for graphing later
df3 = df2.copy()
# Remove NaNs and cast to np.arr
df2.dropna(inplace=True)
data = np.array(df2)
# Defining model inputs
X = data[:,:6]
y = data[:,6]
list_knn = []
its=1000
neighbor_range = range(1,20)
index = 0
for i in neighbor_range:
    RMSE_LIST = []
    for j in range(0,its):
        X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
        # Defining a regressor.
        n_neighbors = i
        knn = neighbors.KNeighborsRegressor(n_neighbors,weights='distance',algorithm='brute',n_jobs=-1)
        knn.fit(X_train,y_train)
        E = np.subtract(knn.predict(X_test),y_test)
        SE = np.power(E,2)
        MSE = np.sum(SE/len(y_test))
        RMSE = np.sqrt(MSE)
        RMSE_LIST.append(RMSE)
        os.sys.stdout.write(f'{index}\r') # Progress counter
        index = index + 1
    list_knn.append(np.array(RMSE_LIST).mean()) # Mean RMSE over all splits for this n
print('\nDONE')
plt.figure(figsize=(10,7))
plt.plot(neighbor_range,list_knn)
plt.title('n Selection for Age Model')
plt.ylabel('RMSE (Years)')
plt.xlabel('n Neighbors')
plt.xticks(neighbor_range)
#plt.savefig('./age_rmse.png',bbox_inches='tight',dpi=300)
plt.show()
18999 DONE
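The repeated random splits above can also be expressed with scikit-learn's cross-validation helper. This is a sketch of that alternative on synthetic data (`X_demo` and `y_demo` are invented stand-ins for the passenger features and ages), not the procedure used in this notebook.

```python
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the passenger features and ages
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 30 + 20 * X_demo[:, 0] + rng.normal(0, 2, size=200)

cv_rmse = {}
for k in range(1, 11):
    knn = neighbors.KNeighborsRegressor(n_neighbors=k, weights='distance')
    # scikit-learn reports negated MSE so that larger is always better
    neg_mse = cross_val_score(knn, X_demo, y_demo, cv=5,
                              scoring='neg_mean_squared_error')
    cv_rmse[k] = np.sqrt(-neg_mse).mean()

# Pick the number of neighbors with the lowest mean RMSE across folds
best_k = min(cv_rmse, key=cv_rmse.get)
print(best_k, round(cv_rmse[best_k], 2))
```

Five folds give far fewer fits than 1000 random splits per setting, at the cost of a noisier estimate for each value of n.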
df2 = df.loc[:,['Pclass','Sex','SibSp','Parch','Fare','Embarked','Age']] # These are our variables of interest
# Quick and dirty: Categorical variables are replaced with an integer
df2['Sex'] = df2['Sex'].replace(['female','male'],[0,1])
df2['Embarked'] = df2['Embarked'].replace(['C','Q','S'],[0,1,2])
# A copy of our model inputs is made for graphing later
df3 = df2.copy()
# Remove NaNs and cast to np.arr
df2.dropna(inplace=True)
data = np.array(df2)
# Defining model inputs
X = data[:,:6]
y = data[:,6]
# The inputs have many different scales.
# They are normalized.
scaler = preprocessing.Normalizer().fit(X)
T = np.array(df3[df3['Age'].isnull()])
T = scaler.transform(T[:,:6])
X = scaler.transform(X)
model_RMSE = []
for i in range (0,1000):
    X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
    # Defining a regressor.
    n_neighbors = 4
    knn = neighbors.KNeighborsRegressor(n_neighbors,weights='distance',algorithm='brute',n_jobs=-1)
    knn.fit(X_train,y_train)
    E = np.subtract(knn.predict(X_test),y_test)
    SE = np.power(E,2)
    MSE = np.sum(SE)/len(y_test)
    RMSE = np.sqrt(MSE)
    model_RMSE.append(RMSE)
print(f'Model RMSE: {RMSE:.1f}') # RMSE of the final split only
# Plot the known ages with the modeled ages for comparison of shape.
plt.figure(figsize=(8,6))
plt.hist(df['Age'],color='C0',bins=np.arange(0,80,10))
plt.hist(knn.predict(X_test),color='C1',bins=np.arange(0,80,10))
plt.legend(['Known Ages','Estimated Ages'])
plt.title('Shape Comparison')
plt.ylabel('Frequency')
plt.xlabel('Age')
#plt.savefig('./agemodeldist.png', dpi=300)
# Replace unknown ages with modeled ones.
df.iloc[nullage,5] = knn.predict(T)
plt.show()
Model RMSE: 15.6
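The assignment `df.iloc[nullage,5] = knn.predict(T)` relies on Age being the sixth column. A label-based variant avoids that dependency. The sketch below uses a toy frame and placeholder numbers in place of real model predictions.

```python
import numpy as np
import pandas as pd

# Toy frame; the "predictions" are placeholder numbers, not model output
toy = pd.DataFrame({'Name': ['a', 'b', 'c', 'd'],
                    'Age': [22.0, np.nan, 35.0, np.nan]})
predicted = np.array([28.5, 41.0])   # one value per missing age, in row order

mask = toy['Age'].isnull()           # True where the age is unknown
toy.loc[mask, 'Age'] = predicted     # fill by column label, not position

print(toy)
```

The predictions must be in the same row order as the masked rows, which holds when they come from the rows selected by the same mask.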
bins = np.linspace(-3,3,13)
plt.figure(figsize=(8,6))
#plt.hist(stats.zscore(model_RMSE),bins=bins)
plt.hist(model_RMSE)
plt.title('Model Accuracy')
plt.ylabel('Frequency\n(n=1000)')
plt.xlabel('RMSE (years)')
stats.describe(model_RMSE)
plt.savefig('agermsefreq.png',bbox_inches='tight',dpi=300)
print(np.mean(model_RMSE))
print(np.std(model_RMSE))
14.134353617595822 0.8135773875938206
The shapes of these distributions are visually consistent. The mean RMSE over the 1000 random splits is about 14 years (standard deviation 0.8 years), or about half a human generation. There is no immediate reason to reject these results.
df[df['Age'].isnull()]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Every passenger now has an age and a port of Embarkation. Moving on, we begin to build our model of who survived.
An empty list is defined. As I move through each variable, those that have significant correlation with the Survived variable will be appended to the list.
model = []
An exploration is made into the correlation and significance of sex on surviving the shipwreck.
sexes = ['male','female']
bysex = np.zeros((2,2))
for i in range(0,len(sexes)):
    bysex[i] = df[df['Sex'] == sexes[i]]['Survived'].value_counts().sort_index()
bysex
array([[468., 109.], [ 81., 233.]])
The first row of this array is male passengers and the second row is female passengers. The first column is the number of passengers who did not survive; the second column is the number who did. It looks like males were less likely to survive than females. Let's check the rates.
males_rate = bysex[0][1]/np.sum(bysex[0])
females_rate = bysex[1][1]/np.sum(bysex[1])
print('Survival rate of:')
print('\tmales:\t\t',f'{males_rate:.3f}')
print('\tfemales:\t',f'{females_rate:.3f}')
Survival rate of:
    males:       0.189
    females:     0.742
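The same rates fall out of a single `groupby`, because the mean of a 0/1 column is the fraction of ones. The sketch below rebuilds a minimal Sex/Survived table from the counts in the `bysex` array. The frame `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Rebuild a minimal Sex/Survived table from the counts in bysex
toy = pd.DataFrame({
    'Sex': ['male'] * 577 + ['female'] * 314,
    'Survived': [0] * 468 + [1] * 109 + [0] * 81 + [1] * 233,
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
rates = toy.groupby('Sex')['Survived'].mean()
print(rates.round(3))
```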
Indeed, there is quite a contrast between the two rates. Let's visualize survivorship.
labels = ['No Survive', 'Survive']
male = bysex[0]
female = bysex[1]
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots(figsize=(10,8))
rects1 = ax.bar(x - width/2, male, width, label='Men')
rects2 = ax.bar(x + width/2, female, width, label='Women')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Passengers')
ax.set_title('Survival by Sex')
ax.set_xticks(x)
ax.set_ylim(0,500)
ax.set_xticklabels(labels)
ax.legend()
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)
plt.show()
There are more males on board than there are females. Let's do an apples to apples comparison and visualize rate.
labels = ['No Survive', 'Survive']
male = np.round(bysex[0]/np.sum(bysex[0]),2)
female = np.round(bysex[1]/np.sum(bysex[1]),2)
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots(figsize=(10,8))
rects1 = ax.bar(x - width/2, male, width, label='Male')
rects2 = ax.bar(x + width/2, female, width, label='Female')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Rate')
ax.set_title('Survival by Sex')
ax.set_xticks(x)
ax.set_ylim(0,1)
ax.set_xticklabels(labels)
ax.legend()
autolabel(rects1)
autolabel(rects2)
plt.show()
Now we calculate Pearson's r for being a male and surviving the disaster for inclusion in the final model.
df['male'] = (df['Sex'] == 'male').astype(int) # 1 if the passenger is male, 0 otherwise
# Define the input
X = np.array(df['male'])
y = np.array(df['Survived'])
# Calculate correlation
pear = stats.pearsonr(X,y)
print('is_male vs Survived')
print(f'\tCorrelation: {pear[0]:.2f}')
print(f'\tp-val: {pear[1]:.2e}')
is_male vs Survived
    Correlation: -0.54
    p-val: 1.41e-69
There is a statistically significant negative correlation between being male and survival. Our model should include whether or not the passenger is male.
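For two 0/1 variables, Pearson's r reduces to the phi coefficient, which can be computed directly from the four cell counts of the 2x2 table. As a cross-check on the correlation above, the sketch below plugs in the counts from the `bysex` array. The variable names are introduced here for illustration.

```python
import math

# Counts from the bysex array
male_died, male_survived = 468, 109
female_died, female_survived = 81, 233

n_male = male_died + male_survived
n_female = female_died + female_survived
n_survived = male_survived + female_survived
n_died = male_died + female_died

# Phi coefficient: Pearson's r specialised to two 0/1 variables
phi = (male_survived * female_died - male_died * female_survived) / math.sqrt(
    n_male * n_female * n_survived * n_died)
print(f'{phi:.2f}')
```

The result agrees with the value reported by `stats.pearsonr` to two decimal places.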
model.append('male')
model
['male']
An exploration is made into the correlation and significance of port of embarkation on surviving the shipwreck. We define a list of all ports.
ports = df['Embarked'].unique()
ports
array(['S', 'C', 'Q'], dtype=object)
# Now we calculate and print the survival rate of passengers vs where they boarded.
byport = np.zeros((len(ports),2))
for i in range(0,len(ports)):
    byport[i] = df[df['Embarked'] == ports[i]]['Survived'].value_counts().sort_index()
print('Survival rate of:')
for i in range(0,len(ports)):
    print(f'\t{ports[i]}: {byport[i][1]/np.sum(byport[i]):.2f}')
Survival rate of:
    S: 0.34
    C: 0.55
    Q: 0.39
# Visualizing these rates
labels = ['No Survive', 'Survive']
S = byport[0]
C = byport[1]
Q = byport[2]
x = np.arange(len(labels)) # the label locations
width = 0.2 # the width of the bars
fig, ax = plt.subplots(figsize=(10,8))
rects1 = ax.bar(x - width, S, width, label='Southampton')
rects2 = ax.bar(x , C, width, label='Cherbourg')
rects3 = ax.bar(x + width, Q, width, label='Queenstown')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Passengers')
ax.set_title('Survival by Port')
ax.set_xticks(x)
ax.set_ylim(0,500)
ax.set_xticklabels(labels)
ax.legend()
autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
plt.show()
# Now we do the same for the rate of surviving at each port.
labels = ['No Survive', 'Survive']
S = np.round(byport[0]/np.sum(byport[0]),2)
C = np.round(byport[1]/np.sum(byport[1]),2)
Q = np.round(byport[2]/np.sum(byport[2]),2)
x = np.arange(len(labels)) # the label locations
width = 0.2 # the width of the bars
fig, ax = plt.subplots(figsize=(10,8))
rects1 = ax.bar(x - width, S, width, label='Southampton')
rects2 = ax.bar(x , C, width, label='Cherbourg')
rects3 = ax.bar(x + width, Q, width, label='Queenstown')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('rate')
ax.set_title('Survival by Port')
ax.set_xticks(x)
ax.set_ylim(0,1)
ax.set_xticklabels(labels)
ax.legend()
autolabel(rects1)
autolabel(rects2)
autolabel(rects3)
plt.show()
These are great pictures, but do they mean anything significant? Now I calculate Pearson's r for the relationship between each port and surviving.
portdum = pd.get_dummies(df['Embarked'])
ports = np.array(portdum.columns)
X = np.array(portdum)
for i in range(0,3):
    # Note: this condition compares the correlation coefficient, not the p-value, to 0.05
    if stats.pearsonr(X[:,i],y)[0] < 0.05:
        print(f'Embarked {ports[i]}:')
        print(f'\tcorr {stats.pearsonr(X[:,i],y)[0]:.3e}')
        print(f'\tp-value {stats.pearsonr(X[:,i],y)[1]:.3e}')
        model.append(f'port_{ports[i]}')
        print()
Embarked Q:
    corr 3.650e-03
    p-value 9.134e-01

Embarked S:
    corr -1.497e-01
    p-value 7.223e-06
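Embarking at Southampton has a statistically significant negative correlation with survival (r = -0.150, p = 7.2e-06), while there is no statistically significant correlation between embarking at Queenstown and survival (p = 0.91). Cherbourg, the port with the highest survival rate (0.55), is not printed: the loop compares the correlation coefficient rather than the p-value to 0.05, so Cherbourg's positive coefficient of at least 0.05 is filtered out while Queenstown's near-zero coefficient passes. Our model should include whether the passenger embarked at Cherbourg or Southampton; note that the list as built contains port_Q and port_S instead.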
model
['male', 'port_Q', 'port_S']
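A minimal sketch of a selection step that thresholds the p-value (the second value returned by `stats.pearsonr`) rather than the coefficient. The data here is a deterministic toy set with invented survival counts per port, so the numbers are not the notebook's, and `selected` is a hypothetical stand-in for the `model` list.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Deterministic toy data: 100 passengers per port with made-up survival counts
port = np.repeat(['C', 'Q', 'S'], 100)
survived = np.concatenate([
    np.repeat([1, 0], [90, 10]),   # Cherbourg: 90 of 100 survive
    np.repeat([1, 0], [50, 50]),   # Queenstown: 50 of 100 survive
    np.repeat([1, 0], [10, 90]),   # Southampton: 10 of 100 survive
])

# One 0/1 indicator column per port, cast to float for pearsonr
dummies = pd.get_dummies(pd.Series(port)).astype(float)
selected = []
for name in dummies.columns:
    r, p = stats.pearsonr(dummies[name], survived)
    if p < 0.05:                   # keep a port only when its p-value is small
        selected.append(f'port_{name}')
    print(f'{name}: r={r:+.3f}, p={p:.3e}')

print(selected)
```

In this toy set the port whose survival rate equals the overall rate is dropped, while the ports with strongly positive and strongly negative correlations are both kept.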
Each cabin number is prefixed with a letter that indicates the deck on which the passenger's cabin was located. The cabin is not known for every passenger, so those without a known cabin (and deck) are placed in a group of their own.
df.loc[:,['Cabin','Survived']][0:5]
| | Cabin | Survived |
---|---|---|
0 | NaN | 0 |
1 | C85 | 1 |
2 | NaN | 1 |
3 | C123 | 1 |
4 | NaN | 0 |
# Replace NaN with '??' to indicate that the cabin number and deck for this passenger is unknown.
df['Cabin'].replace(np.nan,'??',inplace=True)
# A function to return which deck the passenger was on.
def get_deck(cabin):
    return str(cabin)[0]
df['deck'] = df['Cabin'].apply(get_deck)
decks = df['deck'].unique()
decks.sort()
print(decks)
df.loc[:,['deck','Survived']][0:5]
['?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'T']
| | deck | Survived |
---|---|---|
0 | ? | 0 |
1 | C | 1 |
2 | ? | 1 |
3 | C | 1 |
4 | ? | 0 |
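The fill-and-extract steps can also be done without a helper function by using pandas string indexing. A small sketch on a made-up cabin column (`cabins` is a hypothetical stand-in for `df['Cabin']` before the NaN replacement):

```python
import numpy as np
import pandas as pd

# Toy cabin column; NaN marks an unrecorded cabin
cabins = pd.Series(['C85', np.nan, 'E46', 'B28', np.nan])

# Fill unknown cabins with '?' and keep the first character as the deck letter
deck = cabins.fillna('?').str[0]
print(deck.tolist())
```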
Now we have a list of all decks. Let's look into how this variable affects survival.
# Initialize an array: one row per deck, columns are [did not survive, survived]
bydeck = np.zeros((len(decks),2))
# Loop through all decks and count the non-survivors and survivors on each.
for i in range(0,len(decks)):
    counts = df[df['deck'] == decks[i]]['Survived'].value_counts()
    # reindex fills in a zero when a deck has no survivors (or no deaths)
    bydeck[i] = counts.reindex([0,1], fill_value=0)
# The sum of each row is the number of passengers on that deck
print('Passengers on deck:')
for i in range(0,len(decks)):
    print(f'{decks[i]} {np.sum(bydeck[i]):.2f}')
Passengers on deck:
? 687.00
A 15.00
B 47.00
C 59.00
D 33.00
E 32.00
F 13.00
G 4.00
T 1.00
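The values printed above are passenger totals per deck. A per-deck survival rate could be tabulated directly with `pd.crosstab`. The sketch below uses a handful of invented deck/survival rows, so its rates are illustrative only.

```python
import pandas as pd

# Toy deck/survival pairs, invented for illustration
toy = pd.DataFrame({
    'deck':     ['?', '?', '?', '?', 'B', 'B', 'C', 'T'],
    'Survived': [0, 0, 0, 1, 1, 1, 0, 0],
})

# Rows: deck, columns: Survived (0/1); normalize='index' turns counts into row fractions
rate = pd.crosstab(toy['deck'], toy['Survived'], normalize='index')
print(rate)
```

Column 1 of the result is the survival rate for each deck, and decks with no survivors show 0 without any special-casing.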
The printout is informative, but not as revealing as a graph. Let's draw some axes.
labels = ['No Survive', 'Survive']
x = np.arange(0,2) # the label locations
width = 0.5 # the width of the bars
fig, ax = plt.subplots(figsize=(16,9))
for i in range(1,len(decks)):
    b = ax.bar(x + (i-5)*width/5, bydeck[i], width/5, label=decks[i])
    autolabel(b)
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Passengers')
ax.set_title('Survival by Deck')
ax.set_xticks(x)
#ax.set_ylim(0,505)
ax.set_xticklabels(labels)
ax.legend()
plt.show()
# A snippet to generate distinct colors for our bars.
import colorsys
N = 2*len(np.unique(decks))
print('colors: {}'.format(N))
rgb_list = []
HSV_tuples = []
for x in range(1,N+1):
    if (x%2 == 0):
        HSV_tuples.append(((x-1)*1.0/N, 1.0, 0.325))
    else:
        HSV_tuples.append((x*1.0/N, 1.0, 0.75))
RGB_tuples = map(lambda x: colorsys.hsv_to_rgb(*x), HSV_tuples)
for rgb in RGB_tuples:
    color = [int(255*rgb[0]),int(255*rgb[1]),int(255*rgb[2])]
    hxstr = '#'
    for c in color:
        hxstr = hxstr + '{0:0{1}X}'.format(c,2)
    rgb_list.append(hxstr)
colors: 18
Let's also visualize survival rate per deck.
x = np.arange(-2.0,2.0) # the label locations
width = 5.0 # the width of the bars
fig, ax = plt.subplots(figsize=(16,9))
rects = []
for i in range(0,len(decks)):
    b = plt.barh(i, np.round(bydeck[i,1]/np.sum(bydeck[i]),2), width/5, label=decks[i],color=rgb_list[2*i])
    nb = plt.barh(i, -np.round(bydeck[i,0]/np.sum(bydeck[i]),2), width/5, label=decks[i],color=rgb_list[2*i+1])
    plt.annotate(f'N = {str(int(np.sum(bydeck[i])))}',xy=[-0.20,i-0.1],c='white')
    rects.append(b)
ax.set_xlabel('\nDeath Survival\nRate')
ax.set_ylabel('Deck')
ax.set_title('Survival by Deck')
ax.set_yticks(range(0,9))
ax.set_xlim(-1.1,1.1)
ax.set_yticklabels(decks)
ax.grid(axis='x')
plt.show()
It looks like a known deck is associated with survival. In fact, I would say that if the deck is unknown there is a significant chance the passenger did not survive. To validate my claim, Pearson's r is calculated.
# Translate the categorical deck values into numeric types
deckdum = pd.get_dummies(df['deck'])
decks = np.array(deckdum.columns)
#Define input (y is still survivors)
X = np.array(deckdum)
# Loop through decks and check for correlation
print('Significance p < 0.05')
print('==========================')
for i in range(0,len(decks)):
    pr = stats.pearsonr(X[:,i],y)
    significant = ( pr[1] < 0.05)
    if significant: # if a significant correlation is found, do stuff
        print(f'Deck {decks[i]} ')
        print(f'\tcorr {pr[0]:.3f}')
        print(f'\tp-value {pr[1]:.3e}')
        print(f'\tSignificant: {significant}')
        model.append(f'deck_{decks[i]}')
        print('+------------------------+')
Significance p < 0.05
==========================
Deck ?
    corr -0.317
    p-value 3.091e-22
    Significant: True
+------------------------+
Deck B
    corr 0.175
    p-value 1.442e-07
    Significant: True
+------------------------+
Deck C
    corr 0.115
    p-value 6.062e-04
    Significant: True
+------------------------+
Deck D
    corr 0.151
    p-value 6.233e-06
    Significant: True
+------------------------+
Deck E
    corr 0.145
    p-value 1.332e-05
    Significant: True
+------------------------+
There is a statistically significant positive correlation between staying on Deck B, C, D, or E and survival. Conversely, if the deck of the passenger is unknown there is a stronger, negative correlation with survival. There is no significant correlation for the other decks: A, F, G, and T. The significant decks should be included in our model.
model
['male', 'port_Q', 'port_S', 'deck_?', 'deck_B', 'deck_C', 'deck_D', 'deck_E']
Since we filled in the missing age values, everyone on board now has an age value. Let's visualize the age distribution and investigate different age groups for correlation with surviving the disaster.
Let's make a histogram.
# Defining some parameters for our histogram
nbins = 20 # The number of bins in our histogram. More on this later.
ylim = df['Age'].max() # Maximum value for the histogram's x-axis
binwidth= ylim/nbins # The width of each bin
bins = np.arange(0,ylim,binwidth) # An array of evenly spaced bins "binwidth" apart from 0 to 80 years of age.
thedigital = np.digitize(df['Age'],bins,right=True) # assign each age to a bin
thecount = np.bincount(thedigital) # Count the members in each bin
# Create some axes
plt.figure(figsize=(8,6))
plt.bar(bins+binwidth/2,thecount[1:],width=binwidth*0.95)
plt.xticks(np.arange(0,88,8))
plt.title('Passenger Age Distribution')
plt.ylabel('Frequency')
plt.xlabel('Age (years)')
plt.savefig('./agedist.png',bbox_inches=