Author: Yury Kashnitsky. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.
Fill in the missing code ("Your code here") and choose answers in a web-form.
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
from matplotlib import pyplot as plt
# Graphics in SVG format are sharper and more legible
%config InlineBackend.figure_format = 'svg'
Read data into a Pandas DataFrame
data = pd.read_csv("../../data/titanic_train.csv", index_col="PassengerId")
First 5 rows
data.head(5)
data.describe()
Let's select the passengers who embarked in Cherbourg (Embarked == "C") and paid more than 200 pounds for their ticket (Fare > 200).
Make sure you understand how this construction actually works.
data[(data["Embarked"] == "C") & (data.Fare > 200)].head()
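The selection above works through boolean masks. A minimal sketch on a toy frame may help; the values below are invented for illustration and are not taken from the Titanic data, and the names `toy`, `from_cherbourg`, `expensive` and `mask` are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"Embarked": ["C", "S", "C", "Q"], "Fare": [250.0, 300.0, 50.0, 10.0]}
)

# Each comparison returns a boolean Series aligned with the DataFrame index
from_cherbourg = toy["Embarked"] == "C"
expensive = toy["Fare"] > 200

# & combines the two masks element-wise. When the comparisons are written
# inline, each one needs its own parentheses because & binds tighter
# than == and >.
mask = from_cherbourg & expensive

# Indexing with a boolean Series keeps only the rows where the mask is True
selected = toy[mask]
print(mask.tolist())
print(selected)
```

Only the first toy row satisfies both conditions, so `selected` holds a single row.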
We can sort these people by Fare in descending order.
data[(data["Embarked"] == "C") & (data["Fare"] > 200)].sort_values(
    by="Fare", ascending=False
).head()
Let's create a new feature.
def age_category(age):
    """
    < 30 -> 1
    >= 30, <55 -> 2
    >= 55 -> 3
    """
    if age < 30:
        return 1
    elif age < 55:
        return 2
    elif age >= 55:
        return 3
age_categories = [age_category(age) for age in data.Age]
data["Age_category"] = age_categories
Another way is to do it with `apply`.
data["Age_category"] = data["Age"].apply(age_category)
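A third option, not used in the original notebook, is `pd.cut`, which bins a numeric column in one vectorized call. The sketch below assumes the same thresholds as `age_category` and runs on a made-up Series named `ages`; missing ages stay missing.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration, including a missing value
ages = pd.Series([22.0, 30.0, 54.9, 55.0, np.nan])

# right=False makes the bins left-closed: [-inf, 30), [30, 55), [55, inf)
categories = pd.cut(
    ages,
    bins=[-np.inf, 30, 55, np.inf],
    right=False,
    labels=[1, 2, 3],
)
print(categories)
```

Note that the result has a categorical dtype rather than plain integers.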
1. How many men/women were there onboard?
(data["Sex"] == "male").sum(), (data["Sex"] == "female").sum()
Easier:
data["Sex"].value_counts()
2. Print the distribution of the `Pclass` feature. Then do the same for men and women separately. How many men from second class were there onboard?
pd.crosstab(data["Pclass"], data["Sex"], margins=True)
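A single cell of such a table can be read with `.loc`. The sketch below uses a toy frame with invented rows, so the counts are not the Titanic answer; `toy`, `table` and `men_second_class` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "Pclass": [1, 2, 2, 3, 3, 3],
        "Sex": ["male", "male", "female", "male", "male", "female"],
    }
)

# Rows are classes, columns are sexes; margins=True adds the "All" totals
table = pd.crosstab(toy["Pclass"], toy["Sex"], margins=True)

# Row label 2 (second class), column label "male"
men_second_class = table.loc[2, "male"]
print(table)
print(men_second_class)
```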
We can plot a picture as well, though it's not necessary here.
data["Pclass"].hist(label="all")
data[data["Sex"] == "male"]["Pclass"].hist(color="green", label="male")
data[data["Sex"] == "female"]["Pclass"].hist(color="yellow", label="female")
plt.title("Distribution by class and gender.")
plt.xlabel("Pclass")
plt.ylabel("Frequency")
plt.legend(loc="upper left");
3. What are the median and standard deviation of `Fare`? Round to two decimals.
print("Median fare: ", round(data["Fare"].median(), 2))
print("Fare std: ", round(data["Fare"].std(), 2))
4. Is it true that the mean age of passengers who survived is higher than that of passengers who died?
data[data["Survived"] == 1]["Age"].hist(
    color="green", label="Survived", alpha=0.5, density=True
)
data[data["Survived"] == 0]["Age"].hist(
    color="red", label="Died", alpha=0.5, density=True
)
plt.title("Age for survived and died")
plt.xlabel("Years")
plt.ylabel("Frequency")
plt.legend();
#!pip install seaborn
import seaborn as sns
sns.set()
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.boxplot(x=data["Survived"], y=data["Age"]);
The difference is hard to see by eye. Let's calculate it.
data.groupby("Survived")["Age"].mean()
5. Is it true that passengers younger than 30 survived more frequently than those older than 60? What are the shares of survivors among young and old passengers?
young_survived = data.loc[data["Age"] < 30, "Survived"]
old_survived = data.loc[data["Age"] > 60, "Survived"]
print(
    "Shares of survived people: \n\t among young {}%, \n\t among old {}%.".format(
        round(100 * young_survived.mean(), 1), round(100 * old_survived.mean(), 1)
    )
)
6. Is it true that women survived more frequently than men? What are the shares of survivors among men and women?
male_survived = data[data["Sex"] == "male"]["Survived"]
female_survived = data[data["Sex"] == "female"]["Survived"]
print(
    "Shares of survived people: \n\t among women {}%, \n\t among men {}%".format(
        round(100 * female_survived.mean(), 1), round(100 * male_survived.mean(), 1)
    )
)
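Because `Survived` is coded as 0/1, its mean within a group is the share of survivors in that group, so both shares can be obtained with one `groupby`. A minimal sketch on invented data (the frame `toy` and the resulting numbers are illustrative, not the Titanic answer):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "Sex": ["male", "male", "male", "male", "female", "female", "female"],
        "Survived": [0, 0, 1, 0, 1, 1, 0],
    }
)

# Mean of a 0/1 column per group = fraction of ones in that group
shares = toy.groupby("Sex")["Survived"].mean().mul(100).round(1)
print(shares)
```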
7. What's the most popular first name among male passengers?
data["Name"].head()
data.loc[1, "Name"].split(",")[1].split()[1]
first_names = data.loc[data["Sex"] == "male", "Name"].apply(
    lambda full_name: full_name.split(",")[1].split()[1]
)
first_names.value_counts().head()
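The chained `split` calls can be unpacked step by step. The sketch below assumes a name in the "Surname, Title. Given names" layout; the string is an example used for illustration, and the intermediate variable names are hypothetical.

```python
# Example name in the assumed "Surname, Title. Given names" layout
full_name = "Braund, Mr. Owen Harris"

# Everything after the comma: title plus given names (with a leading space)
after_comma = full_name.split(",")[1]

# split() with no argument splits on whitespace and drops empty pieces
tokens = after_comma.split()

# tokens[0] is the title, tokens[1] is the first given name
first_name = tokens[1]
print(after_comma, tokens, first_name, sep="\n")
```

Names that do not follow this layout would need separate handling.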
8. How does the average age of men/women depend on `Pclass`? Choose all correct statements:
for cl in data["Pclass"].unique():
    for sex in data["Sex"].unique():
        print(
            "Average age for {0} and class {1}: {2}".format(
                sex,
                cl,
                round(
                    data[(data["Sex"] == sex) & (data["Pclass"] == cl)]["Age"].mean(), 2
                ),
            )
        )
Nicer:
for (cl, sex), sub_df in data.groupby(["Pclass", "Sex"]):
    print(
        "Average age for {0} and class {1}: {2}".format(
            sex, cl, round(sub_df["Age"].mean(), 2)
        )
    )
And even nicer:
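# Passing the aggregation by name avoids the deprecation warning that
# recent pandas versions emit for aggfunc=np.mean
pd.crosstab(data["Pclass"], data["Sex"], values=data["Age"], aggfunc="mean")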
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.boxplot(x=data["Pclass"], y=data["Age"]);
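The table of mean ages by class and sex can also be built with `groupby(...).unstack()` or with `pivot_table`; these are alternatives, not the approach of the original notebook. A minimal sketch on invented data (the frame `toy` and its ages are made up):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2],
        "Sex": ["male", "female", "male", "male"],
        "Age": [40.0, 30.0, 20.0, 30.0],
    }
)

# Group by both keys, average, then move Sex from the index to the columns
by_group = toy.groupby(["Pclass", "Sex"])["Age"].mean().unstack()

# The same table expressed as a pivot table
pivot = toy.pivot_table(values="Age", index="Pclass", columns="Sex", aggfunc="mean")

print(by_group)
print(pivot)
```

Combinations with no rows, such as second-class women in this toy frame, show up as NaN.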