mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitsky. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Topic 1. Exploratory data analysis with Pandas

Practice. Analyzing "Titanic" passengers. Solution

Fill in the missing code ("Your code here") and choose answers in a web-form.

In [ ]:

import numpy as np
import pandas as pd

pd.set_option("display.precision", 2)
from matplotlib import pyplot as plt

# Graphics in SVG format are sharper and more legible
%config InlineBackend.figure_format = 'svg'

Read data into a Pandas DataFrame

In [ ]:

data = pd.read_csv("../../data/titanic_train.csv", index_col="PassengerId")

First 5 rows

In [ ]:

data.head(5)

In [ ]:

data.describe()

Let's select the passengers who embarked in Cherbourg (Embarked == "C") and paid more than 200 pounds for their ticket (Fare > 200).

Make sure you understand how this construction actually works.

In [ ]:

data[(data["Embarked"] == "C") & (data.Fare > 200)].head()
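The selection above can be unpacked step by step. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the real Titanic data): each comparison produces a boolean Series, `&` combines them element-wise, and indexing with the result keeps only the rows marked True.

```python
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame(
    {"Embarked": ["C", "S", "C", "Q"], "Fare": [250.0, 300.0, 50.0, 220.0]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Each comparison yields a boolean Series aligned with the DataFrame index.
from_cherbourg = toy["Embarked"] == "C"
expensive = toy["Fare"] > 200

# `&` combines the two masks element-wise. In the one-liner, the parentheses
# are required because `&` binds tighter than `==` and `>`.
mask = from_cherbourg & expensive

# Indexing with a boolean Series keeps only the rows where the mask is True.
selected = toy[mask]
print(mask.tolist())
print(selected)
```

Only the first toy passenger satisfies both conditions, so `selected` contains a single row.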

We can sort these people by Fare in descending order.

In [ ]:

data[(data["Embarked"] == "C") & (data["Fare"] > 200)].sort_values(
    by="Fare", ascending=False
).head()

Let's create a new feature.

In [ ]:

def age_category(age):
    """
    < 30 -> 1
    >= 30, <55 -> 2
    >= 55 -> 3
    """
    if age < 30:
        return 1
    elif age < 55:
        return 2
    elif age >= 55:
        return 3

In [ ]:

age_categories = [age_category(age) for age in data.Age]
data["Age_category"] = age_categories

Another way is to do it with apply.

In [ ]:

data["Age_category"] = data["Age"].apply(age_category)
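A vectorized alternative, which is not part of the original solution, is `pd.cut`. The sketch below assumes the same thresholds as `age_category` and uses a made-up `ages` Series. Note that the result has a categorical dtype and that missing ages stay missing.

```python
import numpy as np
import pandas as pd

# Made-up ages, including the boundary values and a missing one.
ages = pd.Series([4.0, 29.9, 30.0, 54.0, 55.0, np.nan], name="Age")

# right=False makes the bins left-closed: [-inf, 30), [30, 55), [55, inf),
# which matches the "< 30", ">= 30, < 55", ">= 55" rules of age_category.
binned = pd.cut(
    ages, bins=[-np.inf, 30, 55, np.inf], right=False, labels=[1, 2, 3]
)
print(binned.tolist())
```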

1. How many men/women were on board?

412 men and 479 women
314 men and 577 women
479 men and 412 women
577 men и 314 women [+]

In [ ]:

(data["Sex"] == "male").sum(), (data["Sex"] == "female").sum()

Easier:

In [ ]:

data["Sex"].value_counts()

2. Print the distribution of the Pclass feature. Then the same, but for men and women separately. How many men from the second class were on board?

104
108 [+]
112
125

In [ ]:

pd.crosstab(data["Pclass"], data["Sex"], margins=True)
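To read a single cell out of such a table programmatically, `.loc` can be used with the row and column labels. This is a small sketch on a made-up `toy` frame (an assumption for illustration); the same count is also obtained directly with a boolean mask.

```python
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "Pclass": [1, 2, 2, 3, 2],
        "Sex": ["male", "male", "female", "male", "male"],
    }
)

# Counts per (Pclass, Sex) pair; margins=True adds the "All" row and column.
ct = pd.crosstab(toy["Pclass"], toy["Sex"], margins=True)

# Read one cell: row label 2 (second class), column label "male".
men_second_class = ct.loc[2, "male"]

# The same number, counted with a boolean mask.
same_count = ((toy["Pclass"] == 2) & (toy["Sex"] == "male")).sum()
print(men_second_class, same_count)
```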

We can plot a picture as well, though it's not necessary here.

In [ ]:

data["Pclass"].hist(label="all")
data[data["Sex"] == "male"]["Pclass"].hist(color="green", label="male")
data[data["Sex"] == "female"]["Pclass"].hist(color="yellow", label="female")
plt.title("Distribution by class and gender.")
plt.xlabel("Pclass")
plt.ylabel("Frequency")
plt.legend(loc="upper left");

3. What are the median and standard deviation of Fare? Round to two decimals.

median is 14.45, standard deviation is 49.69 [+]
median is 15.1, standard deviation is 12.15
median is 13.15, standard deviation is 35.3
median is 17.43, standard deviation is 39.1

In [ ]:

print("Median fare: ", round(data["Fare"].median(), 2))
print("Fare std: ", round(data["Fare"].std(), 2))

4. Is it true that the mean age of survivors is higher than that of passengers who eventually died?

Yes
No [+]

In [ ]:

data[data["Survived"] == 1]["Age"].hist(
    color="green", label="Survived", alpha=0.5, density=True
)
data[data["Survived"] == 0]["Age"].hist(
    color="red", label="Died", alpha=0.5, density=True
)
plt.title("Age for survived and died")
plt.xlabel("Years")
plt.ylabel("Density")
plt.legend();

In [ ]:

#!pip install seaborn
import seaborn as sns

sns.set()

In [ ]:

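# Keyword arguments are used because recent seaborn versions
# no longer accept x and y positionally.
sns.boxplot(x=data["Survived"], y=data["Age"]);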

The difference is hard to judge by eye alone. Let's calculate.

In [ ]:

data.groupby("Survived")["Age"].mean()

5. Is it true that passengers younger than 30 y.o. survived more frequently than those older than 60 y.o.? What are the shares of survivors among young and old people?

22.7% among young and 40.6% among old
40.6% among young and 22.7% among old [+]
35.3% among young and 27.4% among old
27.4% among young and 35.3% among old

In [ ]:

young_survived = data.loc[data["Age"] < 30, "Survived"]
old_survived = data.loc[data["Age"] > 60, "Survived"]

print(
    "Shares of survived people: \n\t  among young {}%, \n\t  among old {}%.".format(
        round(100 * young_survived.mean(), 1), round(100 * old_survived.mean(), 1)
    )
)
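The calculation relies on the fact that the mean of a 0/1 column equals the share of ones. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration) shows this, and also that passengers with a missing Age fall into neither group, because comparisons with NaN evaluate to False.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame(
    {"Age": [10.0, 20.0, 70.0, 65.0, np.nan], "Survived": [1, 0, 1, 1, 0]}
)

# Select the Survived flags of each age group.
young = toy.loc[toy["Age"] < 30, "Survived"]
old = toy.loc[toy["Age"] > 60, "Survived"]

# The mean of a 0/1 column is the fraction of ones, i.e. the survival share.
young_share = young.mean()
old_share = old.mean()

# The row with a missing Age is in neither group.
print(len(young), len(old), young_share, old_share)
```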

6. Is it true that women survived more frequently than men? What are the shares of survivors among men and women?

30.2% among men and 46.2% among women
35.7% among men and 74.2% among women
21.1% among men and 46.2% among women
18.9% among men and 74.2% among women [+]

In [ ]:

male_survived = data[data["Sex"] == "male"]["Survived"]
female_survived = data[data["Sex"] == "female"]["Survived"]


print(
    "Shares of survived people: \n\t among women {}%, \n\t among men {}%".format(
        round(100 * female_survived.mean(), 1), round(100 * male_survived.mean(), 1)
    )
)

7. What's the most popular first name among male passengers?

Charles
Thomas
William [+]
John

In [ ]:

data["Name"].head()

In [ ]:

data.loc[1, "Name"].split(",")[1].split()[1]

In [ ]:

first_names = data.loc[data["Sex"] == "male", "Name"].apply(
    lambda full_name: full_name.split(",")[1].split()[1]
)
first_names.value_counts().head()
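The parsing step can be traced on a single string. The sketch below assumes that names follow the pattern "Surname, Title. First Middle"; the two sample names are used for illustration only. It also shows an equivalent chain of `.str` accessors instead of `apply`.

```python
import pandas as pd

# Illustrative names, assumed to follow "Surname, Title. First Middle".
names = pd.Series(["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"])

full_name = names.iloc[0]
after_comma = full_name.split(",")[1]  # " Mr. Owen Harris"
tokens = after_comma.split()           # ["Mr.", "Owen", "Harris"]
first = tokens[1]                      # token right after the title

# Equivalent vectorized version using the .str accessor.
vectorized = names.str.split(",").str[1].str.split().str[1]
print(first, vectorized.tolist())
```

Names whose title spans more than one word would not fit this pattern, so the result is best treated as an approximation of the first name.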

8. How is average age for men/women dependent on Pclass? Choose all correct statements:

On average, men of the 1st class are older than 40 [+]
On average, women of the 1st class are older than 40
Men of all classes are on average older than women of the same class [+]
On average, passengers of the 1st class are older than those of the 2nd class, who are older than passengers of the 3rd class [+]

In [ ]:

for cl in data["Pclass"].unique():
    for sex in data["Sex"].unique():
        print(
            "Average age for {0} and class {1}: {2}".format(
                sex,
                cl,
                round(
                    data[(data["Sex"] == sex) & (data["Pclass"] == cl)]["Age"].mean(), 2
                ),
            )
        )

Nicer:

In [ ]:

for (cl, sex), sub_df in data.groupby(["Pclass", "Sex"]):
    print(
        "Average age for {0} and class {1}: {2}".format(
            sex, cl, round(sub_df["Age"].mean(), 2)
        )
    )

And even nicer:

In [ ]:

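# The string "mean" avoids the deprecation warning that newer pandas
# versions raise when a NumPy function such as np.mean is passed.
pd.crosstab(data["Pclass"], data["Sex"], values=data["Age"], aggfunc="mean")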
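Another way to get the same kind of table, not used in the original solution, is `groupby` followed by `unstack`. The sketch below runs both approaches on a made-up `toy` frame (an assumption for illustration) so that the results can be compared.

```python
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 2, 2],
        "Sex": ["male", "male", "female", "male", "female"],
        "Age": [40.0, 50.0, 30.0, 20.0, 10.0],
    }
)

# Mean age per (Pclass, Sex); unstack() moves Sex from the index to columns.
by_group = toy.groupby(["Pclass", "Sex"])["Age"].mean().unstack()

# The crosstab form of the same aggregation.
ct = pd.crosstab(toy["Pclass"], toy["Sex"], values=toy["Age"], aggfunc="mean")

print(by_group)
print(ct)
```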

In [ ]:

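# Keyword arguments are used because recent seaborn versions
# no longer accept x and y positionally.
sns.boxplot(x=data["Pclass"], y=data["Age"]);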

Useful resources

The same notebook as an interactive web-based Kaggle Kernel
Topic 1 "Exploratory Data Analysis with Pandas" as a Kaggle Kernel
Main course site, course repo, and YouTube channel