In [1]:

import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Read and load data

In [2]:

titanic = pd.read_csv("train.csv")
titanic.head()

Out[2]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Column description

PassengerId -- A numerical id assigned to each passenger.
Survived -- Whether the passenger survived (1), or didn't (0).
Pclass -- The class the passenger was in.
Name -- The name of the passenger.
Sex -- The gender of the passenger -- male or female.
Age -- The age of the passenger in years. Can be fractional.
SibSp -- The number of siblings and spouses the passenger had on board.
Parch -- The number of parents and children the passenger had on board.
Ticket -- The ticket number of the passenger.
Fare -- How much the passenger paid for the ticket.
Cabin -- Which cabin the passenger was in.
Embarked -- Where the passenger boarded the Titanic.

In [3]:

titanic.columns

Out[3]:

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Remove irrelevant columns

In [4]:

# Drop the Name and Ticket columns, which are not used in the plots below
titanic.drop(["Name", "Ticket"], axis=1, inplace=True)

In [5]:

titanic.columns

Out[5]:

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Remove missing values

In [6]:

# Count the missing values in each column
titanic.isnull().sum()

Out[6]:

PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [7]:

# Drop every row that has a missing value in any column
titanic = titanic.dropna()
# Confirm that no missing values remain
titanic.isnull().sum()

Out[7]:

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Cabin          0
Embarked       0
dtype: int64
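
Note that dropna() with no arguments discards a row if any of its columns is missing. Given the counts in Out[6], where Cabin is missing in 687 rows, this removes a large share of the data. The sketch below uses a small made-up frame (the values are illustrative, not rows read from train.csv) to show how the subset parameter restricts the check to chosen columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: illustrative values only, not read from train.csv.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare":  [7.25, 71.28, 7.93, 53.10],
})

# Default behavior: a row is dropped if ANY column is missing.
all_complete = toy.dropna()

# Only require Age to be present, so rows lacking just Cabin are kept.
age_complete = toy.dropna(subset=["Age"])

print(len(toy), len(all_complete), len(age_complete))
```

Whether to keep the rows that lack only Cabin depends on the analysis; the plots in this notebook use Age, Fare, Survived, Pclass, and Sex, none of which need Cabin.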

Histogram

To get familiar with seaborn, we'll start by creating a histogram.

Under the hood, seaborn creates a histogram using matplotlib, scales the axes values, and styles it. In addition, seaborn uses a technique called kernel density estimation, or KDE for short, to create a smoothed line chart over the histogram. If you're interested in learning about how KDE works, you can read more on Wikipedia.

What you need to know for now is that the resulting line is a smoother version of the histogram, called a kernel density plot. Kernel density plots are especially helpful when we're comparing distributions, which we'll explore later in this notebook. When viewing a histogram, our visual processing systems influence us to smooth out the bars into a continuous line.
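
For intuition, a kernel density estimate can be sketched by hand: place one small Gaussian bump on each data point and average the bumps. The code below is a minimal illustration with numpy, not seaborn's implementation. The function name gaussian_kde_sketch and the bandwidth of 5.0 are assumptions made for this example (seaborn chooses a bandwidth automatically), and the ages are the five values shown in the head() output above.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Scaled distance from every grid point to every sample:
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])
grid = np.linspace(-20, 80, 1001)
density = gaussian_kde_sketch(ages, grid, bandwidth=5.0)

# A density curve should enclose an area of about 1.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths more aggressively.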

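We can generate a histogram of the Fare column, with the KDE curve drawn over it, using the seaborn.histplot() function with kde set to True. (Older tutorials use seaborn.distplot() for this; it has been deprecated since seaborn 0.11.)

In [8]:

import seaborn as sns
import matplotlib.pyplot as plt

# Draw a histogram of the Fare column with a KDE curve overlaid.
# stat="density" scales the bars so their total area is 1.
sns.histplot(titanic["Fare"], kde=True, stat="density")
plt.show()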

In [9]:

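# Histogram of the Age column with a KDE curve overlaid
# (histplot replaces distplot, which is deprecated since seaborn 0.11)
sns.histplot(titanic["Age"], kde=True, stat="density")
plt.show()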

While having both the histogram and the kernel density plot is useful when we want to explore the data, it can be overwhelming for someone who's trying to understand the distribution. To generate just the kernel density plot, we use the seaborn.kdeplot() function:

In [10]:

# Generate a kernel density plot
sns.kdeplot(titanic["Age"]);
plt.show()

While the distribution of data is displayed in a smoother fashion, it's now more difficult to visually estimate the area under the curve using just the line chart. When we also had the histogram, the bars provided a way to understand and compare proportions visually.

To bring back some of the ability to easily compare proportions, we can shade the area under the line using a single color. When calling the seaborn.kdeplot() function, we can shade the area under the line by setting the fill parameter to True. (Older seaborn versions called this parameter shade, which is now deprecated.)

In [11]:

sns.kdeplot(titanic["Age"], fill=True)
plt.xlabel("Age")
plt.show()

Modifying appearance of the plot

The default seaborn style sheet gets some things right, like hiding axis ticks, and some things wrong, like displaying the coordinate grid and keeping all of the axis spines. We can use the seaborn.set_style() function to change the default seaborn style sheet. Seaborn comes with a few style sheets:

darkgrid: Coordinate grid displayed, dark background color
whitegrid: Coordinate grid displayed, white background color
dark: Coordinate grid hidden, dark background color
white: Coordinate grid hidden, white background color
ticks: Coordinate grid hidden, white background color, ticks visible

The default seaborn theme uses the "darkgrid" style, which we can also select explicitly:

sns.set_style("darkgrid")

If we change the style sheet using this method, all future plots will match that style in your current session. This means you need to set the style before generating the plot.
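
To see what a style sheet actually changes, or to apply one to a single plot instead of the whole session, seaborn also provides the axes_style() function. It returns the style's settings as a dictionary and can be used as a context manager. A short sketch, assuming seaborn is installed:

```python
import seaborn as sns

# Look up the settings of two style sheets without changing global state.
white = sns.axes_style("white")
darkgrid = sns.axes_style("darkgrid")

# "axes.grid" controls whether the coordinate grid is drawn.
print(white["axes.grid"], darkgrid["axes.grid"])

# Apply a style temporarily: only plots created inside the block use it.
with sns.axes_style("white"):
    pass  # create the plot here
```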

To remove the axis spines for the top and right axes, we use the seaborn.despine() function:

sns.despine()

By default, only the top and right axes will be despined, or have their spines removed. To despine the other two axes, we need to set the left and bottom parameters to True.
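
For intuition, removing a spine amounts to hiding it on the matplotlib Axes object. The sketch below does this by hand with plain matplotlib; it illustrates the effect and is not meant as a description of seaborn's internals. The plotted numbers are arbitrary.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])  # arbitrary data, just to have something drawn

# Same visual effect as the default sns.despine(): hide top and right.
for side in ["top", "right"]:
    ax.spines[side].set_visible(False)

# Record which spines are still visible.
visible = {side: spine.get_visible() for side, spine in ax.spines.items()}
print(visible)
```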

In [12]:

# Use the "white" style sheet, which hides the coordinate grid and sets a white background.
sns.set_style("white")
sns.kdeplot(titanic["Age"], fill=True)
# Despine all of the axes.
sns.despine(left=True, bottom=True)
plt.xlabel("Age")
plt.show()

Multiple plot (kernel density plot) age vs survival

In seaborn, we can create a small multiple by specifying the conditioning criteria and the type of data visualization we want. For example, we can visualize the differences in age distributions between passengers who survived and those who didn't by creating a pair of kernel density plots. One kernel density plot would visualize the distribution of values in the "Age" column where Survived equalled 0 and the other would visualize the distribution of values in the "Age" column where Survived equalled 1.

Here's what those plots look like:

In [13]:

# Condition on unique values of the "Survived" column.
# (The size parameter was renamed to height in seaborn 0.9.)
g = sns.FacetGrid(titanic, col="Survived", height=6)

# For each subset of values, generate a kernel density plot of the "Age" column.
g.map(sns.kdeplot, "Age", fill=True)

The function that's passed into FacetGrid.map() has to be a valid matplotlib or seaborn function. For example, we can map matplotlib histograms to the grid:

In [14]:

g = sns.FacetGrid(titanic, col="Survived", height=6)
g.map(plt.hist, "Age")
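
Conceptually, each panel in the grid is drawn from the subset of rows that share one value of the conditioning column. The sketch below reproduces only that splitting step, using pandas groupby and numpy histograms on a small made-up frame. The toy values and the bin edges are assumptions for illustration, and this is not how FacetGrid is implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: illustrative values only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})
bins = [0, 20, 40, 60, 80]  # assumed bin edges

# One histogram per unique value of the conditioning column.
panels = {}
for survived, subset in toy.groupby("Survived"):
    counts, _ = np.histogram(subset["Age"], bins=bins)
    panels[survived] = counts.tolist()

print(panels)
```

Each entry of panels corresponds to what one subplot of the grid would display.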

Let's create a grid of plots that displays the age distributions for each class.

In [15]:

# Condition on unique values of the "Pclass" column.
g = sns.FacetGrid(titanic, col="Pclass", height=6)

# For each subset of values, generate a kernel density plot of the "Age" column.
g.map(sns.kdeplot, "Age", fill=True)

# Remove all the spines
sns.despine(left=True, bottom=True)

plt.show()

Creating conditional plots using three conditions

When subsetting data using two conditions, the rows in the grid represent one condition while the columns represent another (the row and col parameters of FacetGrid). We can express a third condition by drawing multiple plots on the same subplot in the grid and coloring them differently.

Thankfully, we can add a condition just by setting the hue parameter to the column name from the dataframe.

Let's condition the grid on Survived (columns) and Pclass (rows), add Sex as the hue, and see what this grid of plots looks like.

In [16]:

g = sns.FacetGrid(titanic, col="Survived", row="Pclass", hue="Sex", height=3)
g.map(sns.kdeplot, "Age", fill=True)
# Add a legend so the hue colors can be matched to each sex
g.add_legend()
sns.despine(left=True, bottom=True)
plt.show()
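
With three conditions, the data is split into many small groups, and a density curve drawn from only a handful of rows can be misleading. One way to check is to count the rows behind each curve before plotting. The sketch below does this with groupby on a made-up frame (the values are illustrative, not the Titanic data); the same call on the real dataframe would use the same three column names.

```python
import pandas as pd

# Hypothetical toy data: illustrative values only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
    "Sex":      ["female", "female", "male", "male", "male", "female"],
})

# Number of rows behind each (row, column, hue) combination in the grid.
cell_counts = toy.groupby(["Pclass", "Survived", "Sex"]).size()
print(cell_counts)
```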
