In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

Read and load data

In [2]:
titanic = pd.read_csv("train.csv")
titanic.head()
Out[2]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Column description

  • PassengerId -- A numerical ID assigned to each passenger.
  • Survived -- Whether the passenger survived (1) or didn't (0).
  • Pclass -- The ticket class the passenger traveled in (1, 2, or 3).
  • Name -- The name of the passenger.
  • Sex -- The sex of the passenger, male or female.
  • Age -- The age of the passenger in years; can be fractional.
  • SibSp -- The number of siblings and spouses the passenger had on board.
  • Parch -- The number of parents and children the passenger had on board.
  • Ticket -- The ticket number of the passenger.
  • Fare -- How much the passenger paid for the ticket.
  • Cabin -- Which cabin the passenger was in.
  • Embarked -- The port where the passenger boarded the Titanic (C = Cherbourg, Q = Queenstown, S = Southampton).
In [3]:
titanic.columns
Out[3]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Remove irrelevant columns

In [4]:
# Drop the Name and Ticket columns
titanic.drop(["Name", "Ticket"], axis = 1, inplace = True)
In [5]:
titanic.columns
Out[5]:
Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Remove missing values

In [6]:
# Count the missing values in each column
titanic.isnull().sum()
Out[6]:
PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Cabin          687
Embarked         2
dtype: int64
In [7]:
# Drop every row that has at least one missing value
titanic = titanic.dropna()
titanic.isnull().sum()
Out[7]:
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Cabin          0
Embarked       0
dtype: int64
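
Because dropna() with no arguments discards a row if any of its columns is missing, a sparse column such as Cabin largely decides how many rows remain. The sketch below illustrates this on a small made-up dataframe; the values and the names toy, all_complete, and age_fare_complete are illustrative assumptions, not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the pattern above: Cabin is mostly missing,
# Age is missing only occasionally.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# No arguments: drop every row with a missing value in any column.
all_complete = toy.dropna()

# subset=: only look at the columns we actually intend to plot.
age_fare_complete = toy.dropna(subset=["Age", "Fare"])

print(len(toy), len(all_complete), len(age_fare_complete))
```

If only Age and Fare are going to be plotted, restricting the check with subset is one possible way to keep more rows; whether that is appropriate depends on the analysis.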

Histogram

To get familiar with seaborn, we'll start by creating the familiar histogram.

Under the hood, seaborn creates a histogram using matplotlib, scales the axis values, and styles it. In addition, seaborn uses a technique called kernel density estimation, or KDE for short, to create a smoothed line chart over the histogram. If you're interested in learning about how KDE works, you can read more on Wikipedia.

What you need to know for now is that the resulting line is a smoother version of the histogram, called a kernel density plot. Kernel density plots are especially helpful when we're comparing distributions, which we'll explore later in this notebook. When we look at a histogram, we tend to mentally smooth the bars into a continuous line anyway, and the kernel density plot draws that line for us.
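
To make the idea of KDE more concrete, here is a minimal pure-Python sketch, assuming a Gaussian kernel and a hand-picked bandwidth. The function name gaussian_kde_sketch, the sample ages, and the bandwidth of 4.0 are all illustrative assumptions; seaborn chooses a bandwidth automatically and its implementation differs in detail.

```python
import math


def gaussian_kde_sketch(samples, bandwidth):
    """Return a function that estimates the density of `samples`.

    Each sample contributes one Gaussian bump of width `bandwidth`;
    the estimate at x is the average height of those bumps at x.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up ages, for illustration only.
ages = [22.0, 24.0, 26.0, 35.0, 35.0, 38.0, 54.0]
density = gaussian_kde_sketch(ages, bandwidth=4.0)

# Evaluate the curve on a grid from -20 to 100 and approximate the
# area under it with a Riemann sum; it should come out close to 1.
step = 0.1
grid = [i * step for i in range(-200, 1001)]
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows the individual samples more closely.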

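We can generate a histogram of the Fare column, with the kernel density curve drawn on top, using the seaborn.histplot() function with kde set to True:

In [8]:
import seaborn as sns
import matplotlib.pyplot as plt

# Draw a density-scaled histogram of the Fare column with a KDE curve on top
sns.histplot(titanic["Fare"], kde=True, stat="density")
plt.show()
In [9]:
# Same plot for the Age column
sns.histplot(titanic["Age"], kde=True, stat="density")
plt.show()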

While having both the histogram and the kernel density plot is useful when we want to explore the data, it can be overwhelming for someone who's trying to understand the distribution. To generate just the kernel density plot, we use the seaborn.kdeplot() function:

In [10]:
# Generate a kernel density plot
sns.kdeplot(titanic["Age"]);
plt.show()

While the distribution of data is displayed in a smoother fashion, it's now more difficult to visually estimate the area under the curve using just the line chart. When we also had the histogram, the bars provided a way to understand and compare proportions visually.

To bring back some of the ability to easily compare proportions, we can shade the area under the line using a single color. When calling the seaborn.kdeplot() function, we can shade the area under the line by setting the fill parameter to True (older seaborn releases called this parameter shade).

In [11]:
sns.kdeplot(titanic["Age"], fill=True)
plt.xlabel("Age")
plt.show()

Modifying appearance of the plot

The default seaborn style sheet gets some things right, like hiding axis ticks, and some things wrong, like displaying the coordinate grid and keeping all of the axis spines. We can use the seaborn.set_style() function to change the default seaborn style sheet. Seaborn comes with a few style sheets:

  • darkgrid: Coordinate grid displayed, dark background color
  • whitegrid: Coordinate grid displayed, white background color
  • dark: Coordinate grid hidden, dark background color
  • white: Coordinate grid hidden, white background color
  • ticks: Coordinate grid hidden, white background color, ticks visible

Seaborn's default theme uses the "darkgrid" style, which we can also select explicitly:

  • sns.set_style("darkgrid")

Once we change the style sheet this way, every plot we create afterwards in the current session uses that style. This means we need to set the style before generating the plot.
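
When a style should apply to a single plot rather than the whole session, seaborn's axes_style() can be used as a context manager. The following is a minimal sketch under that assumption; the Agg backend and the variable names grid_inside and grid_after are only there so the example runs without a display and can be inspected.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

# Session-wide: every later plot uses the "darkgrid" style.
sns.set_style("darkgrid")

# Temporary override: only plots created inside the block use "white".
with sns.axes_style("white"):
    grid_inside = plt.rcParams["axes.grid"]
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 0])

# Outside the block the session-wide style is back in effect.
grid_after = plt.rcParams["axes.grid"]
print(grid_inside, grid_after)
```

The two printed flags show whether the coordinate grid is enabled inside and after the block.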

To remove the axis spines for the top and right axes, we use the seaborn.despine() function:

  • sns.despine()

By default, only the top and right axes are despined, that is, have their spines removed. To despine the other two axes as well, we set the left and bottom parameters to True.

In [12]:
# Use the "white" style sheet: coordinate grid hidden, white background.
sns.set_style("white")
sns.kdeplot(titanic["Age"], fill=True)
# Despine all four axes.
sns.despine(left=True, bottom=True)
plt.xlabel("Age")
plt.show()
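
For readers who want to see what despining amounts to, the effect can be approximated in plain matplotlib by hiding each spine on the axes object. This is a rough sketch of the idea, not seaborn's actual implementation; the plotted values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 0])

# Hide all four spines, similar in effect to
# sns.despine(left=True, bottom=True).
for side in ("top", "right", "left", "bottom"):
    ax.spines[side].set_visible(False)

# Collect the visibility of every spine to confirm they are all hidden.
visible = {side: ax.spines[side].get_visible() for side in ax.spines}
print(visible)
```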

Small multiples: kernel density plots of age by survival

In seaborn, we can create a small multiple by specifying the conditioning criteria and the type of data visualization we want. For example, we can visualize the differences in age distributions between passengers who survived and those who didn't by creating a pair of kernel density plots. One kernel density plot would visualize the distribution of values in the "Age" column where Survived equalled 0 and the other would visualize the distribution of values in the "Age" column where Survived equalled 1.

Here's what those plots look like:

In [13]:
# Condition on the unique values of the "Survived" column.
g = sns.FacetGrid(titanic, col="Survived", height=6)

# For each subset, draw a kernel density plot of the "Age" column.
g.map(sns.kdeplot, "Age", fill=True);

The function that's passed into FacetGrid.map() has to be a valid matplotlib or seaborn function. For example, we can map matplotlib histograms to the grid:

In [14]:
g = sns.FacetGrid(titanic, col="Survived", height=6)
g.map(plt.hist, "Age");
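
Conceptually, the grid splits the dataframe by the unique values of the conditioning column and draws one subplot per subset. A rough equivalent in pandas and matplotlib is sketched below on a small made-up dataframe; it is only meant to show the idea, and FacetGrid handles sizing, titles, and shared axes in its own way.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [22.0, 35.0, 54.0, 26.0, 35.0, 38.0],
})

# One (value, subset) pair per unique value of the conditioning column.
groups = list(toy.groupby("Survived"))
fig, axes = plt.subplots(1, len(groups), figsize=(12, 6), sharey=True)

# Draw the same kind of plot on each subplot, one subset at a time.
for ax, (value, subset) in zip(axes, groups):
    ax.hist(subset["Age"])
    ax.set_title(f"Survived = {value}")
    ax.set_xlabel("Age")
```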

Let's create a grid of plots that displays the age distributions for each class.

In [15]:
# Condition on the unique values of the "Pclass" column.
g = sns.FacetGrid(titanic, col="Pclass", height=6)

# For each subset, draw a kernel density plot of the "Age" column.
g.map(sns.kdeplot, "Age", fill=True)

# Remove all the spines
sns.despine(left=True, bottom=True)

plt.show()

Creating conditional plots using three conditions

When subsetting data using two conditions, the rows in the grid represent one condition while the columns represent another. We can express a third condition by drawing multiple plots on the same subplot in the grid and coloring them differently.

Thankfully, we can add a condition just by setting the hue parameter to the column name from the dataframe.

Let's condition on Survived for the columns and Pclass for the rows, add Sex as the hue, and see what this grid of plots looks like.

In [16]:
g = sns.FacetGrid(titanic, col="Survived", row="Pclass", hue="Sex", height=3)
g.map(sns.kdeplot, "Age", fill=True)
sns.despine(left=True, bottom=True)
plt.show()