#!/usr/bin/env python
# coding: utf-8

# # Visualization with Seaborn

# Matplotlib has been at the core of scientific visualization in Python for decades, but even avid users will admit it often leaves much to be desired.
# There are several complaints about Matplotlib that often come up:
# 
# - A common early complaint, which is now outdated: prior to version 2.0, Matplotlib's color and style defaults were at times poor and looked dated.
# - Matplotlib's API is relatively low-level. Doing sophisticated statistical visualization is possible, but often requires a *lot* of boilerplate code.
# - Matplotlib predated Pandas by several years, and thus is not designed for use with Pandas `DataFrame` objects. In order to visualize data from a `DataFrame`, you must extract each `Series` and often concatenate them together into the right format. It would be nicer to have a plotting library that can intelligently use the `DataFrame` labels in a plot.
# 
# An answer to these problems is [Seaborn](http://seaborn.pydata.org/). Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas.
# 
# To be fair, the Matplotlib team has adapted to the changing landscape: it added the `plt.style` tools discussed in [Customizing Matplotlib: Configurations and Style Sheets](04.11-Settings-and-Stylesheets.ipynb), and Matplotlib is starting to handle Pandas data more seamlessly.
# But for all the reasons just discussed, Seaborn remains a useful add-on.
# 
# By convention, Seaborn is often imported as `sns`:

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

sns.set_theme()  # apply Seaborn's default chart style (sns.set() is the older alias)


# ## Exploring Seaborn Plots
# 
# The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.
# 
# Let's take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following *could* be done using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood), but the Seaborn API is much more convenient.

# ### Histograms, KDE, and Densities
# 
# Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables.
# We have seen that this is relatively straightforward in Matplotlib (see the following figure):

# In[2]:


data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])

for col in 'xy':
    plt.hist(data[col], density=True, alpha=0.5)


# Rather than just providing a histogram as a visual output, we can get a smooth estimate of the distribution using kernel density estimation (introduced in [Density and Contour Plots](04.04-Density-and-Contour-Plots.ipynb)), which Seaborn does with ``sns.kdeplot`` (see the following figure):

# In[3]:


sns.kdeplot(data=data, fill=True);  # `fill` replaces the deprecated `shade` argument
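

# To make "kernel density estimation" concrete: a Gaussian KDE places one small Gaussian bump on each sample and averages them. The following is a minimal NumPy sketch of that idea on synthetic data, not Seaborn's internal implementation; the names `kde_samples`, `kde_bandwidth`, and `kde_xs` are illustrative, and the bandwidth is fixed by hand here, whereas `sns.kdeplot` picks one from the data by default:

# In[ ]:


import numpy as np

rng = np.random.default_rng(0)
kde_samples = rng.normal(size=200)      # synthetic one-dimensional data
kde_bandwidth = 0.4                     # assumed, hand-picked kernel width
kde_xs = np.linspace(-6, 6, 601)        # points at which to evaluate the estimate

# One Gaussian kernel per sample (columns), evaluated at every grid point (rows)
z = (kde_xs[:, np.newaxis] - kde_samples[np.newaxis, :]) / kde_bandwidth
kernels = np.exp(-0.5 * z ** 2) / (kde_bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the kernels
kde_density = kernels.mean(axis=1)

# A density should integrate to roughly 1 over the grid
kde_area = kde_density.sum() * (kde_xs[1] - kde_xs[0])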


# If we pass `x` and `y` columns to `kdeplot`, we instead get a two-dimensional visualization of the joint density (see the following figure):

# In[4]:


sns.kdeplot(data=data, x='x', y='y');


# We can see the joint distribution and the marginal distributions together using `sns.jointplot`, which we'll explore further later in this chapter.

# ### Pair Plots
# 
# When you generalize joint plots to datasets of larger dimensions, you end up with *pair plots*. These are very useful for exploring correlations between multidimensional data, when you'd like to plot all pairs of values against each other.
# 
# We'll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three Iris species:

# In[5]:


iris = sns.load_dataset("iris")
iris.head()


# Visualizing the multidimensional relationships among the samples is as easy as calling ``sns.pairplot`` (see the following figure):

# In[6]:


sns.pairplot(iris, hue='species', height=2.5);
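

# The layout of the grid follows directly from the column names: every numeric column is paired with every other. As a rough illustration using only the standard library (the column names are those of the Iris `DataFrame` loaded above; `iris_columns` and `panel_pairs` are names made up for this sketch):

# In[ ]:


from itertools import product

iris_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# One panel per (row variable, column variable) combination
panel_pairs = list(product(iris_columns, repeat=2))

# Panels on the diagonal pair a variable with itself; pairplot draws
# a univariate distribution there instead of a scatter plot
diagonal = [(r, c) for r, c in panel_pairs if r == c]
off_diagonal = [(r, c) for r, c in panel_pairs if r != c]

len(panel_pairs), len(diagonal), len(off_diagonal)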


# ### Faceted Histograms
# 
# Sometimes the best way to view data is via histograms of subsets, as shown in the following figure. Seaborn's `FacetGrid` makes this simple.
# We'll take a look at some data that shows the amount that restaurant staff receive in tips based on various indicator data:[^1]
# 
# [^1]: The restaurant staff data used in this section divides employees into two sexes: female and male. Biological sex
# isn’t binary, but the following discussion and visualizations are limited by this data.

# In[7]:


tips = sns.load_dataset('tips')
tips.head()


# In[8]:


tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));


# The faceted chart gives us some quick insights into the dataset: for example, we see that it contains far more data on male servers during the dinner hour than other categories, and typical tip amounts appear to range from approximately 10% to 20%, with some outliers on either end.
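# 
# Each panel of the `FacetGrid` receives the rows for one combination of the `row` and `col` variables. A comparable split can be sketched with `groupby`; the tiny `DataFrame` below is made-up stand-in data (`toy_tips` is a hypothetical name), not the tips dataset itself:

# In[ ]:


import pandas as pd

toy_tips = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner', 'Dinner'],
    'tip_pct': [15.0, 18.5, 12.0, 20.0, 16.5, 14.0],
})

# One group per (row, col) combination, like one panel per facet;
# the group sizes show how much data each panel would contain
panel_sizes = toy_tips.groupby(['sex', 'time']).size()
panel_sizes

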
# 
# ### Categorical Plots
# 
# Categorical plots can be useful for this kind of visualization as well. These allow you to view the distribution of a parameter within bins defined by any other parameter, as shown in the following figure:

# In[9]:


with sns.axes_style(style='ticks'):
    g = sns.catplot(x="day", y="total_bill", hue="sex", data=tips, kind="box")
    g.set_axis_labels("Day", "Total Bill");


# ### Joint Distributions
# 
# Similar to the pair plot we saw earlier, we can use `sns.jointplot` to show the joint distribution of two variables, along with the associated marginal distributions (see the following figure):

# In[10]:


with sns.axes_style('white'):
    sns.jointplot(x="total_bill", y="tip", data=tips, kind='hex')


# The joint plot can even do some automatic kernel density estimation and regression, as shown in the following figure:

# In[11]:


sns.jointplot(x="total_bill", y="tip", data=tips, kind='reg');


# ### Bar Plots
# 
# Time series can be plotted using `sns.catplot`. In the following example, we'll use the Planets dataset that we first saw in [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb); see the following figure for the result:

# In[12]:


planets = sns.load_dataset('planets')
planets.head()


# In[13]:


with sns.axes_style('white'):
    g = sns.catplot(x="year", data=planets, aspect=2,
                    kind="count", color='steelblue')
    g.set_xticklabels(step=5)


# We can learn more by looking at the *method* of discovery of each of these planets (see the following figure):

# In[14]:


with sns.axes_style('white'):
    g = sns.catplot(x="year", data=planets, aspect=4.0, kind='count',
                    hue='method', order=range(2001, 2015))
    g.set_ylabels('Number of Planets Discovered')


# For more information on plotting with Seaborn, see the [Seaborn documentation](http://seaborn.pydata.org/), and particularly the [example gallery](https://seaborn.pydata.org/examples/index.html).

# ## Example: Exploring Marathon Finishing Times
# 
# Here we'll look at using Seaborn to help visualize and understand finishing results from a marathon.
# I've scraped the data from sources on the web, aggregated it and removed any identifying information, and put it on GitHub, where it can be downloaded
# (if you are interested in using Python for web scraping, I would recommend [*Web Scraping with Python*](http://shop.oreilly.com/product/0636920034391.do) by Ryan Mitchell, also from O'Reilly).
# We will start by downloading the data and loading it into Pandas:[^2]
# 
# [^2]: The marathon data used in this section divides runners into two genders: men and women. While gender is a
# spectrum, the following discussion and visualizations use this binary because they depend on the data.

# In[15]:


# url = ('https://raw.githubusercontent.com/jakevdp/'
#        'marathon-data/master/marathon-data.csv')
# !cd data && curl -O {url}


# In[16]:


data = pd.read_csv('data/marathon-data.csv')
data.head()


# Notice that Pandas loaded the time columns as Python strings (type `object`); we can see this by looking at the `dtypes` attribute of the `DataFrame`:

# In[17]:


data.dtypes


# Let's fix this by providing a converter for the times:

# In[18]:


import datetime

def convert_time(s):
    h, m, s = map(int, s.split(':'))
    return datetime.timedelta(hours=h, minutes=m, seconds=s)

data = pd.read_csv('data/marathon-data.csv',
                   converters={'split':convert_time, 'final':convert_time})
data.head()


# In[19]:


data.dtypes


# That will make it easier to manipulate the temporal data. For the purpose of our Seaborn plotting utilities, let's next add columns that give the times in seconds:

# In[20]:


# .dt.total_seconds() converts each timedelta to a float number of seconds
data['split_sec'] = data['split'].dt.total_seconds()
data['final_sec'] = data['final'].dt.total_seconds()
data.head()
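

# As an aside, Pandas can also parse `H:MM:SS` strings like these directly with `pd.to_timedelta`, without a custom converter. A minimal sketch on a few made-up values (the `sample_times` name and its contents are illustrative, not rows from the marathon file):

# In[ ]:


import pandas as pd

sample_times = pd.Series(['01:05:38', '02:10:02', '04:00:00'])

# Parse the strings into timedeltas, then express them in seconds
sample_deltas = pd.to_timedelta(sample_times)
sample_seconds = sample_deltas.dt.total_seconds()
sample_seconds.tolist()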


# To get an idea of what the data looks like, we can plot a `jointplot` over the data; the following figure shows the result:

# In[21]:


with sns.axes_style('white'):
    g = sns.jointplot(x='split_sec', y='final_sec', data=data, kind='hex')
    g.ax_joint.plot(np.linspace(4000, 16000),
                    np.linspace(8000, 32000), ':k')
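

# The reference line added in the previous cell runs from (4000, 8000) to (16000, 32000), that is, along `final_sec = 2 * split_sec`. A quick check with the same `np.linspace` calls (`half_times` and `full_times` are names introduced only for this sketch):

# In[ ]:


import numpy as np

half_times = np.linspace(4000, 16000)
full_times = np.linspace(8000, 32000)

# Every point on the line has a final time of twice the half-way split
on_even_pace_line = np.allclose(full_times, 2 * half_times)
on_even_pace_line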


# The dotted line shows where someone's time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon.
# If you have run competitively, you'll know that those who do the opposite—run faster during the second half of the race—are said to have "negative-split" the race.
# 
# Let's create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race:

# In[22]:


data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']
data.head()
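

# To see how the sign works, here is the same formula applied to two hypothetical runners (the times are invented for illustration, and `split_fraction` is a helper defined only for this sketch):

# In[ ]:


def split_fraction(split_sec, final_sec):
    """Return 1 - 2 * split / final, as in the `split_frac` column."""
    return 1 - 2 * split_sec / final_sec

# First half in 1:30:00, finish in 3:20:00: the second half was slower
slowed_down = split_fraction(5400, 12000)

# First half in 1:45:00, finish in 3:20:00: the second half was faster
sped_up = split_fraction(6300, 12000)

slowed_down, sped_up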


# Where this split fraction is less than zero, the person negative-split the race by that fraction.
# Let's do a distribution plot of this split fraction (see the following figure):

# In[23]:


sns.displot(data['split_frac'], kde=False)
plt.axvline(0, color="k", linestyle="--");


# In[24]:


(data.split_frac < 0).sum()


# Out of nearly 40,000 participants, there were only 250 people who negative-split their marathon.
# 
# Let's see whether there is any correlation between this split fraction and other variables. We'll do this using a `PairGrid`, which draws plots of all these correlations (see the following figure):

# In[25]:


g = sns.PairGrid(data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
                 hue='gender', palette='RdBu_r')
g.map(plt.scatter, alpha=0.8)
g.add_legend();


# It looks like the split fraction does not correlate particularly with age, but does correlate with the final time: faster runners tend to have closer to even splits on their marathon time. Let's zoom in on the distributions of split fractions separated by gender, shown in the following figure:

# In[26]:


# `fill` replaces the deprecated `shade` argument
sns.kdeplot(data.split_frac[data.gender=='M'], label='men', fill=True)
sns.kdeplot(data.split_frac[data.gender=='W'], label='women', fill=True)
plt.xlabel('split_frac');


# The interesting thing here is that there are many more men than women who are running close to an even split!
# It almost looks like a bimodal distribution among the men and women. Let's see if we can suss out what's going on by looking at the distributions as a function of age.
# 
# A nice way to compare distributions is to use a *violin plot*, shown in the following figure:

# In[27]:


sns.violinplot(x="gender", y="split_frac", data=data,
               palette=["lightblue", "lightpink"]);


# Let's look a little deeper, and compare these violin plots as a function of age (see the following figure). We'll start by creating a new column in the `DataFrame` that specifies the age range that each person is in, by decade:

# In[28]:


data['age_dec'] = data.age.map(lambda age: 10 * (age // 10))
data.head()
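

# The floor division maps each age to the start of its decade, and the same expression also works on a whole column at once, without `map`. A small sketch on invented ages (`toy_ages` is a hypothetical stand-in for the `age` column):

# In[ ]:


import pandas as pd

toy_ages = pd.Series([29, 30, 47, 80, 86])

# Floor-divide by 10 to drop the last digit, then scale back up
toy_decades = 10 * (toy_ages // 10)
toy_decades.tolist()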


# In[29]:


men = (data.gender == 'M')
women = (data.gender == 'W')

with sns.axes_style(style=None):
    sns.violinplot(x="age_dec", y="split_frac", hue="gender", data=data,
                   split=True, inner="quartile",
                   palette=["lightblue", "lightpink"]);


# We can see where the distributions among men and women differ: the split distributions of men in their 20s to 50s show a pronounced overdensity toward lower splits when compared to women of the same age (or of any age, for that matter).
# 
# Also surprisingly, the women in their 80s seem to outperform *everyone* in terms of their split fraction, although this is likely a small-number effect, as there are only a handful of runners in that range:

# In[30]:


(data.age >= 80).sum()  # >= so that 80-year-olds, who fall in the 80s bin, are counted


# Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We'll use `lmplot`, which will automatically fit a linear regression model to the data (see the following figure):

# In[31]:


g = sns.lmplot(x='final_sec', y='split_frac', col='gender', data=data,
               markers=".", scatter_kws=dict(color='c'))
g.map(plt.axhline, y=0.0, color="k", ls=":");


# Apparently, among both men and women, the people with fast splits tend to be faster runners who are finishing within ~15,000 seconds, or about 4 hours. People slower than that are much less likely to have a fast second split.
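# 
# The straight line that `lmplot` draws is a least-squares linear fit by default. For a rough sense of what that involves, a comparable fit can be computed with `np.polyfit`; this sketch uses synthetic, noise-free data rather than the marathon results, and all names in it are illustrative:

# In[ ]:


import numpy as np

toy_final = np.linspace(8000, 20000, 25)         # made-up finishing times (s)
toy_frac = 0.05 + 1e-5 * (toy_final - 8000)      # made-up linear trend

# Degree-1 polynomial fit: returns (slope, intercept)
slope, intercept = np.polyfit(toy_final, toy_frac, deg=1)
slope, intercept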