import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as random
from IPython.display import Image
# seaborn is under active development and throws some scary looking warnings
import warnings
# this will allow us to use the code in peace :)
warnings.filterwarnings("ignore")
Learn how to use the seaborn package to produce beautiful plots
Learn about kernel density estimates
Learn appropriate ways of representing different types of data
In this lecture, we will learn about seaborn. seaborn is a package with many tools for data visualization. It allows you to make pretty plots. Almost anything in seaborn can be done using matplotlib, but with seaborn's built-in functions you can reduce a lot of matplotlib code down to a single line. seaborn isn't just a pretty face. Its real power is in statistical data analysis. It has a lot of functions built in for visualizing the distribution of your data, for example.
Let's take a look at some of the plots we can make with this package. We can import it using:
import seaborn as sns
In some cases, we have distributions of data that don't look like a simple (e.g., normal) distribution, for example, the data could be bimodal or have skewed shaped distributions (remember the histogram of the elevation data from around the world with two humps).
Let's create some synthetic bimodal data by drawing from two separate normal/lognormal distributions and combine the two into two bimodal data sets. We do this by drawing from random.normal( ) twice for two normal distributions ($x_1,x_2$) and twice from random.lognormal( ) for two lognormal distributions ($y_1,y_2$).
xdata1=random.normal(20,25,5000) # first x draw
xdata2=random.normal(100,25,5000) # second x draw
ydata1=random.lognormal(2,0.1,8000) # first y draw
ydata2=random.lognormal(3,0.1,2000) # second y draw
xdata=np.append(xdata1,xdata2) # combine the two x data sets
ydata=np.append(ydata1,ydata2) # combine the two y data sets
When we plot our xdata as a histogram, we can see that we have a broadly bimodal distribution. For fun, let's also plot the mean of the distribution as a red line.
# make the histogram
plt.hist(xdata,bins=50)
# put on a heavy (linewidth=3) vertical red line at the mean of xdata
plt.axvline(np.mean(xdata),color='red',linewidth=3);
We can see that our mean lies right between the twin peaks. Describing this distribution with statistics meant for normal distributions (mean or standard deviation) is just plain wrong.
Another way to represent the distribution of a set of datapoints is known as a kernel density estimate (kde). This places a 'kernel' (an assumed distribution at the data point level - usually a normal distribution) at each data point and then sums up the contributions from all the data points. Kernal density estimates avoid the awkwardness of choice of bin size associated with histograms, for example. (We just picked 50 in the plot above - why 50?).
Here are some data represented on a bar plot in the lefthand plot. And on the right, we illustrate the idea behind kernal density estimates. The black lines are the locations of individual datapoints and the red dashed lines are the kernels at each point. The heavy blue line is the kernel density estimate (the sum of all the red dashed lines).
Image('Figures/KDE.png')
[Source: https://commons.wikimedia.org/wiki/File:Comparison_of_1D_histogram_and_KDE.png Wikimedia Creative Commons]
Happily, we can plot kernel density estimates using the sns.kdeplot( ) function. The shade argument allows us to shade the area underneath the curve. By the way, in matplotlib, the same thing can be achieved using the function plt.fill_between.
sns.kdeplot(xdata,shade=True);
Well that was fairly painless! We can also plot the kernal density estimate and the histogram on top of one another using the sns.distplot( ) function.
sns.distplot(xdata);
As you can see, this is a lot quicker than how we were plotting our distribution in the lecture on statistics!
With our $ydata$ we can see that we also have a bimodal distribution, but there are far fewer data points in the wider mode (we only used 2000 of our 10000 points for this mode).
sns.distplot(ydata);
What if our data had both $x$ and $y$ components. For example, measurements of length and width from a set of shark teeth with two species in it. How would we visualize it? Let's just try plotting the $x$ data on the x axis and $y$ data on the y axis as dots on a regular matplotlib plot.
plt.plot(xdata,ydata,'.');