In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as random
from IPython.display import Image
# seaborn is under active development and throws some scary looking warnings
import warnings
# this will allow us to use the code in peace :)
warnings.filterwarnings("ignore")
```

Learn how to use the **seaborn** package to produce beautiful plots

Learn about kernel density estimates

Learn appropriate ways of representing different types of data

In this lecture, we will learn about **seaborn**. **seaborn** is a package with many tools for data visualization. It allows you to make pretty plots. Almost anything in **seaborn** can be done using **matplotlib**, but with **seaborn**'s built-in functions you can reduce a lot of **matplotlib** code down to a single line. **seaborn** isn't just a pretty face. Its real power is in statistical data analysis. It has a lot of functions built in for visualizing the distribution of your data, for example.

Let's take a look at some of the plots we can make with this package. We can import it using:

In [2]:

```
import seaborn as sns
```

In some cases, our data don't follow a simple (e.g., normal) distribution. For example, the data could be bimodal or skewed (remember the histogram of the elevation data from around the world with two humps).

Let's create some synthetic bimodal data by drawing from two separate normal distributions and two separate lognormal distributions, then combining each pair into a single bimodal data set. We do this by calling **random.normal( )** twice for the two normal distributions ($x_1,x_2$) and **random.lognormal( )** twice for the two lognormal distributions ($y_1,y_2$).

In [3]:

```
xdata1=random.normal(20,25,5000) # first x draw
xdata2=random.normal(100,25,5000) # second x draw
ydata1=random.lognormal(2,0.1,8000) # first y draw
ydata2=random.lognormal(3,0.1,2000) # second y draw
xdata=np.append(xdata1,xdata2) # combine the two x data sets
ydata=np.append(ydata1,ydata2) # combine the two y data sets
```
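One detail worth keeping in mind: the first two arguments of **random.lognormal( )** are the mean and standard deviation of the logarithm of the data, not of the data themselves. Here is a quick check, written as a sketch. It uses a seeded generator from `np.random.default_rng` instead of the **random** module so the result is reproducible; the seed and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

# seeded generator so the check is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)
# same parameters as the first y draw above
sample = rng.lognormal(2, 0.1, 8000)

# the log of the sample is normally distributed with mean 2 and standard deviation 0.1
log_mean = np.mean(np.log(sample))
log_std = np.std(np.log(sample))
# the median of the sample itself is therefore close to exp(2)
sample_median = np.median(sample)
print(log_mean, log_std, sample_median, np.exp(2))
```

So the two lognormal draws are centered near exp(2) and exp(3), not near 2 and 3.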

In [4]:

```
# make the histogram
plt.hist(xdata,bins=50)
# put on a heavy (linewidth=3) vertical red line at the mean of xdata
plt.axvline(np.mean(xdata),color='red',linewidth=3);
```

We can see that our mean lies right between the twin peaks, where there are comparatively few data points. Describing this distribution with statistics meant for normal distributions (mean or standard deviation) is just plain wrong.
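To put a number on this, we can compare how many points fall near the mean with how many fall near each peak. This is only a sketch: it regenerates similar data with a seeded generator, and the window half-width of 10 and the helper name `fraction_near` are arbitrary choices for illustration.

```python
import numpy as np

# regenerate bimodal data like xdata, with a fixed (arbitrary) seed
rng = np.random.default_rng(0)
x = np.append(rng.normal(20, 25, 5000), rng.normal(100, 25, 5000))

def fraction_near(data, center, half_width=10):
    """Fraction of the data lying within half_width of center."""
    return np.mean(np.abs(data - center) < half_width)

mean_x = np.mean(x)
near_mean = fraction_near(x, mean_x)   # window around the mean
near_peak1 = fraction_near(x, 20)      # window around the first peak
near_peak2 = fraction_near(x, 100)     # window around the second peak
print(mean_x, near_mean, near_peak1, near_peak2)
```

The window around the mean holds noticeably fewer points than the same-sized window around either peak, so the mean is not a "typical" value for these data.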

Another way to represent the distribution of a set of data points is a *kernel density estimate* (KDE). This places a 'kernel' (an assumed distribution at the data point level - usually a normal distribution) at each data point and then sums up the contributions from all the data points. Kernel density estimates avoid the awkward choice of bin size that comes with histograms. (We just picked 50 in the plot above - why 50?)

Here are some data represented as a histogram in the lefthand plot. On the right, we illustrate the idea behind kernel density estimates. The black lines are the locations of individual data points and the red dashed lines are the kernels at each point. The heavy blue line is the kernel density estimate (the sum of all the red dashed lines).

In [5]:

```
Image('Figures/KDE.png')
```

Out[5]:

*Source:* https://commons.wikimedia.org/wiki/File:Comparison_of_1D_histogram_and_KDE.png (*Wikimedia Creative Commons*)
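The construction in the figure can be written out directly. With $n$ data points $x_i$, a kernel $K$, and a bandwidth $h$ (the width of each kernel), the estimate is $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x-x_i}{h}\right)$. The sketch below uses a normal kernel. The six data values, the bandwidth `h=1.5`, and the function name `gaussian_kde_by_hand` are made up for illustration, and this is not necessarily how **seaborn** computes its curves internally.

```python
import numpy as np

def gaussian_kde_by_hand(points, grid, h):
    """Average one normal kernel of width h per data point, evaluated on grid."""
    # one row per data point, one column per grid location
    z = (grid[np.newaxis, :] - points[:, np.newaxis]) / h
    kernels = np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))
    # average the kernels so the estimate integrates to 1
    return kernels.mean(axis=0)

points = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])  # made-up data
grid = np.linspace(-10, 15, 501)
density = gaussian_kde_by_hand(points, grid, h=1.5)

# the area under the curve should be very close to 1
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

Each row of `kernels` plays the role of one red dashed curve in the figure (before dividing by the number of points), and `density` plays the role of the heavy blue line.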

We can plot a kernel density estimate of our data with the **sns.kdeplot( )** function. The **shade** argument allows us to shade the area underneath the curve. By the way, in **matplotlib**, the same thing can be achieved using the function **plt.fill_between**.

In [6]:

```
sns.kdeplot(xdata,shade=True);
```

We can plot a histogram together with its kernel density estimate using the **sns.distplot( )** function.

In [6]:

```
sns.distplot(xdata);
```

As you can see, this is a lot quicker than how we were plotting our distribution in the lecture on statistics!

With **ydata**, we can see that we also have a bimodal distribution, but there are far fewer data points in the wider mode (we only used 2000 of our 10000 points for this mode).

In [7]:

```
sns.distplot(ydata);
```

We can also plot **xdata** against **ydata** with a regular **matplotlib** plot.

In [8]:

```
plt.plot(xdata,ydata,'.');
```