Basic Principles

Show different ways to present statistical data.

This script is written in MATLAB or IPython style, to show how best to use Python interactively. Note that in IPython, the show() commands are generated automatically. The examples contain:

  • scatter plots
  • histograms
  • KDE
  • errorbars
  • boxplots
  • probplots
  • cumulative distribution functions
  • regression fits

Author: Thomas Haslwanter, March 2015

Getting things ready

First, import the libraries that you are going to need. You could also do that later, but it is better style to do it at the beginning. The %pylab magic imports numpy and matplotlib.pyplot into the current namespace. scipy is not included, which is why scipy.stats is imported explicitly.

In [21]:
%pylab inline

import scipy.stats as stats
import seaborn as sns
sns.set_context('notebook')
sns.set_style('darkgrid')
Populating the interactive namespace from numpy and matplotlib
In [24]:
# Generate data that are normally distributed
x = randn(50)

Scatter plot

In [25]:
plot(x,'.')
title('Scatter Plot')
xlabel('X')
ylabel('Y')
draw()
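
The cells in this notebook rely on %pylab, which makes names such as randn and plot available without a prefix. Outside of IPython the same scatter plot can be written with explicit imports. The following is a minimal sketch, not part of the original notebook; the seed value and the names rng, fig and ax are choices made only for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Seeded generator so that the example is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(12345)
x = rng.standard_normal(50)     # plays the same role as randn(50) above

# Explicit figure and axes objects instead of the implicit pylab state
fig, ax = plt.subplots()
ax.plot(x, '.')
ax.set_title('Scatter Plot')
ax.set_xlabel('X')
ax.set_ylabel('Y')
# In a plain Python script, plt.show() would be needed here to display the figure
```

The explicit style is more verbose, but it makes clear which library each function comes from.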

Histogram

In [26]:
hist(x)
xlabel('Data Values')
ylabel('Frequency')
title('Histogram, default settings')
Out[26]:
<matplotlib.text.Text at 0x24320fd3550>
In [27]:
x = randn(1000)
In [29]:
hist(x,25)
xlabel('Data Values')
ylabel('Frequency')
title('Histogram, 25 bins')
Out[29]:
<matplotlib.text.Text at 0x2432172c978>

KDE

Kernel Density Estimation (KDE)

In [30]:
import seaborn as sns
sns.kdeplot(x)
xlabel('Data Values')
ylabel('Density')
Out[30]:
<matplotlib.text.Text at 0x243215c6e10>
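
To show what a kernel density estimate is, the following sketch computes one directly with scipy.stats.gaussian_kde. It is an illustration under stated assumptions (Gaussian kernels, the default bandwidth rule of scipy, an arbitrary seed) and is not necessarily how sns.kdeplot obtains its curve.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
x = rng.standard_normal(1000)

# Fit a KDE with Gaussian kernels; the bandwidth is chosen by the scipy default rule
kde = stats.gaussian_kde(x)

# Evaluate the estimated density on a regular grid that extends beyond the data
grid = np.linspace(x.min() - 1, x.max() + 1, 200)
density = kde(grid)

# A density should integrate to approximately 1 (simple Riemann sum as a check)
area = np.sum(density) * (grid[1] - grid[0])
print('Area under the estimated density: {0:.3f}'.format(area))
```

Plotting density against grid gives a smooth curve comparable to the one above.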

Cumulative distribution function

In [31]:
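numbins = 20
cumcount, lowerlimit, binsize, extrapoints = stats.cumfreq(x, numbins)
# Upper edge of each bin, so that the x-axis shows data values and not bin indices
upperEdges = lowerlimit + binsize * arange(1, numbins+1)
plot(upperEdges, cumcount)
xlabel('Data Values')
ylabel('Cumulative Frequency')
title('Cumulative distribution function')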
Out[31]:
<matplotlib.text.Text at 0x24320f4f9b0>
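
The binned curve above depends on the number of bins. As an alternative, the empirical cumulative distribution function can be computed without any binning, by sorting the data. This is a minimal sketch with an arbitrary seed; the variable names are chosen for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
x = rng.standard_normal(1000)

# Sort the data; the i-th smallest value has cumulative fraction i/n
x_sorted = np.sort(x)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

# Fraction of the data that is <= 0; for standard normal data this is close to 0.5
frac_below_zero = np.searchsorted(x_sorted, 0.0, side='right') / len(x_sorted)
print('Fraction of values <= 0: {0:.3f}'.format(frac_below_zero))
```

Plotting ecdf against x_sorted gives a step-like curve that rises from 1/n to 1.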

Boxplot

In [34]:
# The whiskers extend to the most extreme data points that lie within 1.5 times the
# inter-quartile-range (IQR) of the box; the box shows the first, second (median) and third quartile
boxplot(x, sym='o')
title('Boxplot')
ylabel('Values')
Out[34]:
<matplotlib.text.Text at 0x24320fe1550>
In [35]:
boxplot(x, vert=False, sym='*')
title('Boxplot, horizontal')
xlabel('Values')
Out[35]:
<matplotlib.text.Text at 0x24321737748>
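
The numbers behind a boxplot can also be calculated by hand. The sketch below assumes the default whisker length of 1.5 times the IQR and an arbitrary seed; the names lower_fence and upper_fence are introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
x = rng.standard_normal(1000)

# The box: first quartile, median, and third quartile
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which points are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers end at the most extreme data points inside the fences
inside = x[(x >= lower_fence) & (x <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = x[(x < lower_fence) | (x > upper_fence)]

print('Box: {0:.2f} / {1:.2f} / {2:.2f}'.format(q1, median, q3))
print('Whiskers: {0:.2f} to {1:.2f}, outliers: {2}'.format(
    whisker_low, whisker_high, len(outliers)))
```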

Errorbars

In [10]:
x = arange(5)
y = x**2
errorBar = x/2.0    # size of the error bars; 2.0 forces float division under Python 2 as well
errorbar(x,y, yerr=errorBar, fmt='o', capsize=5, capthick=3)

plt.xlabel('Data Values')
plt.ylabel('Measurements')
plt.title('Errorbars')

xlim([-0.2, 4.2])
ylim([-0.2, 19])
Out[10]:
(-0.2, 19)
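
The error bars above use made-up values (x/2). With measured data, a common choice is the standard error of the mean. The following sketch assumes hypothetical data, 20 repeated measurements for each of 5 conditions, generated here with an arbitrary seed and noise level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only

# Hypothetical data: 20 repeated measurements (rows) for each of 5 conditions (columns)
conditions = np.arange(5)
data = rng.normal(loc=conditions**2, scale=2.0, size=(20, 5))

# Mean and standard error of the mean for each condition
means = data.mean(axis=0)
sems = stats.sem(data, axis=0)

for c, m, s in zip(conditions, means, sems):
    print('x = {0}: {1:.2f} +/- {2:.2f}'.format(c, m, s))
```

The arrays means and sems could then be passed to errorbar as y and yerr.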

Check for Normality

In [36]:
# Visual check
x = randn(100)
_ = stats.probplot(x, plot=plt)
title('Probplot - check for normality')
Out[36]:
<matplotlib.text.Text at 0x24320df7438>
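
The probplot is a visual check. It can be complemented by a numerical test, for example the Shapiro-Wilk test in scipy.stats. This is a minimal sketch; the seed and the significance level of 0.05 are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
x = rng.standard_normal(100)

# Shapiro-Wilk test; the null hypothesis is that the data are normally distributed
W, p = stats.shapiro(x)

alpha = 0.05                            # conventional significance level (an assumption)
if p > alpha:
    print('No evidence against normality (p = {0:.3f})'.format(p))
else:
    print('Data deviate from a normal distribution (p = {0:.3f})'.format(p))
```

A large p-value does not prove normality; it only means that the test found no significant deviation.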

2D Plot

In [37]:
# Generate data
x = randn(200)
y = 10+0.5*x+randn(len(x))

# Scatter plot
scatter(x,y)
# This one is quite similar to "plot(x,y,'.')"
title('Scatter plot of data')
xlabel('X')
ylabel('Y')
Out[37]:
<matplotlib.text.Text at 0x24320e67dd8>

Line Fit

In [14]:
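# Design matrix: a column of ones (intercept) and a column with the x-values (slope)
M = vstack((ones(len(x)), x)).T
# Least-squares solution; rcond=None selects the current numpy default and avoids a FutureWarning
pars = linalg.lstsq(M, y, rcond=None)[0]
intercept = pars[0]
slope = pars[1]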
scatter(x,y)
plot(x, intercept + slope*x, 'r')
show()
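
The same line can be obtained with scipy.stats.linregress, which also returns the correlation coefficient and a p-value for the slope. The sketch below regenerates data of the same form as in the 2D plot, with an arbitrary seed, and compares the result with the least-squares solution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
x = rng.standard_normal(200)
y = 10 + 0.5 * x + rng.standard_normal(len(x))

# Simple linear regression of y on x
result = stats.linregress(x, y)

# Same fit via the design matrix, for comparison
M = np.column_stack((np.ones(len(x)), x))
pars = np.linalg.lstsq(M, y, rcond=None)[0]

print('linregress: intercept = {0:.3f}, slope = {1:.3f}'.format(
    result.intercept, result.slope))
print('lstsq:      intercept = {0:.3f}, slope = {1:.3f}'.format(pars[0], pars[1]))
print('r = {0:.3f}, p = {1:.3g}'.format(result.rvalue, result.pvalue))
```

Both approaches should give the same intercept and slope up to numerical precision, close to the values 10 and 0.5 used to generate the data.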