In [2]:
%run ../../../common_functions/import_all.py

from common_functions.setup_notebook import set_css_style, setup_matplotlib, config_ipython
config_ipython()
setup_matplotlib()
set_css_style()
Out[2]:

Displaying data with histograms & boxplots

This notebook contains some code that samples data from a few probability distributions and displays the results.

Uniform distribution

In a uniform distribution all values are equiprobable. Let's say we uniformly extract $10^5$ integer data points in the interval $[0, 10)$:

In [3]:
n = 100000

data = [random.randint(0, 9) for i in range(n)]

and then let's count, for each of the 10 possible values, how many points take that value, which is equivalent to binning with a bin width of 1:

In [4]:
bins = np.arange(0, 11, 1)

# Count the number of items falling in each bin
bin_counts = [data.count(item) for item in range(0, 10)]
    
plt.bar(range(0, 10), bin_counts, width=1, edgecolor='k')
plt.xticks(bins)
plt.title('Histogram of $10^5$ uniformly distributed data, counts')
plt.xlabel('Value')
plt.ylabel('Count items')
plt.savefig('uniform-dist.png', dpi=200)
plt.show();
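
The counting step above calls `data.count` once per value, so it scans the whole list ten times. As a possible alternative, here is a minimal sketch using `np.bincount`, which tallies non-negative integers in a single pass. The seeded generator and the regenerated sample are assumptions made for reproducibility; they are not part of the original cells.

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen arbitrarily for this sketch
n = 100000
data = rng.integers(0, 10, size=n)        # integers in [0, 10), like random.randint(0, 9)

# One pass over the data; minlength guarantees an entry for every value 0..9,
# even if one of them never occurs in the sample
bin_counts = np.bincount(data, minlength=10)

print(bin_counts)
```

The resulting array can be passed to `plt.bar` exactly like the list built with `data.count`.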

It's plainly visible that the bins contain pretty much the same number of values, namely around $10^4$, which is the total divided by the number of bins. Indeed, the difference between the highest and the lowest bin counts is

In [5]:
max(bin_counts) - min(bin_counts)
Out[5]:
312

corresponding to a proportion of the total number of points equal to

In [6]:
(max(bin_counts) - min(bin_counts) ) / n
Out[6]:
0.00312

very little!
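
To judge whether a spread of this size is reasonable, one can compare it with the fluctuation expected from sampling alone. Each bin count follows a binomial distribution with $n = 10^5$ trials and success probability $p = 1/10$, so its standard deviation is $\sqrt{n p (1 - p)}$. A small sketch of this arithmetic (the variable names are illustrative):

```python
import math

n = 100000
p = 1 / 10                                # probability of each of the 10 values

sigma = math.sqrt(n * p * (1 - p))        # standard deviation of a single bin count
observed_spread = 312                     # max - min reported above

print(sigma)                              # about 94.87
print(observed_spread / sigma)            # spread expressed in units of sigma
```

The observed spread amounts to a few standard deviations, which is the order of magnitude one would expect for the gap between the largest and the smallest of ten such counts.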

This histogram does a good job of showing that the data is uniformly distributed. What we have shown is, again, the number of values in each bin, which is not the probability.

To obtain a PMF instead, we take the count of each value and divide it by the total number of values, which gives the relative frequencies. These are empirical estimates of the probability of each possible value, and they sum up to 1 (up to floating-point rounding):

In [7]:
freq_counts = [item / n for item in bin_counts]
sum(freq_counts)
Out[7]:
0.9999999999999999
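
The sum differs from 1 only because of floating-point rounding in the ten divisions. Below is a hedged sketch of two more robust ways to check the normalisation: comparing the integer counts with the total, and comparing the float sum with a tolerance. The sample is regenerated with a seeded generator, which is an assumption of this example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
n = 100000
data = rng.integers(0, 10, size=n)
bin_counts = np.bincount(data, minlength=10)

freq_counts = bin_counts / n              # relative frequencies (empirical PMF)

exact_ok = int(bin_counts.sum()) == n                # exact check on integers
approx_ok = math.isclose(freq_counts.sum(), 1.0)     # float check with a tolerance

print(exact_ok, approx_ok)
```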

Then we can easily plot them to obtain the PMF histogram:

In [8]:
plt.bar(range(0, 10), freq_counts, width=1, edgecolor='k')
plt.xticks(range(0, 10))
plt.title('Histogram of $10^5$ uniformly distributed data, probs')
plt.xlabel('Value')
plt.ylabel('Frequency (prob) items')
plt.savefig('pmf-uniform.png', dpi=200)
plt.show();

A gaussian distribution

Now let's consider a gaussian distribution instead, taking the same number ($10^5$) of points and plotting the bin counts again. This time we extract floats, randomly sampled from a gaussian distribution with mean 0 and standard deviation 1. We then split the range of the data into 20 bins and plot the counts of each bin as above. We draw a line rather than bars to signify that the variable is continuous.

We attribute the count of each bin to the midpoint of the bin.

In [9]:
data = np.random.normal(size=n)
In [10]:
bins = 20                                                                            # choose to separate into 20 bins
hist = np.histogram(data, bins=bins)
hist_vals, bin_edges = hist[0], hist[1]
bin_mids = [(bin_edges[i] + bin_edges[i+1])/2 for i in range(len(bin_edges) -1)]     # mids of bins again
  
plt.plot(bin_mids, hist_vals, marker='o')
plt.title('Histogram $10^5$ normally distributed data')
plt.xlabel('Bin mid')
plt.ylabel('Count items')
plt.savefig('gaussian-dist.png', dpi=200)
plt.show();
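
The bin midpoints can also be computed without an explicit loop, by averaging the array of left edges with the array of right edges. This is a minimal sketch on a seeded sample (the seed and the name `loop_mids` are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
data = rng.normal(size=100000)

hist_vals, bin_edges = np.histogram(data, bins=20)

# bin_edges[:-1] holds the left edges, bin_edges[1:] the right edges
bin_mids = (bin_edges[:-1] + bin_edges[1:]) / 2

# Same result as the explicit loop over consecutive edges
loop_mids = [(bin_edges[i] + bin_edges[i + 1]) / 2 for i in range(len(bin_edges) - 1)]
```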

Each bin has a width of

In [11]:
bin_edges[1] - bin_edges[0]
Out[11]:
0.42651501057702657
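
When `bins` is an integer, `np.histogram` splits the interval between the smallest and the largest sample into that many equal-width bins, so the width depends on the extremes of the particular sample drawn. A quick check on a freshly generated, seeded sample (an assumption of this sketch, so the number will differ from the one above):

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
data = rng.normal(size=100000)

n_bins = 20
_, bin_edges = np.histogram(data, bins=n_bins)

expected_width = (data.max() - data.min()) / n_bins   # range divided by number of bins
widths = np.diff(bin_edges)                            # actual width of every bin

print(expected_width, widths[0])
```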

The plot makes it quite clear that the distribution is centred at 0, as expected. We can also draw the same histogram showing the PDF instead:

In [12]:
bins = 20
hist = np.histogram(data, bins=bins, density=True)
hist_vals, bin_edges = hist[0], hist[1]
bin_mids = [(bin_edges[i] + bin_edges[i+1])/2 for i in range(len(bin_edges) -1)]

plt.plot(bin_mids, hist_vals, marker='o')
plt.title('Histogram $10^5$ normally distributed data')
plt.xlabel('Bin mid')
plt.ylabel('PDF')
plt.savefig('pdf-gaussian.png', dpi=200)
plt.show();

Because what we plotted above is a probability density, what sums up to 1 is not the plotted values themselves but the products of each value and its bin width:

In [13]:
sum([(bin_edges[i+1] - bin_edges[i]) * hist_vals[i] for i in range(len(hist_vals))])
Out[13]:
1.0000000000000002
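
The same area computation can be written with `np.diff`, which returns the widths of all bins at once. The sketch below uses a seeded sample (an assumption of the example) and also shows that the density values alone do not sum to 1:

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
data = rng.normal(size=100000)

density, bin_edges = np.histogram(data, bins=20, density=True)

area = np.sum(density * np.diff(bin_edges))   # sum of height times width over the bins
plain_sum = density.sum()                     # not a probability: depends on bin width

print(area, plain_sum)
```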

Indeed, taking the first bin as an example, its density is the probability of falling in that bin divided by the bin width, which is:

In [14]:
hist_vals[0]
Out[14]:
0.00018756666943976733
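
The relation between the count histogram and the density histogram can be made explicit: the density of a bin is its count divided by the total number of points and by the bin width. A minimal sketch on a seeded sample (the seed and the names `manual_density` and `prob_first_bin` are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
data = rng.normal(size=100000)

counts, bin_edges = np.histogram(data, bins=20)
density, _ = np.histogram(data, bins=20, density=True)
widths = np.diff(bin_edges)

# Density rebuilt by hand from the raw counts
manual_density = counts / (data.size * widths)

# Multiplying back by the width recovers the probability of the first bin
prob_first_bin = density[0] * widths[0]

print(prob_first_bin, counts[0] / data.size)
```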

The power-law

In [16]:
# Use 100 logarithmically spaced bin edges between 10^0 and 10^4 (i.e. 99 bins)
bins = np.logspace(0, 4, num=100)   
z = np.random.zipf(a=2, size=n)                     # Zipf (power-law) x^{-2}

hist = np.histogram(z, bins=bins)
hist_vals, bin_edges = hist[0], hist[1]
bin_mids = [(bin_edges[i] + bin_edges[i+1])/2 for i in range(len(bin_edges) -1)]     # middle point of bin

# Main plot: in linear scale
plt.plot(bin_mids, hist_vals)
plt.xlabel('Bin mid')
plt.ylabel('Count items')
plt.title('Histogram $10^5$ pow-law distributed data')

# Inset plot: in log-log scale
a = plt.axes([.4, .4, .4, .4], facecolor='y')
plt.loglog(bin_mids, hist_vals)
plt.title('In log-log scale')
plt.ylabel('Count items')
plt.xlabel('Bin mid')

plt.savefig('powlaw.png', dpi=200)
plt.show();
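
With logarithmic bins the widths grow with the value, so raw counts are not directly comparable across bins. A common remedy, sketched here as an assumption rather than as part of the original analysis, is to divide each count by its bin width and by the sample size, which gives an estimate of the density. Note also that `np.histogram` ignores samples falling beyond the last edge ($10^4$ here).

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
n = 100000
z = rng.zipf(a=2, size=n)

bins = np.logspace(0, 4, num=100)         # 100 edges, hence 99 bins
counts, bin_edges = np.histogram(z, bins=bins)
widths = np.diff(bin_edges)

density = counts / (n * widths)           # count per unit width, per sample
covered = counts.sum() / n                # fraction of samples inside [1, 10^4]

print(covered)
```

Plotting `density` instead of `counts` against the bin midpoints in log-log scale would then show the decay of the distribution itself rather than that of the binned counts.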

Using boxplots as a visualisation tool

Another very useful and quite comprehensive way to display distributions is through boxplots. A boxplot shows, in one go, the quartiles (including the median), optionally the mean, and the potential outliers of a distribution.

To start off with, let's generate $10^5$ random numbers from each of the following distributions (in order):

  • uniform distribution
  • gaussian distribution with mean 0 and standard deviation 1
  • exponential distribution
  • Zipf distribution $\propto x^{-2}$

Let's set the styles of our boxplots: the mean will be a red diamond-shaped point, the median a red line, and the outliers blue circles.

Also, we identify as outliers those points which lie more than 1.5 times the interquartile range (the distance between the first and the third quartiles) below the first quartile or above the third quartile, according to Tukey's criterion.
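
Tukey's criterion measures the distance from the quartiles in units of the interquartile range (IQR), which is what the `whis=1.5` argument of `boxplot` expresses. A minimal sketch of the rule follows; the helper name `tukey_outliers` and the seeded sample are hypothetical, introduced only for illustration.

```python
import numpy as np

def tukey_outliers(x, whis=1.5):
    """Return a boolean mask of the points outside Tukey's fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1                          # interquartile range
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    return (x < lower_fence) | (x > upper_fence)

rng = np.random.default_rng(0)             # arbitrary seed, for reproducibility only
u = rng.uniform(size=100000)

mask = tukey_outliers(u)
print(mask.sum())                          # number of flagged points
```

For a uniform sample on $[0, 1)$ the quartiles are close to 0.25 and 0.75, so the fences sit near $-0.5$ and $1.5$, outside the range of the data, and no point is flagged.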

We then create a boxplot for each of our distributions.

In [16]:
# Extract the random points

n = 100000                                                             # the number of points to extract

u = np.random.uniform(size=n)                                          # uniform
g = np.random.normal(size=n)                                           # gaussian
e = np.random.exponential(size=n)                                      # exponential
z = np.random.zipf(a=2, size=n)                                        # Zipf (power-law) x^{-2}
In [17]:
# Set style of the plot

# Style of the point for the mean
meanpointprops = dict(marker='D', markeredgecolor='black', markerfacecolor='firebrick')

# Style of the outliers points
flierpointprops = dict(marker='o', markeredgecolor='blue', linestyle='none')

# Style of the median line
medianlineprops = dict(linewidth=2.5, color='firebrick')
In [18]:
# Uniform distribution - boxplotting

ax = plt.subplot()

ax.boxplot(u, whis=1.5, showmeans=True, vert=False,
           meanprops=meanpointprops, flierprops=flierpointprops, medianprops=medianlineprops)

plt.title('Box plot of a uniform distribution')
plt.savefig('uniform-box.png', dpi=200)
plt.show();

Clearly the mean sits in the middle of the range of values and there are no outliers.

In [19]:
# The gaussian distribution

ax = plt.subplot()

ax.boxplot(g, whis=1.5, showmeans=True, vert=False,
           meanprops=meanpointprops, flierprops=flierpointprops, medianprops=medianlineprops)

plt.title('Box plot of a gaussian distribution with mean 0 and std 1')
plt.savefig('gaussian-box.png', dpi=200)
plt.show();

The symmetrical tail of "outliers" is visible.
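
These points are flagged by the 1.5 IQR rule even though they are ordinary draws from the tails of the distribution. For a standard normal distribution the expected fraction can be worked out from the quantile function; the sketch below uses `statistics.NormalDist` from the standard library, and the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

q1 = std_normal.inv_cdf(0.25)              # first quartile, about -0.674
q3 = std_normal.inv_cdf(0.75)              # third quartile, about +0.674
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass beyond the two fences
frac_outliers = std_normal.cdf(lower_fence) + (1 - std_normal.cdf(upper_fence))

print(frac_outliers)                       # roughly 0.007
```

With $10^5$ points this corresponds to roughly 700 flagged values, split evenly between the two sides.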

In [20]:
# The exponential distribution

ax = plt.subplot()

ax.boxplot(e, whis=1.5, showmeans=True, vert=False,
           meanprops=meanpointprops, flierprops=flierpointprops, medianprops=medianlineprops)

plt.title('Box plot of an exponential distribution')
plt.savefig('exp-box.png', dpi=200)
plt.show();

This time the distribution is heavily asymmetrical and the long tail of large values is clear.
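
The asymmetry can be quantified in closed form. For an exponential distribution with scale 1 (the default of `np.random.exponential`), the quantile function is $Q(p) = -\ln(1 - p)$ and the probability of exceeding $x$ is $e^{-x}$. A small sketch of the resulting fences, with illustrative variable names:

```python
import math

q1 = -math.log(1 - 0.25)                   # first quartile, ln(4/3)
q3 = -math.log(1 - 0.75)                   # third quartile, ln(4)
iqr = q3 - q1                              # ln(3)

lower_fence = q1 - 1.5 * iqr               # negative, so nothing can fall below it
upper_fence = q3 + 1.5 * iqr

frac_above = math.exp(-upper_fence)        # expected fraction beyond the upper fence

print(lower_fence, upper_fence, frac_above)
```

The lower fence is negative while the data is non-negative, so all flagged points lie on the right, and they are expected to make up a little under 5% of the sample.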

In [21]:
# The Zipf distribution

ax = plt.subplot()

ax.boxplot(z, whis=1.5, showmeans=True, vert=False,
           meanprops=meanpointprops, flierprops=flierpointprops, medianprops=medianlineprops)

plt.title('Box plot of a Zipf distribution')
plt.savefig('zipf-box.png', dpi=200)
plt.show();

It's a heavy-tailed distribution: the quartiles are very close to each other compared with the full span of the points.
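
For $a = 2$ the Zipf probabilities are $k^{-2}/\zeta(2)$ with $\zeta(2) = \pi^2/6 \approx 1.645$, so about 61% of the samples equal 1 and about 76% are at most 2. The quartiles therefore collapse onto the smallest values, while the maximum can be orders of magnitude larger. A hedged sketch that checks this on a seeded sample (the seed and variable names are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, for reproducibility only
z = rng.zipf(a=2, size=100000)

q1, median, q3 = np.percentile(z, [25, 50, 75])
upper_fence = q3 + 1.5 * (q3 - q1)         # Tukey's upper fence

frac_flagged = np.mean(z > upper_fence)    # share of points drawn as outliers

print(q1, median, q3, z.max(), frac_flagged)
```

With quartiles at 1, 1 and 2, the upper fence sits at 3.5, so every value of 4 or more is drawn as an outlier; for this distribution that is a substantial share of the sample rather than a handful of anomalous points.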