Notebook

Chapter 4: Expectation¶

This Jupyter notebook is the Python equivalent of the R code in section 4.11 R, pp. 175 - 178, Introduction to Probability, 1st Edition, Blitzstein & Hwang.

In [1]:

import numpy as np

Geometric, Negative Binomial, and Poisson¶

The three functions for the Geometric distribution in SciPy's scipy.stats.geom are pmf, cdf, and rvs, corresponding to the PMF, CDF, and random number generation. For pmf and cdf, we need to supply the following as inputs: (1) the value k at which to evaluate the PMF or CDF, and (2) the parameter p. For rvs, we need to input (1) the number of random variables to generate and (2) the parameter p.

For example, to calculate $P(X = 3)$ and $P(X \leq 3)$ where $X \sim Geom(0.5)$, we use geom.pmf(3, 0.5) and geom.cdf(3, 0.5), respectively. To generate 100 i.i.d. $Geom(0.8)$ r.v.s, we use geom.rvs(0.8, size=100). If instead we want 100 i.i.d. $FS(0.8)$ r.v.s, we just need to add 1 to include the success: geom.rvs(0.8, size=100) + 1.

In [2]:

# seed the random number generator
np.random.seed(610)

from scipy.stats import geom

# to learn more about scipy.stats.geom, un-comment ouf the following line
#print(geom.__doc__)

print('geom.pmf(3, 0.5) = {}'.format(geom.pmf(3, 0.5)))

print('\ngeom.cdf(3, 0.5) = {}'.format(geom.cdf(3, 0.5)))

print('\ngeom.rvs(0.8, size=100) = {}'.format(geom.rvs(0.8, size=100)))

print('\ngeom.rvs(0.8, size=100) + 1 = {}'.format(geom.rvs(0.8, size=100)+1))

geom.pmf(3, 0.5) = 0.125

geom.cdf(3, 0.5) = 0.875

geom.rvs(0.8, size=100) = [1 1 1 1 1 1 1 1 2 2 1 1 2 1 2 1 3 1 1 2 1 1 1 1 1 1 3 1 1 1 1 2 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 1 3 1 2 1 1 1 1 3 1 1 2 1 1 1 1 1
 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2]

geom.rvs(0.8, size=100) + 1 = [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 4 2 2 3 2 2 3 2 2 3 2
 2 3 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 4 2 2 2 2 3 2 2 2]

For the Negative Binomial distribution, we have scipy.stats.nbinom pmf, cdf, and rvs. These take three inputs. For example, to calculate the $NBin(5, 0.5)$ PMF at 3, we type nbimom.pmf(3, 5, 0.5).

In [3]:

from scipy.stats import nbinom

# to learn more about scipy.stats.nbinom, un-comment ouf the following line
#print(nbinom.__doc__)

print('nbinom.pmf(3, 5, 0.5) = {}'.format(nbinom.pmf(3, 5, 0.5)))

nbinom.pmf(3, 5, 0.5) = 0.1367187500000001

Finally, for the Poisson distribution and scipy.stats.poisson, the three functions are pmf, cdf, and rvs. These take two inputs. For example, to find the $Pois(10)$ CDF at 2, we type poisson.cdf(2, 10).

In [4]:

from scipy.stats import poisson

# to learn more about scipy.stats.poisson, un-comment ouf the following line
#print(poisson.__doc__)

print('poisson.cdf(2, 10) = {}'.format(poisson.cdf(2, 10)))

poisson.cdf(2, 10) = 0.0027693957155115775

Matching simulation¶

Continuing with Example 4.4.4, let's use simulation to calculate the expected number of matches in a deck of cards. As in Chapter 1, we let $n$ be the number of cards in the deck and perform the experiment 10⁴ times using iteration with a for-loop.

In [5]:

np.random.seed(987)

n = 100
trials = 10**4

ordered = np.arange(1, n+1)

r = []
for i in range(trials):    
    shuffled = np.random.permutation(np.arange(1, n+1))
    m = np.sum(shuffled == ordered)
    r.append(m)

Now $r$ contains the number of matches from each of the 10⁴ simulations. But instead of looking at the probability of at least one match, as in Chapter 1, we now want to find the expected number of matches. We can approximate this by the average of all the simulation results, that is, the arithmetic mean of the elements of $r$. This is accomplished with the numpy.mean function:

In [6]:

np.mean(r)

Out[6]:

1.0127

The command numpy.mean(r) is equivalent to numpy.sum(r)/len(r). The result we get is very close to 1, confirming the calculation we did in Example 4.4.4 using indicator r.v.s. You can verify that no matter what value of $n$ you choose, numpy.mean(r) will be very close to 1.

Distinct birthdays simulation¶

Let's calculate the expected number of distinct birthdays in a group of $k$ people by simulation. We'll let $k = 20$, but you can choose whatever value of $k$ you like.

In [7]:

np.random.seed(1597)

k = 20
trials = 10**4

r = []
for i in range(trials):
    bdays = np.random.choice(np.arange(1,365+1), k)
    uniqs = len(np.unique(bdays))
    r.append(uniqs)

We use a for loop to iterate 10⁴ times, so we just need to understand what is inside the body of the for loop. First, we sample k times with replacement from the numbers 1 through 365 and call these the birthdays of the k people, bdays. Then, numpy.unique(bdays) removes duplicates in the array bdays, and len(numpy.unique(bdays)) returns the length of the array after duplicates have been removed (number of unique birthdays). This count of unique birthdays is appended to array r.

Now r contains the number of distinct birthdays that we observed in each of the 10⁴ simulations. The average number of distinct birthdays across the 10⁴ simulations is numpy.mean(r). We can compare the simulated value to the theoretical value that we found in Example 4.4.5 using indicator r.v.s:

In [8]:

simulated = np.mean(r)
theoretical = 365*(1-(364/365)**k)

print('simulated: {}'.format(simulated))
print('theoretical: {}'.format(theoretical))     

simulated: 19.4921
theoretical: 19.487910239138333

When we ran the code, both the simulated and theoretical values gave us approximately 19.5.