This Jupyter notebook is the Python equivalent of the R code in section 4.11 R, pp. 192 - 194, Introduction to Probability, Second Edition, Blitzstein & Hwang.
import numpy as np
The three functions for the Geometric distribution in SciPy's scipy.stats.geom
are pmf
, cdf
, and rvs
, corresponding to the PMF, CDF, and random number generation. For pmf
and cdf
, we need to supply the following as inputs: (1) the value k
at which to evaluate the PMF or CDF, and (2) the parameter p
. For rvs
, we need to input (1) the number of random variables to generate and (2) the parameter p
.
For example, to calculate $P(X = 3)$ and $P(X \leq 3)$ where $X \sim Geom(0.5)$, we use geom.pmf(3, 0.5)
and geom.cdf(3, 0.5)
, respectively. To generate 100 i.i.d. $Geom(0.8)$ r.v.s, we use geom.rvs(0.8, size=100)
. If instead we want 100 i.i.d. $FS(0.8)$ r.v.s, we just need to add 1 to include the success: geom.rvs(0.8, size=100) + 1
.
# seed the random number generator
np.random.seed(610)
from scipy.stats import geom
# to learn more about scipy.stats.geom, un-comment ouf the following line
#print(geom.__doc__)
print('geom.pmf(3, 0.5) = {}'.format(geom.pmf(3, 0.5)))
print('\ngeom.cdf(3, 0.5) = {}'.format(geom.cdf(3, 0.5)))
print('\ngeom.rvs(0.8, size=100) = {}'.format(geom.rvs(0.8, size=100)))
print('\ngeom.rvs(0.8, size=100) + 1 = {}'.format(geom.rvs(0.8, size=100)+1))
geom.pmf(3, 0.5) = 0.125 geom.cdf(3, 0.5) = 0.875 geom.rvs(0.8, size=100) = [1 1 1 1 1 1 1 1 2 2 1 1 2 1 2 1 3 1 1 2 1 1 1 1 1 1 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 1 3 1 2 1 1 1 1 3 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2] geom.rvs(0.8, size=100) + 1 = [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 4 2 2 3 2 2 3 2 2 3 2 2 3 2 2 2 2 2 3 3 2 3 3 3 2 2 3 2 4 2 2 2 2 3 2 2 2]
For the Negative Binomial distribution, we have scipy.stats.nbinom
pmf
, cdf
, and rvs
. These take three inputs. For example, to calculate the $NBin(5, 0.5)$ PMF at 3, we type nbimom.pmf(3, 5, 0.5)
.
from scipy.stats import nbinom
# to learn more about scipy.stats.nbinom, un-comment ouf the following line
#print(nbinom.__doc__)
print('nbinom.pmf(3, 5, 0.5) = {}'.format(nbinom.pmf(3, 5, 0.5)))
nbinom.pmf(3, 5, 0.5) = 0.1367187500000001
Finally, for the Poisson distribution and scipy.stats.poisson
, the three functions are pmf
, cdf
, and rvs
. These take two inputs. For example, to find the $Pois(10)$ CDF at 2, we type poisson.cdf(2, 10)
.
from scipy.stats import poisson
# to learn more about scipy.stats.poisson, un-comment ouf the following line
#print(poisson.__doc__)
print('poisson.cdf(2, 10) = {}'.format(poisson.cdf(2, 10)))
poisson.cdf(2, 10) = 0.0027693957155115775
Continuing with Example 4.4.4, let's use simulation to calculate the expected number of matches in a deck of cards. As in Chapter 1, we let $n$ be the number of cards in the deck and perform the experiment 104 times using iteration with a for
-loop.
np.random.seed(987)
n = 100
trials = 10**4
ordered = np.arange(1, n+1)
r = []
for i in range(trials):
shuffled = np.random.permutation(np.arange(1, n+1))
m = np.sum(shuffled == ordered)
r.append(m)
Now $r$ contains the number of matches from each of the 104 simulations. But instead of looking at the probability of at least one match, as in Chapter 1, we now want to find the expected number of matches. We can approximate this by the average of all the simulation results, that is, the arithmetic mean of the elements of $r$. This is accomplished with the numpy.mean
function:
np.mean(r)
1.0127
The command numpy.mean(r)
is equivalent to numpy.sum(r)/len(r)
. The result we get is very close to 1, confirming the calculation we did in Example 4.4.4 using indicator r.v.s. You can verify that no matter what value of $n$ you choose, numpy.mean(r)
will be very close to 1.
Let's calculate the expected number of distinct birthdays in a group of $k$ people by simulation. We'll let $k = 20$, but you can choose whatever value of $k$ you like.
np.random.seed(1597)
k = 20
trials = 10**4
r = []
for i in range(trials):
bdays = np.random.choice(np.arange(1,365+1), k)
uniqs = len(np.unique(bdays))
r.append(uniqs)
We use a for loop to iterate 104 times, so we just need to understand what is inside the body of the for loop. First, we sample k
times with replacement from the numbers 1 through 365 and call these the birthdays of the k
people, bdays
. Then, numpy.unique(bdays)
removes duplicates in the array bdays
, and len(numpy.unique(bdays))
returns the length of the array after duplicates have been removed (number of unique birthdays). This count of unique birthdays is appended to array r
.
Now r
contains the number of distinct birthdays that we observed in each of the 104 simulations. The average number of distinct birthdays across the 104 simulations is numpy.mean(r)
. We can compare the simulated value to the theoretical value that we found in Example 4.4.5 using indicator r.v.s:
simulated = np.mean(r)
theoretical = 365*(1-(364/365)**k)
print('simulated: {}'.format(simulated))
print('theoretical: {}'.format(theoretical))
simulated: 19.4921 theoretical: 19.487910239138333
When we ran the code, both the simulated and theoretical values gave us approximately 19.5.
Joseph K. Blitzstein and Jessica Hwang, Harvard University and Stanford University, © 2019 by Taylor and Francis Group, LLC