This Jupyter notebook is the Python equivalent of the R code in Section 1.8 (pp. 29-32) of Introduction to Probability, Second Edition, by Blitzstein & Hwang.
import numpy as np
Rather than the usual Python list, for probability and statistics it is more convenient to use a NumPy array. In Python, you can import the numpy module using the np shorthand alias idiom to save on typing. Some of the examples later in this notebook rely on random number generation, so we will seed the NumPy pseudo-random number generator with an arbitrary value; that way, if you run a code block, you should see exactly the same results.
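For example (a minimal sketch; the seed value 0 is an arbitrary choice, not from the book), seeding up front makes subsequent draws reproducible:

```python
import numpy as np

np.random.seed(0)            # arbitrary fixed seed
first = np.random.choice(10, 3)
np.random.seed(0)            # re-seeding replays the same pseudo-random stream
second = np.random.choice(10, 3)
print(np.array_equal(first, second))  # True
```

Any fixed integer works as a seed; the point is only that the same seed yields the same sequence of draws.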
Here we pass a list of integers to the numpy.array function to create an instance of a NumPy array, and then print the array as follows:
v = np.array([3, 1, 4, 5, 9])
print(v)
[3 1 4 5 9]
The numpy module (shortened to the alias np) provides many convenient functions that work on an array. You can either call the functions of the numpy module directly, or invoke them as object instance methods of an array. Here are a few examples:
print('sum of v =', v.sum())
print('maximum of v =', v.max())
print('minimum of v =', v.min())
sum of v = 22 maximum of v = 9 minimum of v = 1
print('len(v) =', len(v))
print('v.size =', v.size)
len(v) = 5 v.size = 5
To create a NumPy array from the sequence $(1, 2, \dots, n)$, use the numpy.arange function. In general, an ascending sequence $(m, \dots, n)$ can be generated with arange(m, n+1). Note that the second parameter stop of the arange function is exclusive, so for an ascending sequence you must increment stop by 1.
A descending sequence can be generated by passing -1 as the third parameter step to the arange function. However, for the reason above, you must conversely decrement the second parameter stop by 1 if you want that value included in the array.
m = 1
n = 10
# example of ascending order sequence from m to n, inclusive
v = np.arange(m, n+1)
print(v)
# example of descending order sequence from n to m, inclusive
v_reversed = np.arange(n, m-1, -1)
print(v_reversed)
[ 1 2 3 4 5 6 7 8 9 10] [10 9 8 7 6 5 4 3 2 1]
Like a Python list (and unlike vectors in R), a NumPy array is zero-indexed. To access the ith element of an array named v, use v[i-1].
print('1st element (index 0) of v is', v[0])
print('10th element (index 9) of v =', v[9])
1st element (index 0) of v is 1 10th element (index 9) of v = 10
To access a subset of the array's elements, you can pass another NumPy array of the target indices to the indexer of your array.
You can also index your array with a NumPy array of boolean values (True, False), keeping any elements where the corresponding index is True and filtering out elements where it is False.
# NumPy array for the indices 1, 3, 5
subset_ind = np.array([1, 3, 5])
print('elements of v at indices 1, 3 and 5: ', v[subset_ind])
# boolean indexer: all True indices will be included in the subset,
# while all False indices will be filtered out;
# keep all EXCEPT the array values at indices 2, 4, and 6
boolean_ind = np.array([True, True, False, True, False, True, False, True, True, True])
print('elements of v except at indices 2, 4 and 6: ', v[boolean_ind])
elements of v at indices 1, 3 and 5: [2 4 6] elements of v except at indices 2, 4 and 6: [ 1 2 4 6 8 9 10]
Many operations on a NumPy array are interpreted component-wise. For example, in math the cube of a vector doesn't have a standard definition, but v**3 simply cubes each element individually.
print(v**3)
[ 1 8 27 64 125 216 343 512 729 1000]
Similarly, 1 / np.arange(1, 101)**2 is a very compact way to get the vector $1, \frac{1}{2^2}, \frac{1}{3^2}, \ldots, \frac{1}{100^2}$:
1 / np.arange(1, 101)**2
array([1.00000000e+00, 2.50000000e-01, 1.11111111e-01, 6.25000000e-02, 4.00000000e-02, 2.77777778e-02, 2.04081633e-02, 1.56250000e-02, 1.23456790e-02, 1.00000000e-02, 8.26446281e-03, 6.94444444e-03, 5.91715976e-03, 5.10204082e-03, 4.44444444e-03, 3.90625000e-03, 3.46020761e-03, 3.08641975e-03, 2.77008310e-03, 2.50000000e-03, 2.26757370e-03, 2.06611570e-03, 1.89035917e-03, 1.73611111e-03, 1.60000000e-03, 1.47928994e-03, 1.37174211e-03, 1.27551020e-03, 1.18906064e-03, 1.11111111e-03, 1.04058273e-03, 9.76562500e-04, 9.18273646e-04, 8.65051903e-04, 8.16326531e-04, 7.71604938e-04, 7.30460190e-04, 6.92520776e-04, 6.57462196e-04, 6.25000000e-04, 5.94883998e-04, 5.66893424e-04, 5.40832883e-04, 5.16528926e-04, 4.93827160e-04, 4.72589792e-04, 4.52693526e-04, 4.34027778e-04, 4.16493128e-04, 4.00000000e-04, 3.84467512e-04, 3.69822485e-04, 3.55998576e-04, 3.42935528e-04, 3.30578512e-04, 3.18877551e-04, 3.07787011e-04, 2.97265161e-04, 2.87273772e-04, 2.77777778e-04, 2.68744961e-04, 2.60145682e-04, 2.51952633e-04, 2.44140625e-04, 2.36686391e-04, 2.29568411e-04, 2.22766763e-04, 2.16262976e-04, 2.10039908e-04, 2.04081633e-04, 1.98373339e-04, 1.92901235e-04, 1.87652468e-04, 1.82615047e-04, 1.77777778e-04, 1.73130194e-04, 1.68662506e-04, 1.64365549e-04, 1.60230732e-04, 1.56250000e-04, 1.52415790e-04, 1.48720999e-04, 1.45158949e-04, 1.41723356e-04, 1.38408304e-04, 1.35208221e-04, 1.32117849e-04, 1.29132231e-04, 1.26246686e-04, 1.23456790e-04, 1.20758363e-04, 1.18147448e-04, 1.15620303e-04, 1.13173382e-04, 1.10803324e-04, 1.08506944e-04, 1.06281220e-04, 1.04123282e-04, 1.02030405e-04, 1.00000000e-04])
In math, $v+w$ is undefined if $v$ and $w$ are vectors of different lengths, but with a NumPy array the shorter operand gets broadcast. For example, $v+3$ essentially turns 3 into a vector of 3's of the same length as v, thus adding 3 to each element of $v$.
print(v + 3)
[ 4 5 6 7 8 9 10 11 12 13]
We can compute $n!$ using scipy.special.factorial(n) and $\binom{n}{k}$ using scipy.special.comb(n, k). As we have seen, factorials grow extremely quickly. What is the largest $n$ for which scipy.special.factorial returns a finite number? Beyond that point, scipy.special.factorial will return inf (infinity) without a warning message.
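One way to answer that question (a quick sketch, not from the book) is to search for the first value of $n$ at which the float64 result overflows:

```python
import numpy as np
from scipy.special import factorial

# find the largest n for which factorial(n) is still finite in float64
n = 1
while np.isfinite(factorial(n + 1)):
    n += 1
print(n)  # 170, since 171! exceeds the largest float64 (about 1.8e308)
```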
from scipy.special import factorial, comb
# to learn more about scipy.special.factorial, uncomment the following line
#print(factorial.__doc__)
# to learn more about scipy.special.comb, uncomment the following line
#print(comb.__doc__)
print('5! =', factorial(5))
print('6 choose 2 =', comb(6, 2))
print('200! =', factorial(200), '?')
5! = 120.0 6 choose 2 = 15.0 200! = inf ?
But it is still possible to use scipy.special.gammaln to define a function that easily computes $\log(n!)$. Similarly, we can then compose a function that computes $\log \binom{n}{k}$.
from scipy.special import gammaln
# to learn more about scipy.special.gammaln, uncomment the following line
#print(gammaln.__doc__)
def logfactorial(n):
    return gammaln(n+1)

def logchoose(n, k):
    num = logfactorial(n)
    denom = logfactorial(n-k) + logfactorial(k)
    return num - denom
print('log(200!) =', logfactorial(200))
print('log 100000 choose 999 =', logchoose(100000, 999))
log(200!) = 863.2319871924054 log 100000 choose 999 = 5591.190431187046
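As a quick sanity check of these log-scale helpers (repeating the definition so the snippet stands alone), exponentiating logchoose for small inputs recovers the ordinary binomial coefficient:

```python
import numpy as np
from scipy.special import gammaln

def logchoose(n, k):
    # log(n choose k) = log n! - log (n-k)! - log k!
    return gammaln(n + 1) - gammaln(n - k + 1) - gammaln(k + 1)

print(np.exp(logchoose(6, 2)))  # approximately 15.0, matching comb(6, 2)
```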
The function numpy.random.choice is a useful way of drawing random samples in NumPy. (Technically, they are pseudo-random, since there is an underlying deterministic algorithm, but they "look like" random samples for almost all practical purposes.) For example,
# seed the random number generator
np.random.seed(1)
# Example: sampling without replacement
#
# do not forget that Python arrays are zero-indexed,
# and the 2nd argument to NumPy arange must be incremented by 1
# if you want to include that value
n = 10
k = 5
np.random.choice(np.arange(1, n+1), k, replace=False)
array([ 3, 10, 7, 5, 1])
generates a random sample of 5 of the numbers from 1 to 10, without replacement, and with equal probabilities given to each number. To sample with replacement instead, you can explicitly specify replace=True, or you may leave that argument out altogether, since the default for numpy.random.choice is replace=True.
np.random.seed(1)
# Example: sampling with replacement
np.random.choice(np.arange(1, n+1), k, replace=True)
array([ 6, 9, 10, 6, 1])
To obtain a random permutation of an array of the numbers $1, 2, \ldots, n$, we can use numpy.random.shuffle. Note that this function operates on the given array in-place.
np.random.seed(2)
m = 1
n = 10
v = np.arange(m, n+1)
print('v =', v)
np.random.shuffle(v)
print('v, shuffled =', v)
v = [ 1 2 3 4 5 6 7 8 9 10] v, shuffled = [ 5 2 6 1 8 3 4 7 10 9]
We can also use numpy.random.choice to draw from a non-numeric list or array. For example, the Python built-in function list can be used to transform a string of the 26 lowercase letters of the English alphabet into a list of individual letters. numpy.random.choice then generates a random 7-letter "word" by sampling from that list of letters derived from string.ascii_lowercase, without replacement. Lastly, the Python string method join concatenates the 7 randomly selected letters into a "word".
np.random.seed(3)
import string
# split string of lowercase letters into a list
alpha = list(string.ascii_lowercase)
# randomly choose 7 letters, concatenate into a string
''.join(np.random.choice(alpha, 7, replace=False))
'srmxpqn'
numpy.random.choice also allows us to specify general probabilities for sampling each number. For example,
np.random.seed(5)
# from the 4 numbers starting from 0
# obtain a sample of size 3
# with replacement
# using the probabilities listed in p
np.random.choice(4, 3, replace=True, p=[0.1, 0.2, 0.3, 0.4])
array([1, 3, 1], dtype=int64)
samples 3 numbers between 0 and 3, with replacement, and with probabilities given by the parameter p=[0.1, 0.2, 0.3, 0.4]. If the sampling is without replacement, then at each stage the probability of any not-yet-chosen number is proportional to its original probability.
Let's show by simulation that the probability of a matching card in Example 1.6.4 is approximately $1 - 1/e$ when the deck is sufficiently large. Using numpy.random.permutation while iterating in a for loop (see the Python flow-control statements for and range), we can perform the experiment a bunch of times and see how many times we encounter at least one matching card:
np.random.seed(8)
n = 100
trials = 10000
ordered = np.arange(1, n+1)
tmp = []
for i in range(trials):
    shuffled = np.random.permutation(np.arange(1, n+1))
    m = np.sum(shuffled == ordered)
    tmp.append(m)
results = np.array(tmp)
ans = np.sum(results >= 1) / trials
expected = 1 - 1/np.e
print('simulated value: {:.2F}'.format(ans))
print('expected value : {:.2F}'.format(expected))
simulated value: 0.62 expected value : 0.63
First, we declare and assign values to variables for the size of the deck n and the number of trials in our simulation.
Next, we generate a sequence from 1 to n (stopping at n+1 to include n) to represent our ordered deck of cards.
The code then loops trials times; on each iteration:
- a shuffled permutation of the numbers 1 to n is created
- the number of positions at which the shuffled and ordered sequences agree is counted as m
- the match counts m are appended to a temporary accumulator list tmp
After completing trials simulations, we create a NumPy array results from the tmp accumulator, which lets us count the number of simulations with at least 1 match.
Finally, we add up the number of simulations in which there was at least one matching card, and divide by the number of simulations.
The following code uses the NumPy module functions numpy.prod (which gives the product of the elements of the given array) and numpy.sum to calculate the probability of at least one birthday match in a group of 23 people:
k = 23
days_in_year = 365
numer = np.arange(days_in_year, days_in_year-k, -1)
denom = np.array([days_in_year]*k)
1.0 - np.sum(np.prod(numer/denom))
0.5072972343239857
Unfortunately, NumPy and SciPy do not provide library functions such as R's pbirthday and qbirthday for computing birthday probabilities. However, you can easily compose your own functions.
def pbirthday_naive(n):
    """Return the probability of at least 1 matching pair of birthdays out of n people.
    Assumes a 365-day year.
    This is a naive implementation that may overflow or error for certain input.
    """
    days_in_year = 365
    if n < 2:
        return 0.0
    elif n > days_in_year:
        return 1.0
    else:
        numer = np.arange(days_in_year, days_in_year-n, -1)
        denom = np.array([days_in_year]*n)
        return 1.0 - np.sum(np.prod(numer/denom))

def qbirthday_naive(p):
    """Return an approximation of the number of people required to have at least 1 birthday
    match with probability p.
    This naive implementation is based on the Birthday problem article on Wikipedia:
    https://en.wikipedia.org/wiki/Birthday_problem
    """
    raw = np.sqrt(2 * 365 * np.log(1.0 / (1.0-p)))
    return np.ceil(raw).astype(np.int32)
n = 23
pbirthday_naive_ans = pbirthday_naive(n)
print(('{:.1f}% chance of at least 1 birthday match '
'among {} people'.format(pbirthday_naive_ans*100.0, n)))
p = 0.50
qbirthday_naive_ans = qbirthday_naive(p)
print(('For an approximately {:.1f}% chance of at least 1 birthday match, '
'we need {} people'.format(p*100.0, qbirthday_naive_ans)))
50.7% chance of at least 1 birthday match among 23 people For an approximately 50.0% chance of at least 1 birthday match, we need 23 people
We can also find the probability of having at least one triple birthday match, i.e., three people with the same birthday; with the pbirthday and qbirthday functions below, all we have to do is add coincident=3 to say we're looking for triple matches. For example, pbirthday(23, coincident=3) returns 0.014, so 23 people give us only a 1.4% chance of a triple birthday match. qbirthday(0.5, coincident=3) returns 88, so we'd need 88 people to have at least a 50% chance of at least one triple birthday match.
def pbirthday(n, classes=365, coincident=2):
    """Return the probability of a coincident-fold birthday match among n people.
    Python implementation of R's pbirthday {stats}.
    """
    k = coincident
    c = classes
    if k < 2:
        return 1.0
    if k == 2:
        numer = np.arange(c, c-n, -1)
        denom = np.array([c]*n)
        return 1.0 - np.sum(np.prod(numer/denom))
    if k > n:
        return 0.0
    if n > c * (k - 1):
        return 1.0
    else:
        lhs = n * np.exp(-n/(c * k))/(1 - n/(c * (k + 1)))**(1/k)
        lxx = k * np.log(lhs) - (k - 1) * np.log(c) - gammaln(k + 1)
        return -np.expm1(-np.exp(lxx))

def qbirthday(prob=0.5, classes=365, coincident=2):
    """Return the smallest number of people needed to have at least
    the specified probability of coincidence.
    Python implementation of R's qbirthday {stats}.
    """
    k = coincident
    c = classes
    p = prob
    if p <= 0.0:
        return 1
    if p >= 1.0:
        return c * (k - 1) + 1
    N = np.exp(((k - 1) * np.log(c) + gammaln(k + 1) + np.log(-np.log1p(-p)))/k)
    N = np.ceil(N).astype(np.int32)
    if pbirthday(N, c, k) < p:
        N += 1
        while pbirthday(N, c, k) < p:
            N += 1
    elif pbirthday(N - 1, c, k) >= p:
        N -= 1
        while pbirthday(N - 1, c, k) >= p:
            N -= 1
    return N
n = 23
pbirthday_ans = pbirthday(n, coincident=3)
print(('{:.1F}% chance of triple birthday match '
'among {} people'.format(pbirthday_ans*100.0, n)))
p = 0.5
qbirthday_ans = qbirthday(p, coincident=3)
print(('We need {} people for a triple match '
'with probability of {:.1F}%'.format(qbirthday_ans, p*100.0)))
1.4% chance of triple birthday match among 23 people We need 88 people for a triple match with probability of 50.0%
To simulate the birthday problem, we can use
np.random.seed(13)
b = np.random.choice(np.arange(1, 365+1), size=23, replace=True)
u, c = np.unique(b, return_counts=True)
to generate random birthdays for 23 people, and then use numpy.unique, specifying return_counts=True, to tally up how many people were born on each day. We can run $10^4$ repetitions as follows:
np.random.seed(21)
n = 23
trials = 10**4
m = 0
for i in range(trials):
    b = np.random.choice(np.arange(1, 365+1), size=n, replace=True)
    u, c = np.unique(b, return_counts=True)
    if np.sum(c > 1):
        m += 1
p = m / trials
print(('In a simulation with {} trials, '
'the probability of a birthday match with {} people '
'is {}'.format(trials, n, p)))
In a simulation with 10000 trials, the probability of a birthday match with 23 people is 0.5118
If the probabilities of the various days are not all equal, the calculation becomes much more difficult, but the simulation can easily be extended, since numpy.random.choice allows us to specify a list p of the probability of each day (by default numpy.random.choice assigns equal probabilities, so in the above the probability is 1/365 for each day).
Joseph K. Blitzstein and Jessica Hwang, Harvard University and Stanford University, © 2019 by Taylor and Francis Group, LLC