Chapter 1: Probability and counting

This Jupyter notebook is the Python equivalent of the R code in section 1.8 R, pp. 27 - 30, Introduction to Probability, 1st Edition, Blitzstein & Hwang.


In [1]:
import numpy as np

Vectors

Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a NumPy array.

In Python, you can import the numpy module using the np shorthand alias idiom to save on typing. Some of the examples later in this notebook rely on random number generation, so we will seed the NumPy pseudo-random number generator with an arbitrary value so that if you run a code block, you should see the exact same results.

Here we pass in a list of integers to the numpy.array function to create an instance of a NumPy array, and then print the array as follows:

In [2]:
v = np.array([3, 1, 4, 5, 9])
print(v)
[3 1 4 5 9]

The numpy module (shortened to the alias np) provides many convenient object instance methods that work on an array. It is possible to use the functions of the numpy module, or invoke these functions as object instance methods of an array. Here are a few examples:

  • v.sum() - sum array elements; equivalent to numpy.sum(v)
  • v.max() - return the maximum value of the array; equivalent to numpy.max(v)
  • v.min() - return the maximum value of the array; equivalent to numpy.min(v)
In [3]:
print('sum of v =', v.sum())

print('maximum of v =', v.max())

print('minimum of v =', v.min())
sum of v = 22
maximum of v = 9
minimum of v = 1

To obtain the number of elements in the array, you can use either:

  • the Python built-in len function, passing in the array
  • the size attribute of the NumPy array
In [4]:
print('len(v) =', len(v))

print('v.size =', v.size)
len(v) = 5
v.size = 5

To create a NumPy array from the sequence $(1, 2, \dots, n)$, use the numpy.arange function. In general, an ascending-order sequence $(m, \dots, n)$ can be generated with arange(m, n+1). Note that the second parameter stop to the arange function is not inclusive, so for an ascending-order sequence you must increment stop by 1.

A descending-order sequence can be generated by providing -1 as the third parameter step to the arange function. However, for the reason mentioned above, you must conversely decrement the second parameter stop if you want to include this value in the array.

In [5]:
m = 1
n = 10

# example of ascending order sequence from m to n, inclusive
v = np.arange(m, n+1)
print(v)

## example of descending order sequence
v_reversed = np.arange(n, m-1, -1)
print(v_reversed)
[ 1  2  3  4  5  6  7  8  9 10]
[10  9  8  7  6  5  4  3  2  1]

Like a Python list (and unlike vectors in R), NumPy array is zero-indexed. To access the ith element of an array named v, use v[i-1].

In [6]:
print('1st element (index 0) of v is', v[0])

print('10th element (index 9) of v =', v[9])
1st element (index 0) of v is 1
10th element (index 9) of v = 10

To access a subset of the array members, you can pass another NumPy array of the target indices to the indexer of your array.

You can also use a NumPy array of boolean values (True, False) to index your array, keeping any elements where the index corresponds to True and filtering out elements where the corresponding index is False.

In [7]:
# NumPy array for the indices 1, 3, 5
subset_ind = np.array([1, 3, 5])
print('elements of v at indices 1, 3 and 5: ', v[subset_ind])

# boolean indexer: all True indices will be included in the subset,
# while all False indexes will be filtered out
# keep all EXCEPT for array values at indices 2, 4, and 6
boolean_ind = np.array([True, True, False, True, False, True, False, True, True, True])
print('elements of v except at indices 2, 4 and 6: ', v[boolean_ind])
elements of v at indices 1, 3 and 5:  [2 4 6]
elements of v except at indices 2, 4 and 6:  [ 1  2  4  6  8  9 10]

Many operations on NumPy array are interpreted component-wise. For example, in math the cube of a vector doesn't have a standard definition, but

v**3

simply cubes each element individually.

In [8]:
print(v**3)
[   1    8   27   64  125  216  343  512  729 1000]

Similarly,

1 / np.arange(1, 101)**2

is a very compact way to get the vector $1, \frac{1}{2^2}, \frac{1}{3^2}, \ldots , \frac{1}{100^2}$

In [9]:
1 / np.arange(1, 101)**2
Out[9]:
array([1.00000000e+00, 2.50000000e-01, 1.11111111e-01, 6.25000000e-02,
       4.00000000e-02, 2.77777778e-02, 2.04081633e-02, 1.56250000e-02,
       1.23456790e-02, 1.00000000e-02, 8.26446281e-03, 6.94444444e-03,
       5.91715976e-03, 5.10204082e-03, 4.44444444e-03, 3.90625000e-03,
       3.46020761e-03, 3.08641975e-03, 2.77008310e-03, 2.50000000e-03,
       2.26757370e-03, 2.06611570e-03, 1.89035917e-03, 1.73611111e-03,
       1.60000000e-03, 1.47928994e-03, 1.37174211e-03, 1.27551020e-03,
       1.18906064e-03, 1.11111111e-03, 1.04058273e-03, 9.76562500e-04,
       9.18273646e-04, 8.65051903e-04, 8.16326531e-04, 7.71604938e-04,
       7.30460190e-04, 6.92520776e-04, 6.57462196e-04, 6.25000000e-04,
       5.94883998e-04, 5.66893424e-04, 5.40832883e-04, 5.16528926e-04,
       4.93827160e-04, 4.72589792e-04, 4.52693526e-04, 4.34027778e-04,
       4.16493128e-04, 4.00000000e-04, 3.84467512e-04, 3.69822485e-04,
       3.55998576e-04, 3.42935528e-04, 3.30578512e-04, 3.18877551e-04,
       3.07787011e-04, 2.97265161e-04, 2.87273772e-04, 2.77777778e-04,
       2.68744961e-04, 2.60145682e-04, 2.51952633e-04, 2.44140625e-04,
       2.36686391e-04, 2.29568411e-04, 2.22766763e-04, 2.16262976e-04,
       2.10039908e-04, 2.04081633e-04, 1.98373339e-04, 1.92901235e-04,
       1.87652468e-04, 1.82615047e-04, 1.77777778e-04, 1.73130194e-04,
       1.68662506e-04, 1.64365549e-04, 1.60230732e-04, 1.56250000e-04,
       1.52415790e-04, 1.48720999e-04, 1.45158949e-04, 1.41723356e-04,
       1.38408304e-04, 1.35208221e-04, 1.32117849e-04, 1.29132231e-04,
       1.26246686e-04, 1.23456790e-04, 1.20758363e-04, 1.18147448e-04,
       1.15620303e-04, 1.13173382e-04, 1.10803324e-04, 1.08506944e-04,
       1.06281220e-04, 1.04123282e-04, 1.02030405e-04, 1.00000000e-04])

In math, $v+w$ is undefined if $v$ and $w$ are vectors of different lengths, but with a NumPy array, the shorter vector gets broadcast. For example, $v+3$ essentially turns 3 into a vector of 3's of the same length as v, thus adding 3 to each element of $v$.

In [10]:
print(v + 3)
[ 4  5  6  7  8  9 10 11 12 13]

Factorials and binomial coefficients

We can compute $n!$ using scipy.special.factorial(n) and $\binom{n}{k}$ using scipy.special.comb(n, k). As we have seen, factorials grow extremely quickly. What is the largest $n$ for which scipy.special.factorial returns a number? Beyond that point, scipy.special.factorial will return inf (infinity), without a warning message.

In [11]:
from scipy.special import factorial, comb

# to learn more about scipy.special.factorial, un-comment out the following line
#print(factorial.__doc__)

# to learn more about scipy.special.comb, un-comment out the following line
#print(comb.__doc__)

print('5! =', factorial(5))

print('6 choose 2 =', comb(6, 2))

print('200! =', factorial(200), '?')
5! = 120.0
6 choose 2 = 15.0
200! = inf ?

But it is still possible to use scipy.special.gammaln define a function to easily compute $log(n!)$. Similarly, it would then be possible to compose a function that computes $log \binom{n}{k}$.

In [12]:
from scipy.special import gammaln

# to learn more about scipy.special.gammaln, un-comment out the following line
#print(gammaln.__doc__)

def logfactorial(n):
    return gammaln(n+1)

def logchoose(n, k):
    num = logfactorial(n)
    denom = logfactorial(n-k) + logfactorial(k)
    return num - denom

print('log(200!) =', logfactorial(200))

print('log 100000 choose 999 =', logchoose(100000, 999))
log(200!) = 863.2319871924054
log 100000 choose 999 = 5591.190431187046

Sampling and simulation

The function numpy.random.choice is a useful way of drawing random samples in NumPy. (Technically, they are pseudo-random since there is an underlying deterministic algorithm, but they "look like" random samples for almost all practical purposes.) For example,

In [13]:
# seed the random number generator
np.random.seed(1)

# Example: sampling without replacement
#
# do not forget that Python arrays are zero-indexed,
# and the 2nd argument to NumPy arange must be incremented by 1
# if you want to include that value
n = 10
k = 5
np.random.choice(np.arange(1, n+1), k, replace=False)
Out[13]:
array([ 3, 10,  7,  5,  1])

generates a random sample of 5 of the numbers from 1 to 10, without replacement, and with equal probabilities given to each number. To sample with replacement instead, you can explicitly specify replace=True, or you may leave that argument out altogether since the default for numpy.random.choice is replace=True.

In [14]:
np.random.seed(1)

# Example: sampling with replacement
np.random.choice(np.arange(1, n+1), k, replace=True)
Out[14]:
array([ 6,  9, 10,  6,  1])

To obtain a random permutation of an array of numbers $1, 2, \ldots, n$ we can use numpy.random.shuffle. Note that this function operates on the given array in-place.

In [15]:
np.random.seed(2)

m = 1
n = 10

v = np.arange(m, n+1)
print('v =', v)

np.random.shuffle(v)
print('v, shuffled =', v)
v = [ 1  2  3  4  5  6  7  8  9 10]
v, shuffled = [ 5  2  6  1  8  3  4  7 10  9]

We can also use numpy.random.choice to draw from a non-numeric list or array. For example, the Python built-in function list can be used to transform a string of the 26 lowercase letters of the English alphabet into a list of individual letters. numpy.random.choice will generate a random 7-letter "word" by sampling from the alphabet list, without replacement. Lastly, the Python String function join concatenates the 7 randomly selected letters into a "word".

In [16]:
np.random.seed(3)

# split string of lower-case alphabets into an array
alpha = list('abcdefghijklmnopqrstuvwxyz')

# randomly choose 7 letters, concatenate into a string
''.join(np.random.choice(alpha, 7, replace=False))
Out[16]:
'srmxpqn'

numpy.random.choice also allows us to specify general probabilities for sampling each number. For example,

In [17]:
np.random.seed(5)

# from the 4 numbers starting from 0
# obtain a sample of size 3
# with replacement
# using the probabilities listed in p
np.random.choice(4, 3, replace=True, p=[0.1, 0.2, 0.3, 0.4])
Out[17]:
array([1, 3, 1], dtype=int64)

samples 3 numbers between 0 and 3, with replacement, and with probabilities given by the parameter p=[0.1, 0.2, 0.3, 0.4]. If the sampling is without replacement, then at each stage the probability of any not-yet-chosen number is proportional to its original probability.

Matching problem simulation

Let's show by simulation that the probability of a matching card in Example 1.6.4 is approximately $1 − 1/e$ when the deck is sufficiently large. Using numpy.random.permutation while iterating in a for-loop (see Python flow controls for and range), we can perform the experiment a bunch of times and see how many times we encounter at least one matching card:

In [18]:
np.random.seed(8)

n = 100
trials = 10000

ordered = np.arange(1, n+1)

tmp = []
for i in range(trials):    
    shuffled = np.random.permutation(np.arange(1, n+1))
    m = np.sum(shuffled == ordered)
    tmp.append(m)
    
results = np.array(tmp)
ans = np.sum(results >= 1) / trials

expected = 1 - 1/np.e

print('simulated value: {:.2F}'.format(ans))
print('expected value : {:.2F}'.format(expected))
simulated value: 0.62
expected value : 0.63

First, we declare and assign values to variables for the size of the deck n, and the number of trials in our simulation.

Next, we generate a sequence from 1 to n (stopping at n+1 to include n) to represent our ordered deck of cards.

The code then loops for trial number of times, where

  • a permutation of a new sequence from 1 to n is created
  • the number of cards (indices) that match with our ordered sequence are counted as m
  • the number of matches m are saved to a temporary accumulator array tmp

After completing trial simulations, we create a NumPy array results from the tmp accumulator, which lets us count the number of simulations where there was at least 1 match.

Finally, we add up the number of times where there was at least one matching card, and we divide by the number of simulations.

Birthday problem calculation and simulation

The following code uses the NumPy module functions numpy.prod (which gives the product of the elements of the given array) and numpy.sum to calculate the probability of at least one birthday match in a group of 23 people:

In [19]:
k = 23
days_in_year = 365

numer = np.arange(days_in_year, days_in_year-k, -1)
denom = np.array([days_in_year]*k)

1.0 - np.sum(np.prod(numer/denom))
Out[19]:
0.5072972343239857

Unfortunately, NumPy and SciPy do not have library functions such as pbirthday and qbirthday for computing birthday probabilities as does R. However, you can easily compose your own functions.

In [20]:
def pbirthday_naive(n):
    """ Return the probability of at least 1 matching pair of birthdays out of n people.
        Assumes 365-day year.
    
        This is a naive implementation that may overflow or error under certain input.
    """
    days_in_year = 365
    if n < 2:
        return 0.0
    elif n > days_in_year:
        return 1.0
    else:
        numer = np.arange(days_in_year, days_in_year-n, -1)
        denom = np.array([days_in_year]*n)
        return 1.0 - np.sum(np.prod( numer/denom ))

def qbirthday_naive(p):
    """ Return an approximation of the number of people required to have at least 1 birthday
        match with probability p.
        
        This naive implemention is based on the Birthday problem article on Wikipedia.
        https://en.wikipedia.org/wiki/Birthday_problem
    """
    raw = np.sqrt( 2 * 365 * np.log(1.0 / (1.0-p)) )
    return np.ceil(raw).astype(np.int32)

n = 23
pbirthday_naive_ans = pbirthday_naive(n)
print(('{:.1f}% chance of at least 1 birthday match '
       'among {} people'.format(pbirthday_naive_ans*100.0, n)))

p = 0.50
qbirthday_naive_ans = qbirthday_naive(p)
print(('For an approximately {:.1f}% chance of at least 1 birthday match, '
       'we need {} people'.format(p*100.0, qbirthday_naive_ans)))
50.7% chance of at least 1 birthday match among 23 people
For an approximately 50.0% chance of at least 1 birthday match, we need 23 people

We can also find the probability of having at least one triple birthday match, i.e., three people with the same birthday; with the pbirthday and qbirthday functions below, all we have to do is add coincident=3 to say we're looking for triple matches. For example, pbirthday(23 ,coincident=3) returns 0.014, so 23 people give us only a 1.4% chance of a triple birthday match. qbirthday(0.5, coincident=3) returns 88, so we'd need 88 people to have at least a 50% chance of at least one triple birthday match.

In [21]:
def pbirthday(n, classes=365, coincident=2):
    """ Return the probability of a coincidence of a birthday match among n people.
    
        Python implementation of R's pbirthday {stats}.
    """
    k = coincident
    c = classes
    
    if k < 2:
        return 1.0
    if k == 2:
        numer = np.arange(c, c-n, -1)
        denom = np.array([c]*n)
        return 1.0 - np.sum(np.prod( numer/denom ))
    if k > n: 
        return 0.0
    if n > c * (k - 1):
        return 1.0
    else:
        lhs = n * np.exp(-n/(c * k))/(1 - n/(c * (k + 1)))**(1/k)
        lxx = k * np.log(lhs) - (k - 1) * np.log(c) - gammaln(k + 1)
        return -np.expm1(-np.exp(lxx))

def qbirthday(prob=0.5, classes=365, coincident=2):
    """ Return the smallest number of people needed to have at least
        the specified probability of coincidence.
    
        Python implementation of R's qbirthday {stats}.
    """
    k = coincident
    c = classes
    p = prob
    
    if p <= 0.0:
        return 1
    if p >= 1.0:
        return c * (k - 1) + 1
    
    N = np.exp(((k - 1) * np.log(c) + gammaln(k + 1) + np.log(-np.log1p(-p)))/k)
    N = np.ceil(N).astype(np.int32)
    if pbirthday(N, c, k) < p:
        N += 1
        while pbirthday(N, c, k) < p:
            N += 1
    elif pbirthday(N - 1, c, k) >= p:
        N -= 1
        while pbirthday(N - 1, c, k) >= p:
            N -= 1
    return N

n = 23
pbirthday_ans = pbirthday(n, coincident=3)
print(('{:.1F}% chance of triple birthday match '
       'among {} people'.format(pbirthday_ans*100.0, n)))

p = 0.5
qbirthday_ans = qbirthday(p, coincident=3)
print(('We need {} people for a triple match '
       'with probability of {:.1F}%'.format(qbirthday_ans, p*100.0)))
1.4% chance of triple birthday match among 23 people
We need 88 people for a triple match with probability of 50.0%

To simulate the birthday problem, we can use

In [22]:
np.random.seed(13)

b = np.random.choice(np.arange(1, 365+1), size=23, replace=True)
u, c = np.unique(b, return_counts=True)

to generate random birthdays for 23 people and then use numpy.unique, specifiying return_counts=True, to tally up the counts of how many people were born on each day. We can run 104 repetitions as follows:

In [23]:
np.random.seed(21)

n = 23
trials = 10**4

ordered = np.arange(1, n+1)

m = 0
for i in range(trials):    
    b = np.random.choice(np.arange(1, 365+1), size=n, replace=True)
    u, c = np.unique(b, return_counts=True)
    if np.sum(c > 1):
        m += 1

p =  m / trials  
print(('In a simulation with {} trials, ' 
       'the probability of a birthday match with {} people '
       'is {}'.format(trials, n, p)))
In a simulation with 10000 trials, the probability of a birthday match with 23 people is 0.5118

If the probabilities of various days are not all equal, the calculation becomes much more difficult, but the simulation can easily be extended since numpy.random.choice allows us to specify a list p of the probability of each day (by default sample assigns equal probabilities, so in the above the probability is 1/365 for each day).


© Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science).