This Jupyter notebook is the Python equivalent of the R code in section 1.8 R, pp. 27 - 30, Introduction to Probability, 1st Edition, Blitzstein & Hwang.

In [1]:

```
import numpy as np
```

Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a NumPy array.

In Python, you can import the `numpy`

module using the `np`

shorthand alias idiom to save on typing. Some of the examples later in this notebook rely on random number generation, so we will seed the NumPy pseudo-random number generator with an arbitrary value so that if you run a code block, you should see the exact same results.

Here we pass in a list of integers to the `numpy.array`

function to create an instance of a NumPy array, and then print the array as follows:

In [2]:

```
v = np.array([3, 1, 4, 5, 9])
print(v)
```

The `numpy`

module (shortened to the alias `np`

) provides many convenient object instance methods that work on an `array`

. It is possible to use the functions of the `numpy`

module, or invoke these functions as object instance methods of an `array`

. Here are a few examples:

In [3]:

```
print('sum of v =', v.sum())
print('maximum of v =', v.max())
print('minimum of v =', v.min())
```

In [4]:

```
print('len(v) =', len(v))
print('v.size =', v.size)
```

To create a NumPy `array`

from the sequence $(1, 2, \dots, n)$, use the `numpy.arange`

function. In general, an ascending-order sequence $(m, \dots, n)$ can be generated with `arange(m, n+1)`

. Note that the second parameter `stop`

to the `arange`

function is *not* inclusive, so for an ascending-order sequence you must increment `stop`

by 1.

A descending-order sequence can be generated by providing -1 as the third parameter `step`

to the `arange`

function. However, for the reason mentioned above, you must conversely *decrement* the second parameter `stop`

if you want to include this value in the `array`

.

In [5]:

```
m = 1
n = 10
# example of ascending order sequence from m to n, inclusive
v = np.arange(m, n+1)
print(v)
## example of descending order sequence
v_reversed = np.arange(n, m-1, -1)
print(v_reversed)
```

Like a Python `list`

(and unlike vectors in R), NumPy `array`

is zero-indexed. To access the `i`

^{th} element of an `array`

named `v`

, use `v[i-1]`

.

In [6]:

```
print('1st element (index 0) of v is', v[0])
print('10th element (index 9) of v =', v[9])
```

To access a subset of the `array`

members, you can pass another NumPy `array`

of the target indices to the indexer of your `array`

.

You can also use a NumPy `array`

of boolean values (`True`

, `False`

) to index your `array`

, keeping any elements where the index corresponds to `True`

and filtering out elements where the corresponding index is `False`

.

In [7]:

```
# NumPy array for the indices 1, 3, 5
subset_ind = np.array([1, 3, 5])
print('elements of v at indices 1, 3 and 5: ', v[subset_ind])
# boolean indexer: all True indices will be included in the subset,
# while all False indexes will be filtered out
# keep all EXCEPT for array values at indices 2, 4, and 6
boolean_ind = np.array([True, True, False, True, False, True, False, True, True, True])
print('elements of v except at indices 2, 4 and 6: ', v[boolean_ind])
```

Many operations on NumPy `array`

are interpreted *component-wise*. For example, in math the cube of a vector doesn't have a standard definition, but

```
v**3
```

simply cubes each element individually.

In [8]:

```
print(v**3)
```

Similarly,

```
1 / np.arange(1, 101)**2
```

is a very compact way to get the vector $1, \frac{1}{2^2}, \frac{1}{3^2}, \ldots , \frac{1}{100^2}$

In [9]:

```
1 / np.arange(1, 101)**2
```

Out[9]:

In math, $v+w$ is undefined if $v$ and $w$ are vectors of different lengths, but with a NumPy `array`

, the shorter vector gets broadcast. For example, $v+3$ essentially turns 3 into a vector of 3's of the same length as `v`

, thus adding 3 to each element of $v$.

In [10]:

```
print(v + 3)
```

We can compute $n!$ using `scipy.special.factorial(n)`

and $\binom{n}{k}$ using `scipy.special.comb(n, k)`

. As we have seen, factorials grow extremely quickly. What is the largest $n$ for which `scipy.special.factorial`

returns a number? Beyond that point, `scipy.special.factorial`

will return `inf`

(infinity), without a warning message.

In [11]:

```
from scipy.special import factorial, comb
# to learn more about scipy.special.factorial, un-comment out the following line
#print(factorial.__doc__)
# to learn more about scipy.special.comb, un-comment out the following line
#print(comb.__doc__)
print('5! =', factorial(5))
print('6 choose 2 =', comb(6, 2))
print('200! =', factorial(200), '?')
```

But it is still possible to use `scipy.special.gammaln`

define a function to easily compute $log(n!)$. Similarly, it would then be possible to compose a function that computes $log \binom{n}{k}$.

In [12]:

```
from scipy.special import gammaln
# to learn more about scipy.special.gammaln, un-comment out the following line
#print(gammaln.__doc__)
def logfactorial(n):
return gammaln(n+1)
def logchoose(n, k):
num = logfactorial(n)
denom = logfactorial(n-k) + logfactorial(k)
return num - denom
print('log(200!) =', logfactorial(200))
print('log 100000 choose 999 =', logchoose(100000, 999))
```

The function `numpy.random.choice`

is a useful way of drawing random samples in NumPy. (Technically, they are pseudo-random since there is an underlying deterministic algorithm, but they "look like" random samples for almost all practical purposes.) For example,

In [13]:

```
# seed the random number generator
np.random.seed(1)
# Example: sampling without replacement
#
# do not forget that Python arrays are zero-indexed,
# and the 2nd argument to NumPy arange must be incremented by 1
# if you want to include that value
n = 10
k = 5
np.random.choice(np.arange(1, n+1), k, replace=False)
```

Out[13]:

generates a random sample of 5 of the numbers from 1 to 10, without replacement, and with equal probabilities given to each number. To sample with replacement instead, you can explicitly specify `replace=True`

, or you may leave that argument out altogether since the default for `numpy.random.choice`

is `replace=True`

.

In [14]:

```
np.random.seed(1)
# Example: sampling with replacement
np.random.choice(np.arange(1, n+1), k, replace=True)
```

Out[14]:

To obtain a random permutation of an `array`

of numbers $1, 2, \ldots, n$ we can use `numpy.random.shuffle`

. Note that this function operates on the given `array`

in-place.

In [15]:

```
np.random.seed(2)
m = 1
n = 10
v = np.arange(m, n+1)
print('v =', v)
np.random.shuffle(v)
print('v, shuffled =', v)
```

We can also use `numpy.random.choice`

to draw from a non-numeric `list`

or `array`

. For example, the Python built-in function `list`

can be used to transform a string of the 26 lowercase letters of the English alphabet into a list of individual letters. `numpy.random.choice`

will generate a random 7-letter "word" by sampling from the list of lowercase alphabet chars derived from `string.ascii_lowercase`

, without replacement. Lastly, the Python String function `join`

concatenates the 7 randomly selected letters into a "word".

In [16]:

```
np.random.seed(3)
import string
# split string of lower-case alphabets into an array
alpha = list(string.ascii_lowercase)
# randomly choose 7 letters, concatenate into a string
''.join(np.random.choice(alpha, 7, replace=False))
```

Out[16]:

`numpy.random.choice`

also allows us to specify general probabilities for sampling each number. For example,

In [17]:

```
np.random.seed(5)
# from the 4 numbers starting from 0
# obtain a sample of size 3
# with replacement
# using the probabilities listed in p
np.random.choice(4, 3, replace=True, p=[0.1, 0.2, 0.3, 0.4])
```

Out[17]:

samples 3 numbers between 0 and 3, with replacement, and with probabilities given by the parameter `p=[0.1, 0.2, 0.3, 0.4]`

. If the sampling is without replacement, then at each stage the probability of any not-yet-chosen number is *proportional* to its original probability.

Let's show by simulation that the probability of a matching card in Example 1.6.4 is approximately $1 − 1/e$ when the deck is sufficiently large. Using `numpy.random.permutation`

while iterating in a for-loop (see Python flow controls `for`

and `range`

), we can perform the experiment a bunch of times and see how many times we encounter at least one matching card:

In [18]:

```
np.random.seed(8)
n = 100
trials = 10000
ordered = np.arange(1, n+1)
tmp = []
for i in range(trials):
shuffled = np.random.permutation(np.arange(1, n+1))
m = np.sum(shuffled == ordered)
tmp.append(m)
results = np.array(tmp)
ans = np.sum(results >= 1) / trials
expected = 1 - 1/np.e
print('simulated value: {:.2F}'.format(ans))
print('expected value : {:.2F}'.format(expected))
```

First, we declare and assign values to variables for the size of the deck `n`

, and the number of `trials`

in our simulation.

Next, we generate a sequence from 1 to `n`

(stopping at `n+1`

to include `n`

) to represent our ordered deck of cards.

The code then loops for `trial`

number of times, where

- a permutation of a new sequence from 1 to
`n`

is created - the number of cards (indices) that match with our
`ordered`

sequence are counted as`m`

- the number of matches
`m`

are saved to a temporary accumulator array`tmp`

After completing `trial`

simulations, we create a NumPy `array`

`results`

from the `tmp`

accumulator, which lets us count the number of simulations where there was at least 1 match.

Finally, we add up the number of times where there was at least one matching card, and we divide by the number of simulations.

The following code uses the NumPy module functions `numpy.prod`

(which gives the product of the elements of the given `array`

) and `numpy.sum`

to calculate the probability of at least one birthday match in a group of 23 people:

In [19]:

```
k = 23
days_in_year = 365
numer = np.arange(days_in_year, days_in_year-k, -1)
denom = np.array([days_in_year]*k)
1.0 - np.sum(np.prod(numer/denom))
```

Out[19]:

Unfortunately, NumPy and SciPy do not have library functions such as `pbirthday`

and `qbirthday`

for computing birthday probabilities as does R. However, you can easily compose your own functions.

In [20]:

```
def pbirthday_naive(n):
""" Return the probability of at least 1 matching pair of birthdays out of n people.
Assumes 365-day year.
This is a naive implementation that may overflow or error under certain input.
"""
days_in_year = 365
if n < 2:
return 0.0
elif n > days_in_year:
return 1.0
else:
numer = np.arange(days_in_year, days_in_year-n, -1)
denom = np.array([days_in_year]*n)
return 1.0 - np.sum(np.prod( numer/denom ))
def qbirthday_naive(p):
""" Return an approximation of the number of people required to have at least 1 birthday
match with probability p.
This naive implemention is based on the Birthday problem article on Wikipedia.
https://en.wikipedia.org/wiki/Birthday_problem
"""
raw = np.sqrt( 2 * 365 * np.log(1.0 / (1.0-p)) )
return np.ceil(raw).astype(np.int32)
n = 23
pbirthday_naive_ans = pbirthday_naive(n)
print(('{:.1f}% chance of at least 1 birthday match '
'among {} people'.format(pbirthday_naive_ans*100.0, n)))
p = 0.50
qbirthday_naive_ans = qbirthday_naive(p)
print(('For an approximately {:.1f}% chance of at least 1 birthday match, '
'we need {} people'.format(p*100.0, qbirthday_naive_ans)))
```

We can also find the probability of having at least one *triple birthday match*, i.e., three people with the same birthday; with the `pbirthday`

and `qbirthday`

functions below, all we have to do is add `coincident=3`

to say we're looking for triple matches. For example, `pbirthday(23, coincident=3)`

returns 0.014, so 23 people give us only a 1.4% chance of a triple birthday match. `qbirthday(0.5, coincident=3)`

returns 88, so we'd need 88 people to have at least a 50% chance of at least one triple birthday match.

In [21]:

```
def pbirthday(n, classes=365, coincident=2):
""" Return the probability of a coincidence of a birthday match among n people.
Python implementation of R's pbirthday {stats}.
"""
k = coincident
c = classes
if k < 2:
return 1.0
if k == 2:
numer = np.arange(c, c-n, -1)
denom = np.array([c]*n)
return 1.0 - np.sum(np.prod( numer/denom ))
if k > n:
return 0.0
if n > c * (k - 1):
return 1.0
else:
lhs = n * np.exp(-n/(c * k))/(1 - n/(c * (k + 1)))**(1/k)
lxx = k * np.log(lhs) - (k - 1) * np.log(c) - gammaln(k + 1)
return -np.expm1(-np.exp(lxx))
def qbirthday(prob=0.5, classes=365, coincident=2):
""" Return the smallest number of people needed to have at least
the specified probability of coincidence.
Python implementation of R's qbirthday {stats}.
"""
k = coincident
c = classes
p = prob
if p <= 0.0:
return 1
if p >= 1.0:
return c * (k - 1) + 1
N = np.exp(((k - 1) * np.log(c) + gammaln(k + 1) + np.log(-np.log1p(-p)))/k)
N = np.ceil(N).astype(np.int32)
if pbirthday(N, c, k) < p:
N += 1
while pbirthday(N, c, k) < p:
N += 1
elif pbirthday(N - 1, c, k) >= p:
N -= 1
while pbirthday(N - 1, c, k) >= p:
N -= 1
return N
n = 23
pbirthday_ans = pbirthday(n, coincident=3)
print(('{:.1F}% chance of triple birthday match '
'among {} people'.format(pbirthday_ans*100.0, n)))
p = 0.5
qbirthday_ans = qbirthday(p, coincident=3)
print(('We need {} people for a triple match '
'with probability of {:.1F}%'.format(qbirthday_ans, p*100.0)))
```

To simulate the birthday problem, we can use

In [22]:

```
np.random.seed(13)
b = np.random.choice(np.arange(1, 365+1), size=23, replace=True)
u, c = np.unique(b, return_counts=True)
```

to generate random birthdays for 23 people and then use `numpy.unique`

, specifiying `return_counts=True`

, to tally up the counts of how many people were born on each day. We can run 10^{4} repetitions as follows:

In [23]:

```
np.random.seed(21)
n = 23
trials = 10**4
ordered = np.arange(1, n+1)
m = 0
for i in range(trials):
b = np.random.choice(np.arange(1, 365+1), size=n, replace=True)
u, c = np.unique(b, return_counts=True)
if np.sum(c > 1):
m += 1
p = m / trials
print(('In a simulation with {} trials, '
'the probability of a birthday match with {} people '
'is {}'.format(trials, n, p)))
```

If the probabilities of various days are not all equal, the calculation becomes much more difficult, but the simulation can easily be extended since `numpy.random.choice`

allows us to specify a list `p`

of the probability of each day (by default sample assigns equal probabilities, so in the above the probability is 1/365 for each day).

© Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science).