This Jupyter notebook is the Python equivalent of the R code in section 3.11 R, pp. 126 - 128, Introduction to Probability, 1st Edition, Blitzstein & Hwang.

In [1]:

```
import numpy as np
```

All of the named distributions that we'll encounter in this book have been implemented in R, but in this Python-based notebook we will use their equivalents in SciPy. In this section we'll explain how to work with the Binomial and Hypergeometric distributions in `scipy.stats`. We will also explain in general how to generate r.v.s from any discrete distribution with a finite support. The aforementioned Statistical functions page is a handy list of the distributions in `scipy.stats`. Typing `scipy.stats.[distribution].__doc__` will display more information on the named distribution.

In general, for the named discrete distributions in `scipy.stats`, the three functions `pmf`, `cdf`, and `rvs` give the PMF, CDF, and random generation, respectively.

The Binomial distribution `binom` provides these three functions: `pmf`, `cdf`, and `rvs`. Unlike R, SciPy also provides an implementation of the Bernoulli distribution, in `scipy.stats.bernoulli`. But as an alternative to `bernoulli`, we can just use the Binomial functions with $k \in \{0,1\}$ and $n = 1$.
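To illustrate this equivalence (a quick sketch added here, not a cell from the book), we can check that the $Bern(p)$ PMF agrees with the $Bin(1, p)$ PMF at $k = 0$ and $k = 1$:

```python
import numpy as np
from scipy.stats import bernoulli, binom

p = 0.2
# Bern(p) is the same distribution as Bin(1, p), so the PMFs agree at k = 0 and k = 1
for k in (0, 1):
    print(bernoulli.pmf(k, p), binom.pmf(k, 1, p))
```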

In [2]:

```
from scipy.stats import binom
# to learn more about scipy.stats.binom, uncomment the following line
#print(binom.__doc__)
```

`binom.pmf` is `scipy.stats`' implementation of the Binomial PMF. It takes three inputs: the first is the value of `k` at which to evaluate the PMF, and the second and third are the parameters `n` and `p`. For example, `binom.pmf(3, 5, 0.2)` returns the probability $P(X = 3)$ where $X \sim Bin(5, 0.2)$. In other words,

In [3]:

```
k = 3
n = 5
p = 0.2
binom.pmf(k, n, p)
```

Out[3]:

`binom.cdf` is the Binomial CDF. It takes three inputs: the first is the value of `k` at which to evaluate the CDF, and the second and third are the parameters `n` and `p`. `binom.cdf(3, 5, 0.2)` is the probability $P(X \leq 3)$ where $X \sim Bin(5, 0.2)$. So

In [4]:

```
k = 3
n = 5
p = 0.2
binom.cdf(k, n, p)
```

Out[4]:

`binom.rvs` is a function for generating Binomial random variables. For `rvs`, the first and second inputs are the parameters `n` and `p`, and the `size` parameter is how many r.v.s we want to generate. Thus the command `binom.rvs(5, 0.2, size=7)` produces realizations of seven i.i.d. $Bin(5, 0.2)$ r.v.s. When we ran this command, we got

In [5]:

```
# seed the random number generator
np.random.seed(144)
n = 5
p = 0.2
binom.rvs(n, p, size=7)
```

Out[5]:

Unless you change the argument to `numpy.random.seed` in code cell [5] above, you'll get the same seven values every time this notebook is run.
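As a sketch of what the seed is doing (an added aside, not a cell from the book), re-seeding NumPy's global random number generator before a draw reproduces exactly the same values:

```python
import numpy as np
from scipy.stats import binom

# re-seeding NumPy's global RNG makes scipy.stats draws reproducible
np.random.seed(144)
first = binom.rvs(5, 0.2, size=7)
np.random.seed(144)
second = binom.rvs(5, 0.2, size=7)
print(np.array_equal(first, second))  # True
```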

We can also evaluate PMFs and CDFs at an entire vector of values. For example, recall that `numpy.arange(0, n+1)` is a quick way to list the integers from $0$ to $n$. The command `binom.pmf(numpy.arange(0, 5+1), 5, 0.2)` returns the 6 numbers $P(X = 0), P(X = 1), \ldots, P(X = 5)$, where $X \sim Bin(5, 0.2)$.

In [6]:

```
n = 5
p = 0.2
binom.pmf(np.arange(0, n+1), n, p)
```

Out[6]:
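One way to sanity-check these vectorized calls (an added aside, not from the book) is to note that the CDF values are the running sums of the PMF values:

```python
import numpy as np
from scipy.stats import binom

n, p = 5, 0.2
k = np.arange(0, n + 1)
# the CDF at k is the sum of the PMF over 0, 1, ..., k
print(np.allclose(np.cumsum(binom.pmf(k, n, p)), binom.cdf(k, n, p)))  # True
```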

The Hypergeometric distribution is implemented in `scipy.stats.hypergeom`, which similarly provides three functions: `pmf`, `cdf`, and `rvs`. As one might expect, `pmf` is the Hypergeometric PMF, `cdf` is the Hypergeometric CDF, and `rvs` generates Hypergeometric r.v.s. Since the Hypergeometric distribution has three parameters, each of these functions takes the three parameters as inputs. For `pmf` and `cdf`, the first input is the value at which we wish to evaluate the PMF or CDF, and the remaining three inputs are the parameters of the distribution.

In [7]:

```
from scipy.stats import hypergeom
# to learn more about scipy.stats.hypergeom, uncomment the following line
#print(hypergeom.__doc__)
```

The PMF is defined as

\begin{align} p(k, M, n, N) &= \frac{\binom{n}{k} \, \binom{M-n}{N-k}}{\binom{M}{N}} \end{align}

where

- $M$ = the total number of objects (black and white balls, for instance)
- $n$ = the total number of objects of Type I (say we're interested in the white balls)
- $N$ = the total number of objects drawn without replacement from $M$

Consider the case where we have a total of $M = 17$ balls (black and white), with $n = 10$ among them being white (and 7 being black), and we draw $N = 8$ of the balls without replacement.

In [8]:

```
n = 10 # white balls <- these are the ones we are interested in
b = 7 # black balls
M = n + b # total number of balls
N = 8 # number of balls to draw w/out replacement
```
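With these parameters in hand, a quick check (an added aside, not a cell from the book) is that the Hypergeometric PMF sums to 1 over the possible values of $k$:

```python
import numpy as np
from scipy.stats import hypergeom

n, b = 10, 7       # white and black balls
M, N = n + b, 8    # total number of balls, number drawn without replacement
# the PMF over k = 0, ..., N sums to 1 (it is 0 outside the support)
print(hypergeom.pmf(np.arange(0, N + 1), M, n, N).sum())
```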

`hypergeom.pmf(k, 17, 10, 8)` returns the probability $P(X = k)$ where $X \sim HGeom(10, 7, 8)$.

In [9]:

```
# k = 1
hypergeom.pmf(1, M, n, N)
```

Out[9]:

`hypergeom.cdf(k, 17, 10, 8)` returns $P(X \leq k)$.

In [10]:

```
# k = 2
hypergeom.cdf(2, M, n, N)
```

Out[10]:

`hypergeom.rvs(17, 10, 8, size=s)` uses the `size` parameter to specify the number of r.v.s we want to generate. `hypergeom.rvs(17, 10, 8, size=100)` generates 100 i.i.d. $HGeom(10, 7, 8)$ r.v.s.

In [11]:

```
np.random.seed(233)
hypergeom.rvs(M, n, N, size=100)
```

Out[11]:
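As a rough check on the simulation (an added aside; the sample size of 10,000 is our choice, not the book's), the sample mean of many draws should be close to the theoretical mean $Nn/M = 8 \cdot 10 / 17 \approx 4.71$:

```python
import numpy as np
from scipy.stats import hypergeom

n, b, N = 10, 7, 8
M = n + b
np.random.seed(233)
draws = hypergeom.rvs(M, n, N, size=10_000)
# by the law of large numbers, the sample mean approaches N * n / M
print(draws.mean(), N * n / M)
```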

We can generate r.v.s from any discrete distribution with finite support using the `numpy.random.choice` function. When we first introduced the `numpy.random.choice` function, we said that it can be used in the form `numpy.random.choice(numpy.arange(1, n+1), k, replace=False)` or `numpy.random.choice(numpy.arange(1, n+1), k)` to sample $k$ times from the integers 1 through $n$, either without or with replacement. For example, to generate 5 independent $DUnif(1, 2, \ldots, 100)$ r.v.s, we can use the function invocation `numpy.random.choice(numpy.arange(1, 100+1), 5)`.

It turns out that `numpy.random.choice` is far more versatile. If we want to sample from the values $x_1, \ldots, x_n$ with probabilities $p_1, \ldots, p_n$, we simply create an array `x` containing all the $x_i$ and an array `p` containing all the $p_i$, then feed them into `choice`. For example, suppose we are interested in generating realizations of i.i.d. r.v.s $X_1, \ldots, X_{100}$ whose PMF is

\begin{align} P(X_j = 0) = 0.25, \quad P(X_j = 1) = 0.5, \quad P(X_j = 5) = 0.1, \quad P(X_j = 10) = 0.15 \end{align}

and $P(X_j = x) = 0$ for all other values of $x$. First, we use the `numpy.array` function to create arrays with the support of the distribution and the PMF probabilities.

In [12]:

```
x = np.array([0, 1, 5, 10])
p = np.array([0.25, 0.5, 0.1, 0.15])
```

Next, we use the `choice` function. Here's how to get 100 samples from the above PMF:

In [13]:

```
np.random.seed(377)
np.random.choice(x, 100, p=p)
```

Out[13]:

The inputs are the array `x` to sample from, the number of samples to generate (100 in this case), the probabilities `p` of sampling the values in `x` (if this is omitted, the probabilities are assumed equal), and whether to sample with replacement (the default is `replace=True`).
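To see that `choice` really samples with the given probabilities (an added sketch; the sample size of 100,000 is our choice, not the book's), we can compare the empirical frequencies of a large sample against `p`:

```python
import numpy as np

x = np.array([0, 1, 5, 10])
p = np.array([0.25, 0.5, 0.1, 0.15])
np.random.seed(377)
samples = np.random.choice(x, 100_000, p=p)
# each value's empirical frequency should be close to its probability
for value, prob in zip(x, p):
    print(value, (samples == value).mean(), prob)
```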

© Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science).