Notebook

B¶

Python / NumPy / SciPy¶

This Jupyter notebook is the Python equivalent of the R code in appendix B, R, pp. 561 - 565, Introduction to Probability, 1st Edition, Blitzstein & Hwang.

B.1 Vectors¶

Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a NumPy array, as there are many advantages to using a NumPy array over a regular Python lists. The rules for operating on NumPy arrays of different shape are quite different from vectors in R (please see Broadcasting).

Building on NumPy, SciPy is provides functions for mathematics, science, and engineering. The scipy.stats module contains a large number of probability distributions as well as a growing library of statistical functions.

Command	What it does
`numpy.array([1, 1, 0, 2.7, 3.1])`	creates the array [1, 1, 0, 2.7, 3.1]
`numpy.arange(1, 101)`	creates the array [1, 2, ..., 100]
`numpy.arange(1, 100)**3`	creates the array [1³, 2³, ..., 100³]
`numpy.zeros(50)`	creates the array [0, 0,..., 0] of length 50
`numpy.arange(0, 100, 3)`	creates the array [0, 3, 6, 9, ..., 99]
`v[4]`	5^th entry of vector $\mathtt{v}$ (index starts at 0)
`v[[True, True, True, True, False, True, ... , True]]`	all but the 5^th entry of $\mathtt{v}$
`v[[2, 0, 3]]`	3^rd, 1^st, 4^th entries of array $\mathtt{v}$
`v[v>2]`	entries of $\mathtt{v}$ that exceed 2
`numpy.where(v > 2)`	indices of $\mathtt{v}$ such that entry exceeds 2
`numpy.where(v == 7)`	indices of $\mathtt{v}$ such that entry equals 7
`v.min()`	minimum of $\mathtt{v}$
`v.max()`	maximum of $\mathtt{v}$
`numpy.where(v==v.max())`	indices where $\mathtt{v.max()}$ is achieved
`v.sum()`	sum of the entries in $\mathtt{v}$
`v.cumsum()`	cumulative sums of the entries in $\mathtt{v}$
`v.prod()`	product of the entries in $\mathtt{v}$
`scipy.stats.rankdata(v)`	ranks of the entries in $\mathtt{v}$
`v.size`	number of elements in array $\mathtt{v}$
`numpy.sort(v)`	return a sorted copy of array $\mathtt{v}$ (in increasing order)
`numpy.unique(v)`	lists each element of $\mathtt{v}$ once, without duplicates
`numpy.unique(v, return_counts=True)`	tallies how many times each element of $\mathtt{v}$ occurs
`dict(zip(*numpy.unique(v, return_counts=True)))`	return mapping of array $\mathtt{v}$ element keys and frequency values
`numpy.hstack([v, w])`	concatenates vectors $\mathtt{v}$ and $\mathtt{w}$
`numpy.union1d(v, w)`	union of $\mathtt{v}$ and $\mathtt{w}$ as sets
`numpy.intersect1d(v, w)`	intersection of $\mathtt{v}$ and $\mathtt{w}$ as sets
`v + w`	adds $\mathtt{v}$ and $\mathtt{w}$ entrywise, if w is a scalar (see Broadcasting)
`v * w`	multiplies $\mathtt{v}$ and $\mathtt{w}$ entrywise, if w is a scalar (see Broadcasting)

B.2 Matrices¶

From Linear Algebra (scipy.linalg) page at SciPy.org:

The use of the numpy.matrix class is discouraged, since it adds nothing that cannot be accomplished with 2D numpy.ndarray objects.
scipy.linalg contains all the functions in numpy.linalg, plus some other more advanced ones... Another advantage of using scipy.linalg over numpy.linalg is that it is always compiled with BLAS/LAPACK support... Therefore, unless you don't want to add scipy as a dependency to your numpy program, use scipy.linalg instead of numpy.linalg

Therefore, these notebooks will use numpy.ndarray for matrices, and use scipy.linalg for its linear algebra functionality.

Command	What it does
`numpy.array([[1,5], [3,7]])`	creates the matrix $\begin{pmatrix} 1 & 5 \\ 3 & 7 \end{pmatrix}$
`A.shape`	gives the dimensions of matrix `A`
`numpy.diag(A)`	extracts the diagonal of matrix `A`
`numpy.diag([1, 7])`	creates the diagonal matrix $\begin{pmatrix} 1 & 0 \\ 0 & 7 \end{pmatrix}$
`numpy.stack([u, v, w])`	binds arrays `u`, `v`, `w` into a matrix, as rows
`numpy.column_stack([u, v, w])`	binds arrays `u`, `v`, `w` into a matrix, as columns
`A.T`	transpose of matrix `A`
`A[1,2]`	row 2, column 3 entry of matrix `A`
`A[1, :]`	row 2 of matrix `A` (as a vector)
`A[:, 2]`	column 3 of matrix `A` (as a vector)
`A[[0,2],:][:,[1,3]]`	submatrix of A, keeping rows 1, 3 and columns 2, 4 see Array object, Combining advanced and basic indexing
`A.sum(axis=1)`	row sums of matrix `A`
`A.mean(axis=1)`	row averages of matrix `A`
`A.sum(axis=0)`	column sums of matrix `A`
`A.mean(axis=0)`	column averages of matrix `A`
`scipy.linalg.eig(A)`	eigenvalues and eigenvectors of matrix `A`
`scipy.linalg.inv(A)`	matrix `A`⁻¹
`scipy.linalg.solve(A, b)`	solves `Ax = b` for `x` (where `b` is a column vector)
`A @ B`	matrix multiplication `AB`
`scipy.linalg.fractional_matrix_power(A, k)`	matrix power `A`^k

B.3 Math¶

NumPy has many convenient mathematical functions, which are geared for operations on scalars, vectors and matrices.
SciPy also offers functions similar to NumPy, but the SciPy submodules are more extensive, and, depending upon your Python distribution, may also already be optimized to take advantage of the Intel Math Kernel Library, LAPACK, and BLAS.
Thus, unless you have clear reasons for not using SciPy, we suggest you use the scipy submodules whenever possible.
Note that while many of these functions are provided in the math standard Python module, the NumPy and SciPy functions below work out of the box on scalar values as well as vectors.

Command	What it does
`numpy.abs(x)`	$\lvert x \rvert$
`numpy.exp(x)`	$e^x$
`numpy.log(x)`	$log(x)$
`numpy.log(x) / numpy.log(b)`	$log_{b}(x)$
`numpy.sqrt(x)`	$\sqrt{x}$
`numpy.floor(x)`	$\lfloor x \rfloor$
`numpy.ceil(x)`	$\lceil x \rceil$
`scipy.special.factorial(n)`	$n!$
`scipy.special.gammaln(n + 1)`	$log(n!)$ (helps prevent overflow)
`scipy.special.gamma(a)`	$\Gamma(a)$
`scipy.special.gammaln(a)`	$log(\Gamma(a))$ (helps prevent overflow)
`scipy.special.comb(n, k)`	binomial coefficient $\binom{n}{k}$
`pbirthday(k)`	(see Birthday problem calculation and simulation, Ch. 1 notebook)
`x2 if x > 0 else \x3`	$x^2$ if $x > 0$, $x^3$ otherwise (piecewise)
`def f(x): return exp(-x)`	defines the function $f$ by $f(x) = e^{-x}$
`scipy.integrate.quad(f, a, b)`	finds $\int_{a}^{b} f(x) \, dx$ numerically
`scipy.optimize.fminbound(f, a, b)`	minimizes $f$ numerically on [a, b]
`scipy.optimize.fminbound(lambda x: -f(x), a, b)`	maximizes $f$ numerically on [a, b]
`scipy.optimize.root(f, x0=numpy.mean([a,b]))`	searches numerically for a zero of $f$ in $[a, b]$ initially guessing at $\frac{a + b}{2}$

B.4 Sampling and simulation¶

Command	What it does
`numpy.random.permutation(7)`	random permutation of 0, 1, ..., 6
`numpy.random.permutation(range(1,8))`	random permutation of 1, 2, ..., 7
`numpy.random.choice(range(1,53), size=5, replace=False)`	picks 5 times from 1, 2, ..., 52 (don't replace)
`numpy.random.choice(list(string.ascii_lowercase),` `size=5, replace=False)`	picks 5 random letters of the alphabet (don't replace) see Common `string` operations
`numpy.random.choice(range(1,4), size=5, p=p)`	picks 5 times from 1, 2, 3 with probabilities $\mathtt{p}$ (replace)
`[experiment() for _ in range(10**4)]`	gathers 10⁴ executions of $\mathtt{experiment}$ function into array

B.5 Plotting¶

The Python-based Jupyter notebooks here in this repository all make use of the matplotlib.pyplot state-based interface to matplotlib.
The %matplotlib inline magic command provides a MATLAB-like way of plotting in the notebook.
matplotlib.pyplot is mainly intended for interactive plots and simple cases of programmatic plot generation
If the matplotlib colormaps are not enough, you can find all sorts of color suggestions at colorbrewer2.org.
The single underscore variable name _ is often used to denote a "throwaway" variable whose value is not used.
- For example, in a for loop, if the loop counter is never really used, you can write a for loop like this:
```
for _ in range(10):
    print("I don't care what iteration I'm on!")
```
- _ is also used to assign but ignore return values from functions. In the matplotlib.pyplot.subplots example below, we ignore the fig returned by assigning that to _.
matplotlib being state-based, you can continue to add to a plot until matplotlib.pyplot.show is called; or in a Jupyter notebook, until the end of a code cell.

Command	What it does
`x = numpy.linspace(a, b, num=n)`	create a vector of `n` equally spaced points from `a` to `b`
`matplotlib.pyplot.plot(x, f(x))`	graphs the given vector `x` and a function `f` on `x`
`matplotlib.pyplot.scatter(x, f(x))`	creates scatter plot of the points $(x_i, \, y_i)$
`matplotlib.pyplot.plot(x, y, linestyle="--")`	creates line plot of the points $(x_i, \, y_i)$, with the given linestyle
`matplotlib.pyplot.scatter(x,y)`	adds the points $(x_i, \, y_i)$ to the plot
`lines(x,y)`	adds line segments through the $(x_i, \, y_i)$ to the plot
`matplotlib.pyplot.plot(x, b*x + a)`	adds the line with intercept `a`, slope `b` to the plot, given `x`
`matplotlib.pyplot.hist(x, bins=b, col="blue")`	blue histogram of the values in `x`, with `b` bins
`_, (ax1, ax2) = matplotlib.pyplot.subplots(1, 2)`	tells `matplotlib` we want 2 side-by-side plots (a 1 $\times$ 2 array of plots) where the axes assigned are `ax1` and `ax2`

B.6 Programming¶

Command	What it does
`x = numpy.pi`	sets $\mathtt{x}$ equal to $\pi$
`x>3 and x<5`	Is $\mathtt{x > 3}$ and $\mathtt{x < 5}$? ($\mathtt{True/False}$)
`x>3 or x<5`	Is $\mathtt{x > 3}$ or $\mathtt{x < 5}$? ($\mathtt{True/False}$)
`if n>3: x = x + 1`	adds 1 to $\mathtt{x}$ if $\mathtt{n > 3}$
`x = x+1 if n==0 else x + 2`	adds 1 to $\mathtt{x}$ if $\mathtt{n = 0}$, else adds 2

B.7 Summary statistics¶

Command	What it does
`numpy.mean(v)`	sample mean of array `v`
`numpy.var(v, ddof=1)`	sample variance of vector `v`
`numpy.std(v, ddof=1)`	sample standard deviation of vector `v`
`numpy.median(v)`	sample median of vector `v`
`numpy.percentile(v, [0, 25, 50, 75, 100])`	min, 1^st quartile, median, 3^rd quartile, max of `v`
`numpy.percentile(v, p)`	p^th sample quantile of vector `v`
`numpy.cov(v, w, ddof=1)[0,1]`	sample covariance of vectors `v` and `w`
`numpy.corrcoef(v, w)[0,1]`	sample correlation of vectors `v` and `w`

B.8 Distributions¶

Discrete distributions in scipy.stats
Continuous distributions in scipy.stats
Multivariate distributions in scipy.stats
Please see the Statistical functions page for scipy.stats for a full list of the probability distributions and statistical functions provided.

Command	What it does
`help(`distribution`)`	shows documentation on the distribution named
`scipy.stats.binom.pmf(k, n, p)`	PMF $P(X = k)$ for $X \sim Bin(n, p)$
`scipy.stats.binom.cdf(x, n, p)`	CDF $P(X \leq x)$ for $X \sim Bin(n, p)$
`scipy.stats.binom.ppf(q, n, p)`	quantile $P(X \leq x) \geq q$ for $X \sim Bin(n, p)$
`scipy.stats.binom.rvs(n, p, size=r)`	vector of `r` i.i.d. $Bin(n, p)$ r.v.s
`scipy.stats.geom.pmf(k, p)`	PMF $P(X=k) = (1-p)^{k-1}\,p$
`scipy.stats.hypergeom.pmf(k, M, n, N)`	PMF $P(X=k) = \frac{\binom{n}{k}\binom{M-n}{N-k}}{\binom{M}{N}} $
`scipy.stats.nbinom.pmf(k, n, p)`	PMF $P(X=k) = \binom{k+n-1}{n-1} \, p^n \, (1-p)^k$
`scipy.stats.poisson.pmf(k, mu)`	PMF $P(X=k) = e^{-\mu}\,\frac{\mu^k}{k!}$
`scipy.stats.beta.pdf(x,a,b)`	PDF $f(x) = \frac{\Gamma(a+b)\,x^{a-1}\,(1-x)^{b-1}}{\Gamma(a)\,\Gamma(b)}$
`scipy.stats.cauchy.pdf(x)`	PDF $f(x) \frac{1}{\pi \, (1 + x^2)}$
`scipy.stats.chi2.pdf(x, df)`	PDF $f(x)$ for $X \sim \chi^2_{df}$
`scipy.stats.expon.pdf(x, scale=1/b)`	PDF $f(x)$ for $X \sim Expo(b)$
`scipy.stats.gamma.pdf(x, a, scale=1/b)`	PDF $f(x)$ for $X \sim Gamma(a, b)$
`scipy.stats.lognorm.pdf(x, loc=m, scale=s)`	PDF $f(x)$ for $X \sim \mathcal{LN}(m, s^2)$
`scipy.stats.norm.pdf(x, loc=m, scale=s)`	PDF $f(x)$ for $X \sim \mathcal{N}(m, s^2)$
`scipy.stats.t.pdf(x, df)`	PDF $f(x)$ for $X \sim t_{df}$
`scipy.stats.uniform.pdf(x, a, a+b)`	PDF $f(x)$ for $X \sim Unif(a, b)$

The application programming interface (API for the scipy.stats module is fairly consistent for the distributions supported. Discrete distributions have pmf, cdf, and rvs functions for probability mass, cumulative distribution, and random value generation. Similarly for continuous distributions, the functions for pdf, cdf, and rvs correspond to probability density, cumulative distribution, and random value generation. Note that for Normal distributions, the mean and standard deviation need to be provided as the parameters, rather than mean and variance.

Among the multivariate distributions supported, scipy.stats.multinomial has pmf and rvs for probability mass function and random value generation. scipy.stats.multivariate_normal has pdf, cdf, and rvs for probabilty density, cumulative distribution, and random value generation functions.