B

Python / NumPy / SciPy

This Jupyter notebook is the Python equivalent of the R code in appendix B, R, pp. 561 - 565, Introduction to Probability, 2nd Edition, Blitzstein & Hwang.


B.1 Vectors

Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a NumPy array, as there are many advantages to using a NumPy array over a regular Python lists. The rules for operating on NumPy arrays of different shape are quite different from vectors in R (please see Broadcasting).

Building on NumPy, SciPy is provides functions for mathematics, science, and engineering. The scipy.stats module contains a large number of probability distributions as well as a growing library of statistical functions.

Command What it does
numpy.array([1, 1, 0, 2.7, 3.1]) creates the array [1, 1, 0, 2.7, 3.1]
numpy.arange(1, 101) creates the array [1, 2, ..., 100]
numpy.arange(1, 100)**3 creates the array [13, 23, ..., 1003]
numpy.zeros(50) creates the array [0, 0,..., 0] of length 50
numpy.arange(0, 100, 3) creates the array [0, 3, 6, 9, ..., 99]
v[4] 5th entry of vector $\mathtt{v}$ (index starts at 0)
v[[True, True, True, True, False, True, ... , True]] all but the 5th entry of $\mathtt{v}$
v[[2, 0, 3]] 3rd, 1st, 4th entries of array $\mathtt{v}$
v[v>2] entries of $\mathtt{v}$ that exceed 2
numpy.where(v > 2) indices of $\mathtt{v}$ such that entry exceeds 2
numpy.where(v == 7) indices of $\mathtt{v}$ such that entry equals 7
v.min() minimum of $\mathtt{v}$
v.max() maximum of $\mathtt{v}$
numpy.where(v==v.max()) indices where $\mathtt{v.max()}$ is achieved
v.sum() sum of the entries in $\mathtt{v}$
v.cumsum() cumulative sums of the entries in $\mathtt{v}$
v.prod() product of the entries in $\mathtt{v}$
scipy.stats.rankdata(v) ranks of the entries in $\mathtt{v}$
v.size number of elements in array $\mathtt{v}$
numpy.sort(v) return a sorted copy of array $\mathtt{v}$ (in increasing order)
numpy.unique(v) lists each element of $\mathtt{v}$ once, without duplicates
numpy.unique(v, return_counts=True) tallies how many times each element of $\mathtt{v}$ occurs
dict(zip(*numpy.unique(v, return_counts=True))) return mapping of array $\mathtt{v}$ element keys and frequency values
numpy.hstack([v, w]) concatenates vectors $\mathtt{v}$ and $\mathtt{w}$
numpy.union1d(v, w) union of $\mathtt{v}$ and $\mathtt{w}$ as sets
numpy.intersect1d(v, w) intersection of $\mathtt{v}$ and $\mathtt{w}$ as sets
v + w adds $\mathtt{v}$ and $\mathtt{w}$ entrywise, if w is a scalar (see Broadcasting)
v * w multiplies $\mathtt{v}$ and $\mathtt{w}$ entrywise, if w is a scalar (see Broadcasting)

B.2 Matrices

From Linear Algebra (scipy.linalg) page at SciPy.org:

  • The use of the numpy.matrix class is discouraged, since it adds nothing that cannot be accomplished with 2D numpy.ndarray objects.
  • scipy.linalg contains all the functions in numpy.linalg, plus some other more advanced ones... Another advantage of using scipy.linalg over numpy.linalg is that it is always compiled with BLAS/LAPACK support... Therefore, unless you don't want to add scipy as a dependency to your numpy program, use scipy.linalg instead of numpy.linalg

Therefore, these notebooks will use numpy.ndarray for matrices, and use scipy.linalg for its linear algebra functionality.

Command What it does
numpy.array([[1,5], [3,7]]) creates the matrix $\begin{pmatrix} 1 & 5 \\ 3 & 7 \end{pmatrix}$
A.shape gives the dimensions of matrix A
numpy.diag(A) extracts the diagonal of matrix A
numpy.diag([1, 7]) creates the diagonal matrix $\begin{pmatrix} 1 & 0 \\ 0 & 7 \end{pmatrix}$
numpy.stack([u, v, w]) binds arrays u, v, w into a matrix, as rows
numpy.column_stack([u, v, w]) binds arrays u, v, w into a matrix, as columns
A.T transpose of matrix A
A[1,2] row 2, column 3 entry of matrix A
A[1, :] row 2 of matrix A (as a vector)
A[:, 2] column 3 of matrix A (as a vector)
A[[0,2],:][:,[1,3]] submatrix of A, keeping rows 1, 3 and columns 2, 4
see Array object, Combining advanced and basic indexing
A.sum(axis=1) row sums of matrix A
A.mean(axis=1) row averages of matrix A
A.sum(axis=0) column sums of matrix A
A.mean(axis=0) column averages of matrix A
scipy.linalg.eig(A) eigenvalues and eigenvectors of matrix A
scipy.linalg.inv(A) matrix A−1
scipy.linalg.solve(A, b) solves Ax = b for x (where b is a column vector)
A @ B matrix multiplication AB
scipy.linalg.fractional_matrix_power(A, k) matrix power Ak

B.3 Math

  • NumPy has many convenient mathematical functions, which are geared for operations on scalars, vectors and matrices.
  • SciPy also offers functions similar to NumPy, but the SciPy submodules are more extensive, and, depending upon your Python distribution, may also already be optimized to take advantage of the Intel Math Kernel Library, LAPACK, and BLAS.
  • Thus, unless you have clear reasons for not using SciPy, we suggest you use the scipy submodules whenever possible.
  • Note that while many of these functions are provided in the math standard Python module, the NumPy and SciPy functions below work out of the box on scalar values as well as vectors.
Command What it does
numpy.abs(x) $\lvert x \rvert$
numpy.exp(x) $e^x$
numpy.log(x) $log(x)$
numpy.log(x) / numpy.log(b) $log_{b}(x)$
numpy.sqrt(x) $\sqrt{x}$
numpy.floor(x) $\lfloor x \rfloor$
numpy.ceil(x) $\lceil x \rceil$
scipy.special.factorial(n) $n!$
scipy.special.gammaln(n + 1) $log(n!)$ (helps prevent overflow)
scipy.special.gamma(a) $\Gamma(a)$
scipy.special.gammaln(a) $log(\Gamma(a))$ (helps prevent overflow)
scipy.special.comb(n, k) binomial coefficient $\binom{n}{k}$
pbirthday(k) (see Birthday problem calculation and simulation, Ch. 1 notebook)
x**2 if x > 0 else \x**3 $x^2$ if $x > 0$, $x^3$ otherwise (piecewise)
def f(x): return exp(-x) defines the function $f$ by $f(x) = e^{-x}$
scipy.integrate.quad(f, a, b) finds $\int_{a}^{b} f(x) \, dx$ numerically
scipy.optimize.fminbound(f, a, b) minimizes $f$ numerically on [a, b]
scipy.optimize.fminbound(lambda x: -f(x), a, b) maximizes $f$ numerically on [a, b]
scipy.optimize.root(f, x0=numpy.mean([a,b])) searches numerically for a zero of $f$ in $[a, b]$
initially guessing at $\frac{a + b}{2}$

B.4 Sampling and simulation

Command What it does
numpy.random.permutation(7) random permutation of 0, 1, ..., 6
numpy.random.permutation(range(1,8)) random permutation of 1, 2, ..., 7
numpy.random.choice(range(1,53), size=5, replace=False) picks 5 times from 1, 2, ..., 52 (don't replace)
numpy.random.choice(list(string.ascii_lowercase),
size=5, replace=False)
picks 5 random letters of the alphabet (don't replace)
see Common string operations
numpy.random.choice(range(1,4), size=5, p=p) picks 5 times from 1, 2, 3 with probabilities $\mathtt{p}$ (replace)
[experiment() for _ in range(10**4)] gathers 104 executions of $\mathtt{experiment}$ function into array

B.5 Plotting

  • The Python-based Jupyter notebooks here in this repository all make use of the matplotlib.pyplot state-based interface to matplotlib.
  • The %matplotlib inline magic command provides a MATLAB-like way of plotting in the notebook.
  • matplotlib.pyplot is mainly intended for interactive plots and simple cases of programmatic plot generation
  • If the matplotlib colormaps are not enough, you can find all sorts of color suggestions at colorbrewer2.org.
  • The single underscore variable name _ is often used to denote a "throwaway" variable whose value is not used.

    • For example, in a for loop, if the loop counter is never really used, you can write a for loop like this:

      for _ in range(10):
          print("I don't care what iteration I'm on!")
    • _ is also used to assign but ignore return values from functions. In the matplotlib.pyplot.subplots example below, we ignore the fig returned by assigning that to _.

  • matplotlib being state-based, you can continue to add to a plot until matplotlib.pyplot.show is called; or in a Jupyter notebook, until the end of a code cell.
Command What it does
x = numpy.linspace(a, b, num=n) create a vector of n equally spaced points from a to b
matplotlib.pyplot.plot(x, f(x)) graphs the given vector x and a function f on x
matplotlib.pyplot.scatter(x, f(x)) creates scatter plot of the points $(x_i, \, y_i)$
matplotlib.pyplot.plot(x, y, linestyle="--") creates line plot of the points $(x_i, \, y_i)$, with the given linestyle
matplotlib.pyplot.scatter(x,y) adds the points $(x_i, \, y_i)$ to the plot
lines(x,y) adds line segments through the $(x_i, \, y_i)$ to the plot
matplotlib.pyplot.plot(x, b*x + a) adds the line with intercept a, slope b to the plot, given x
matplotlib.pyplot.hist(x, bins=b, col="blue") blue histogram of the values in x, with b bins
_, (ax1, ax2) = matplotlib.pyplot.subplots(1, 2) tells matplotlib we want 2 side-by-side plots (a 1 $\times$ 2 array of plots)
where the axes assigned are ax1 and ax2

B.6 Programming

Command What it does
x = numpy.pi sets $\mathtt{x}$ equal to $\pi$
x>3 and x<5 Is $\mathtt{x > 3}$ and $\mathtt{x < 5}$? ($\mathtt{True/False}$)
x>3 or x<5 Is $\mathtt{x > 3}$ or $\mathtt{x < 5}$? ($\mathtt{True/False}$)
if n>3: x = x + 1 adds 1 to $\mathtt{x}$ if $\mathtt{n > 3}$
x = x+1 if n==0 else x + 2 adds 1 to $\mathtt{x}$ if $\mathtt{n = 0}$, else adds 2

B.7 Summary statistics

Command What it does
numpy.mean(v) sample mean of array v
numpy.var(v, ddof=1) sample variance of vector v
numpy.std(v, ddof=1) sample standard deviation of vector v
numpy.median(v) sample median of vector v
numpy.percentile(v, [0, 25, 50, 75, 100]) min, 1st quartile, median, 3rd quartile, max of v
numpy.percentile(v, p) pth sample quantile of vector v
numpy.cov(v, w, ddof=1)[0,1] sample covariance of vectors v and w
numpy.corrcoef(v, w)[0,1] sample correlation of vectors v and w

B.8 Distributions

Command What it does
help(distribution) shows documentation on the distribution named
scipy.stats.binom.pmf(k, n, p) PMF $P(X = k)$ for $X \sim Bin(n, p)$
scipy.stats.binom.cdf(x, n, p) CDF $P(X \leq x)$ for $X \sim Bin(n, p)$
scipy.stats.binom.ppf(q, n, p) quantile $P(X \leq x) \geq q$ for $X \sim Bin(n, p)$
scipy.stats.binom.rvs(n, p, size=r) vector of r i.i.d. $Bin(n, p)$ r.v.s
scipy.stats.geom.pmf(k, p) PMF $P(X=k) = (1-p)^{k-1}\,p$
scipy.stats.hypergeom.pmf(k, M, n, N) PMF $P(X=k) = \frac{\binom{n}{k}\binom{M-n}{N-k}}{\binom{M}{N}} $
scipy.stats.nbinom.pmf(k, n, p) PMF $P(X=k) = \binom{k+n-1}{n-1} \, p^n \, (1-p)^k$
scipy.stats.poisson.pmf(k, mu) PMF $P(X=k) = e^{-\mu}\,\frac{\mu^k}{k!}$
scipy.stats.beta.pdf(x,a,b) PDF $f(x) = \frac{\Gamma(a+b)\,x^{a-1}\,(1-x)^{b-1}}{\Gamma(a)\,\Gamma(b)}$
scipy.stats.cauchy.pdf(x) PDF $f(x) \frac{1}{\pi \, (1 + x^2)}$
scipy.stats.chi2.pdf(x, df) PDF $f(x)$ for $X \sim \chi^2_{df}$
scipy.stats.expon.pdf(x, scale=1/b) PDF $f(x)$ for $X \sim Expo(b)$
scipy.stats.gamma.pdf(x, a, scale=1/b) PDF $f(x)$ for $X \sim Gamma(a, b)$
scipy.stats.lognorm.pdf(x, loc=m, scale=s) PDF $f(x)$ for $X \sim \mathcal{LN}(m, s^2)$
scipy.stats.norm.pdf(x, loc=m, scale=s) PDF $f(x)$ for $X \sim \mathcal{N}(m, s^2)$
scipy.stats.t.pdf(x, df) PDF $f(x)$ for $X \sim t_{df}$
scipy.stats.uniform.pdf(x, a, a+b) PDF $f(x)$ for $X \sim Unif(a, b)$

The application programming interface (API for the scipy.stats module is fairly consistent for the distributions supported. Discrete distributions have pmf, cdf, and rvs functions for probability mass, cumulative distribution, and random value generation. Similarly for continuous distributions, the functions for pdf, cdf, and rvs correspond to probability density, cumulative distribution, and random value generation. Note that for Normal distributions, the mean and standard deviation need to be provided as the parameters, rather than mean and variance.

Among the multivariate distributions supported, scipy.stats.multinomial has pmf and rvs for probability mass function and random value generation. scipy.stats.multivariate_normal has pdf, cdf, and rvs for probabilty density, cumulative distribution, and random value generation functions.


Joseph K. Blitzstein and Jessica Hwang, Harvard University and Stanford University, © 2019 by Taylor and Francis Group, LLC