This Jupyter notebook is the Python equivalent of the R code in appendix B, R, pp. 561 - 565, Introduction to Probability, 2nd Edition, Blitzstein & Hwang.
Rather than using the usual Python list (array), for probability and statistics it is more convenient to use a NumPy array, as there are many advantages to using a NumPy array over a regular Python lists. The rules for operating on NumPy arrays of different shape are quite different from vectors in R (please see Broadcasting).
Building on NumPy, SciPy is provides functions for mathematics, science, and engineering. The scipy.stats
module contains a large number of probability distributions as well as a growing library of statistical functions.
Command | What it does |
---|---|
numpy.array([1, 1, 0, 2.7, 3.1]) |
creates the array [1, 1, 0, 2.7, 3.1] |
numpy.arange(1, 101) |
creates the array [1, 2, ..., 100] |
numpy.arange(1, 100)**3 |
creates the array [13, 23, ..., 1003] |
numpy.zeros(50) |
creates the array [0, 0,..., 0] of length 50 |
numpy.arange(0, 100, 3) |
creates the array [0, 3, 6, 9, ..., 99] |
v[4] |
5th entry of vector $\mathtt{v}$ (index starts at 0) |
v[[True, True, True, True, False, True, ... , True]] |
all but the 5th entry of $\mathtt{v}$ |
v[[2, 0, 3]] |
3rd, 1st, 4th entries of array $\mathtt{v}$ |
v[v>2] |
entries of $\mathtt{v}$ that exceed 2 |
numpy.where(v > 2) |
indices of $\mathtt{v}$ such that entry exceeds 2 |
numpy.where(v == 7) |
indices of $\mathtt{v}$ such that entry equals 7 |
v.min() |
minimum of $\mathtt{v}$ |
v.max() |
maximum of $\mathtt{v}$ |
numpy.where(v==v.max()) |
indices where $\mathtt{v.max()}$ is achieved |
v.sum() |
sum of the entries in $\mathtt{v}$ |
v.cumsum() |
cumulative sums of the entries in $\mathtt{v}$ |
v.prod() |
product of the entries in $\mathtt{v}$ |
scipy.stats.rankdata(v) |
ranks of the entries in $\mathtt{v}$ |
v.size |
number of elements in array $\mathtt{v}$ |
numpy.sort(v) |
return a sorted copy of array $\mathtt{v}$ (in increasing order) |
numpy.unique(v) |
lists each element of $\mathtt{v}$ once, without duplicates |
numpy.unique(v, return_counts=True) |
tallies how many times each element of $\mathtt{v}$ occurs |
dict(zip(*numpy.unique(v, return_counts=True))) |
return mapping of array $\mathtt{v}$ element keys and frequency values |
numpy.hstack([v, w]) |
concatenates vectors $\mathtt{v}$ and $\mathtt{w}$ |
numpy.union1d(v, w) |
union of $\mathtt{v}$ and $\mathtt{w}$ as sets |
numpy.intersect1d(v, w) |
intersection of $\mathtt{v}$ and $\mathtt{w}$ as sets |
v + w |
adds $\mathtt{v}$ and $\mathtt{w}$ entrywise, if w is a scalar (see Broadcasting) |
v * w |
multiplies $\mathtt{v}$ and $\mathtt{w}$ entrywise, if w is a scalar (see Broadcasting) |
From Linear Algebra (scipy.linalg) page at SciPy.org:
numpy.matrix
class is discouraged, since it adds nothing that cannot be accomplished with 2D numpy.ndarray
objects.scipy.linalg
contains all the functions in numpy.linalg
, plus some other more advanced ones... Another advantage of using scipy.linalg
over numpy.linalg
is that it is always compiled with BLAS/LAPACK support... Therefore, unless you don't want to add scipy as a dependency to your numpy program, use scipy.linalg
instead of numpy.linalg
Therefore, these notebooks will use numpy.ndarray
for matrices, and use scipy.linalg
for its linear algebra functionality.
Command | What it does |
---|---|
numpy.array([[1,5], [3,7]]) |
creates the matrix $\begin{pmatrix} 1 & 5 \\ 3 & 7 \end{pmatrix}$ |
A.shape |
gives the dimensions of matrix A |
numpy.diag(A) |
extracts the diagonal of matrix A |
numpy.diag([1, 7]) |
creates the diagonal matrix $\begin{pmatrix} 1 & 0 \\ 0 & 7 \end{pmatrix}$ |
numpy.stack([u, v, w]) |
binds arrays u , v , w into a matrix, as rows |
numpy.column_stack([u, v, w]) |
binds arrays u , v , w into a matrix, as columns |
A.T |
transpose of matrix A |
A[1,2] |
row 2, column 3 entry of matrix A |
A[1, :] |
row 2 of matrix A (as a vector) |
A[:, 2] |
column 3 of matrix A (as a vector) |
A[[0,2],:][:,[1,3]] |
submatrix of A, keeping rows 1, 3 and columns 2, 4 see Array object, Combining advanced and basic indexing |
A.sum(axis=1) |
row sums of matrix A |
A.mean(axis=1) |
row averages of matrix A |
A.sum(axis=0) |
column sums of matrix A |
A.mean(axis=0) |
column averages of matrix A |
scipy.linalg.eig(A) |
eigenvalues and eigenvectors of matrix A |
scipy.linalg.inv(A) |
matrix A −1 |
scipy.linalg.solve(A, b) |
solves Ax = b for x (where b is a column vector) |
A @ B |
matrix multiplication AB |
scipy.linalg.fractional_matrix_power(A, k) |
matrix power A k |
scipy
submodules whenever possible.math
standard Python module, the NumPy and SciPy functions below work out of the box on scalar values as well as vectors.Command | What it does |
---|---|
numpy.abs(x) |
$\lvert x \rvert$ |
numpy.exp(x) |
$e^x$ |
numpy.log(x) |
$log(x)$ |
numpy.log(x) / numpy.log(b) |
$log_{b}(x)$ |
numpy.sqrt(x) |
$\sqrt{x}$ |
numpy.floor(x) |
$\lfloor x \rfloor$ |
numpy.ceil(x) |
$\lceil x \rceil$ |
scipy.special.factorial(n) |
$n!$ |
scipy.special.gammaln(n + 1) |
$log(n!)$ (helps prevent overflow) |
scipy.special.gamma(a) |
$\Gamma(a)$ |
scipy.special.gammaln(a) |
$log(\Gamma(a))$ (helps prevent overflow) |
scipy.special.comb(n, k) |
binomial coefficient $\binom{n}{k}$ |
pbirthday(k) |
(see Birthday problem calculation and simulation, Ch. 1 notebook) |
x**2 if x > 0 else \x**3 |
$x^2$ if $x > 0$, $x^3$ otherwise (piecewise) |
def f(x): return exp(-x) |
defines the function $f$ by $f(x) = e^{-x}$ |
scipy.integrate.quad(f, a, b) |
finds $\int_{a}^{b} f(x) \, dx$ numerically |
scipy.optimize.fminbound(f, a, b) |
minimizes $f$ numerically on [a, b] |
scipy.optimize.fminbound(lambda x: -f(x), a, b) |
maximizes $f$ numerically on [a, b] |
scipy.optimize.root(f, x0=numpy.mean([a,b])) |
searches numerically for a zero of $f$ in $[a, b]$ initially guessing at $\frac{a + b}{2}$ |
Command | What it does |
---|---|
numpy.random.permutation(7) |
random permutation of 0, 1, ..., 6 |
numpy.random.permutation(range(1,8)) |
random permutation of 1, 2, ..., 7 |
numpy.random.choice(range(1,53), size=5, replace=False) |
picks 5 times from 1, 2, ..., 52 (don't replace) |
numpy.random.choice(list(string.ascii_lowercase), size=5, replace=False) |
picks 5 random letters of the alphabet (don't replace) see Common string operations |
numpy.random.choice(range(1,4), size=5, p=p) |
picks 5 times from 1, 2, 3 with probabilities $\mathtt{p}$ (replace) |
[experiment() for _ in range(10**4)] |
gathers 104 executions of $\mathtt{experiment}$ function into array |
The Python-based Jupyter notebooks here in this repository all make use of the matplotlib.pyplot
state-based interface to matplotlib.
The %matplotlib inline
magic command provides a MATLAB-like way of plotting in the notebook.
matplotlib.pyplot
is mainly intended for interactive plots and simple cases of programmatic plot generation
If the matplotlib
colormaps are not enough, you can find all sorts of color suggestions at colorbrewer2.org
.
The single underscore variable name _
is often used to denote a "throwaway" variable whose value is not used.
For example, in a for loop, if the loop counter is never really used, you can write a for loop like this:
for _ in range(10):
print("I don't care what iteration I'm on!")
_
is also used to assign but ignore return values from functions. In the matplotlib.pyplot.subplots
example below, we ignore the fig
returned by assigning that to _
.
matplotlib
being state-based, you can continue to add to a plot until matplotlib.pyplot.show
is called; or in a Jupyter notebook, until the end of a code cell.
Command | What it does |
---|---|
x = numpy.linspace(a, b, num=n) |
create a vector of n equally spaced points from a to b |
matplotlib.pyplot.plot(x, f(x)) |
graphs the given vector x and a function f on x |
matplotlib.pyplot.scatter(x, f(x)) |
creates scatter plot of the points $(x_i, \, y_i)$ |
matplotlib.pyplot.plot(x, y, linestyle="--") |
creates line plot of the points $(x_i, \, y_i)$, with the given linestyle |
matplotlib.pyplot.scatter(x,y) |
adds the points $(x_i, \, y_i)$ to the plot |
lines(x,y) |
adds line segments through the $(x_i, \, y_i)$ to the plot |
matplotlib.pyplot.plot(x, b*x + a) |
adds the line with intercept a , slope b to the plot, given x |
matplotlib.pyplot.hist(x, bins=b, col="blue") |
blue histogram of the values in x , with b bins |
_, (ax1, ax2) = matplotlib.pyplot.subplots(1, 2) |
tells matplotlib we want 2 side-by-side plots (a 1 $\times$ 2 array of plots)where the axes assigned are ax1 and ax2 |
Command | What it does |
---|---|
x = numpy.pi |
sets $\mathtt{x}$ equal to $\pi$ |
x>3 and x<5 |
Is $\mathtt{x > 3}$ and $\mathtt{x < 5}$? ($\mathtt{True/False}$) |
x>3 or x<5 |
Is $\mathtt{x > 3}$ or $\mathtt{x < 5}$? ($\mathtt{True/False}$) |
if n>3: x = x + 1 |
adds 1 to $\mathtt{x}$ if $\mathtt{n > 3}$ |
x = x+1 if n==0 else x + 2 |
adds 1 to $\mathtt{x}$ if $\mathtt{n = 0}$, else adds 2 |
Command | What it does |
---|---|
numpy.mean(v) |
sample mean of array v |
numpy.var(v, ddof=1) |
sample variance of vector v |
numpy.std(v, ddof=1) |
sample standard deviation of vector v |
numpy.median(v) |
sample median of vector v |
numpy.percentile(v, [0, 25, 50, 75, 100]) |
min, 1st quartile, median, 3rd quartile, max of v |
numpy.percentile(v, p) |
pth sample quantile of vector v |
numpy.cov(v, w, ddof=1)[0,1] |
sample covariance of vectors v and w |
numpy.corrcoef(v, w)[0,1] |
sample correlation of vectors v and w |
scipy.stats
scipy.stats
scipy.stats
scipy.stats
for a full list of the probability distributions and statistical functions provided.Command | What it does |
---|---|
help( distribution) |
shows documentation on the distribution named |
scipy.stats.binom.pmf(k, n, p) |
PMF $P(X = k)$ for $X \sim Bin(n, p)$ |
scipy.stats.binom.cdf(x, n, p) |
CDF $P(X \leq x)$ for $X \sim Bin(n, p)$ |
scipy.stats.binom.ppf(q, n, p) |
quantile $P(X \leq x) \geq q$ for $X \sim Bin(n, p)$ |
scipy.stats.binom.rvs(n, p, size=r) |
vector of r i.i.d. $Bin(n, p)$ r.v.s |
scipy.stats.geom.pmf(k, p) |
PMF $P(X=k) = (1-p)^{k-1}\,p$ |
scipy.stats.hypergeom.pmf(k, M, n, N) |
PMF $P(X=k) = \frac{\binom{n}{k}\binom{M-n}{N-k}}{\binom{M}{N}} $ |
scipy.stats.nbinom.pmf(k, n, p) |
PMF $P(X=k) = \binom{k+n-1}{n-1} \, p^n \, (1-p)^k$ |
scipy.stats.poisson.pmf(k, mu) |
PMF $P(X=k) = e^{-\mu}\,\frac{\mu^k}{k!}$ |
scipy.stats.beta.pdf(x,a,b) |
PDF $f(x) = \frac{\Gamma(a+b)\,x^{a-1}\,(1-x)^{b-1}}{\Gamma(a)\,\Gamma(b)}$ |
scipy.stats.cauchy.pdf(x) |
PDF $f(x) \frac{1}{\pi \, (1 + x^2)}$ |
scipy.stats.chi2.pdf(x, df) |
PDF $f(x)$ for $X \sim \chi^2_{df}$ |
scipy.stats.expon.pdf(x, scale=1/b) |
PDF $f(x)$ for $X \sim Expo(b)$ |
scipy.stats.gamma.pdf(x, a, scale=1/b) |
PDF $f(x)$ for $X \sim Gamma(a, b)$ |
scipy.stats.lognorm.pdf(x, loc=m, scale=s) |
PDF $f(x)$ for $X \sim \mathcal{LN}(m, s^2)$ |
scipy.stats.norm.pdf(x, loc=m, scale=s) |
PDF $f(x)$ for $X \sim \mathcal{N}(m, s^2)$ |
scipy.stats.t.pdf(x, df) |
PDF $f(x)$ for $X \sim t_{df}$ |
scipy.stats.uniform.pdf(x, a, a+b) |
PDF $f(x)$ for $X \sim Unif(a, b)$ |
The application programming interface (API for the scipy.stats
module is fairly consistent for the distributions supported. Discrete distributions have pmf
, cdf
, and rvs
functions for probability mass, cumulative distribution, and random value generation. Similarly for continuous distributions, the functions for pdf
, cdf
, and rvs
correspond to probability density, cumulative distribution, and random value generation. Note that for Normal distributions, the mean and standard deviation need to be provided as the parameters, rather than mean and variance.
Among the multivariate distributions supported, scipy.stats.multinomial
has pmf
and rvs
for probability mass function and random value generation. scipy.stats.multivariate_normal
has pdf
, cdf
, and rvs
for probabilty density, cumulative distribution, and random value generation functions.
Joseph K. Blitzstein and Jessica Hwang, Harvard University and Stanford University, © 2019 by Taylor and Francis Group, LLC