Copyright 2019 Allen Downey
BSD 3-clause license: https://opensource.org/licenses/BSD-3-Clause
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('white')
import matplotlib.pyplot as plt
import inspect
def psource(obj):
"""Prints the source code for a given object.
obj: function or method object
"""
print(inspect.getsource(obj))
For comments or questions about this section, see this issue.
The Cdf
class inherits from pd.Series
. The __init__
method is essentially unchanged, but it includes a workaround for what I think is bad behavior.
from empiricaldist import Cdf
psource(Cdf.__init__)
def __init__(self, *args, **kwargs): """Initialize a Pmf. Note: this cleans up a weird Series behavior, which is that Series() and Series([]) yield different results. See: https://github.com/pandas-dev/pandas/issues/16737 """ if args or ('index' in kwargs): super().__init__(*args, **kwargs) else: underride(kwargs, dtype=np.float64) super().__init__([], **kwargs)
You can create an empty Cdf
and then add elements.
Here's a Cdf
that representat a four-sided die.
d4 = Cdf()
d4[1] = 1
d4[2] = 2
d4[3] = 3
d4[4] = 4
d4
probs | |
---|---|
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
In a normalized Cdf
, the last probability is 1.
normalize
makes that true. The return value is the total probability before normalizing.
psource(Cdf.normalize)
def normalize(self): """Make the probabilities add up to 1 (modifies self). :return: normalizing constant """ total = self.ps[-1] self /= total return total
d4.normalize()
4
Now the Cdf is normalized.
d4
probs | |
---|---|
1 | 0.25 |
2 | 0.50 |
3 | 0.75 |
4 | 1.00 |
For comments or questions about this section, see this issue.
In a Cdf
the index contains the quantities (qs
) and the values contain the probabilities (ps
).
These attributes are available as properties that return arrays (same semantics as the Pandas values
property)
d4.qs
array([1, 2, 3, 4])
d4.ps
array([0.25, 0.5 , 0.75, 1. ])
For comments or questions about this section, see this issue.
Because Cdf
is a Series
you can initialize it with any type Series.__init__
can handle.
Here's an example with a dictionary.
d = dict(a=1, b=2, c=3)
cdf = Cdf(d)
cdf.normalize()
cdf
probs | |
---|---|
a | 0.333333 |
b | 0.666667 |
c | 1.000000 |
Here's an example with two lists.
qs = [1,2,3,4]
ps = [0.25, 0.5, 0.75, 1.0]
d4 = Cdf(ps, index=qs)
d4
probs | |
---|---|
1 | 0.25 |
2 | 0.50 |
3 | 0.75 |
4 | 1.00 |
You can copy a Cdf
like this.
d4_copy = Cdf(d4)
d4_copy
probs | |
---|---|
1 | 0.25 |
2 | 0.50 |
3 | 0.75 |
4 | 1.00 |
However, you have to be careful about sharing. In this example, the copies share the arrays:
d4.index is d4_copy.index
True
d4.ps is d4_copy.ps
True
You can avoid sharing with copy=True
d4_copy = Cdf(d4, copy=True)
d4_copy
probs | |
---|---|
1 | 0.25 |
2 | 0.50 |
3 | 0.75 |
4 | 1.00 |
d4.index is d4_copy.index
False
d4.ps is d4_copy.ps
False
Or by calling copy
explicitly.
d4_copy = d4.copy()
d4_copy
probs | |
---|---|
1 | 0.25 |
2 | 0.50 |
3 | 0.75 |
4 | 1.00 |
d4.index is d4_copy.index
False
d4.ps is d4_copy.ps
False
For comments or questions about this section, see this issue.
Cdf
provides _repr_html_
, so it looks good when displayed in a notebook.
psource(Cdf._repr_html_)
def _repr_html_(self): """Returns an HTML representation of the series. Mostly used for Jupyter notebooks. """ df = pd.DataFrame(dict(probs=self)) return df._repr_html_()
Cdf
provides plot
, which plots the Cdf as a line.
psource(Cdf.plot)
def plot(self, **options): """Plot the Cdf as a line. :param options: passed to plt.plot :return: """ underride(options, label=self.name) plt.plot(self.qs, self.ps, **options)
def decorate_dice(title):
"""Labels the axes.
title: string
"""
plt.xlabel('Outcome')
plt.ylabel('CDF')
plt.title(title)
d4.plot()
decorate_dice('One die')
Cdf
also provides step
, which plots the Cdf as a step function.
psource(Cdf.step)
def step(self, **options): """Plot the Cdf as a step function. :param options: passed to plt.step :return: """ underride(options, label=self.name, where="post") plt.step(self.qs, self.ps, **options)
d4.step()
decorate_dice('One die')
For comments or questions about this section, see this issue.
The following function makes a Cdf
object from a sequence of values.
psource(Cdf.from_seq)
@staticmethod def from_seq(seq, normalize=True, sort=True, **options): """Make a CDF from a sequence of values. seq: any kind of sequence normalize: whether to normalize the Cdf, default True sort: whether to sort the Cdf by values, default True options: passed to the pd.Series constructor :return: CDF object """ pmf = Pmf.from_seq(seq, normalize=False, sort=sort, **options) return pmf.make_cdf(normalize=normalize)
cdf = Cdf.from_seq(list('allen'))
cdf
probs | |
---|---|
a | 0.2 |
e | 0.4 |
l | 0.8 |
n | 1.0 |
cdf = Cdf.from_seq(np.array([1, 2, 2, 3, 5]))
cdf
probs | |
---|---|
1 | 0.2 |
2 | 0.6 |
3 | 0.8 |
5 | 1.0 |
For comments or questions about this section, see this issue.
Cdf
inherits [] from Series, so you can look up a quantile and get its cumulative probability.
d4[1]
0.25
d4[4]
1.0
Cdf
objects are mutable, but in general the result is not a valid Cdf.
d4[5] = 1.25
d4
probs | |
---|---|
1 | 0.25 |
2 | 0.50 |
3 | 0.75 |
4 | 1.00 |
5 | 1.25 |
d4.normalize()
d4
probs | |
---|---|
1 | 0.2 |
2 | 0.4 |
3 | 0.6 |
4 | 0.8 |
5 | 1.0 |
For comments or questions about this section, see this issue.
Evaluating a Cdf
forward maps from a quantity to its cumulative probability.
d6 = Cdf.from_seq([1,2,3,4,5,6])
d6.forward(3)
array(0.5)
forward
interpolates, so it works for quantities that are not in the distribution.
d6.forward(3.5)
array(0.5)
d6.forward(0)
array(0.)
d6.forward(7)
array(1.)
__call__
is a synonym for forward
, so you can call the Cdf
like a function (which it is).
d6(1.5)
array(0.16666667)
forward
can take an array of quantities, too.
def decorate_cdf(title):
"""Labels the axes.
title: string
"""
plt.xlabel('Quantity')
plt.ylabel('CDF')
plt.title(title)
qs = np.linspace(0, 7)
ps = d6(qs)
plt.plot(qs, ps)
decorate_cdf('Forward evaluation')
Cdf
also provides inverse
, which computes the inverse Cdf
:
d6.inverse(0.5)
array(3.)
quantile
is a synonym for inverse
d6.quantile(0.5)
array(3.)
inverse
and quantile
work with arrays
ps = np.linspace(0, 1)
qs = d6.quantile(ps)
plt.plot(qs, ps)
decorate_cdf('Inverse evaluation')
These functions provide a simple way to make a Q-Q plot.
Here are two samples from the same distribution.
cdf1 = Cdf.from_seq(np.random.normal(size=100))
cdf2 = Cdf.from_seq(np.random.normal(size=100))
cdf1.plot()
cdf2.plot()
decorate_cdf('Two random samples')
Here's how we compute the Q-Q plot.
def qq_plot(cdf1, cdf2):
"""Compute results for a Q-Q plot.
Evaluates the inverse Cdfs for a
range of cumulative probabilities.
:param cdf1: Cdf
:param cdf2: Cdf
:return: tuple of arrays
"""
ps = np.linspace(0, 1)
q1 = cdf1.quantile(ps)
q2 = cdf2.quantile(ps)
return q1, q2
The result is near the identity line, which suggests that the samples are from the same distribution.
q1, q2 = qq_plot(cdf1, cdf2)
plt.plot(q1, q2)
plt.xlabel('Quantity 1')
plt.ylabel('Quantity 2')
plt.title('Q-Q plot');
Here's how we compute a P-P plot
def pp_plot(cdf1, cdf2):
"""Compute results for a P-P plot.
Evaluates the Cdfs for all quantities in either Cdf.
:param cdf1: Cdf
:param cdf2: Cdf
:return: tuple of arrays
"""
qs = cdf1.index.union(cdf2)
p1 = cdf1(qs)
p2 = cdf2(qs)
return p1, p2
And here's what it looks like.
p1, p2 = pp_plot(cdf1, cdf2)
plt.plot(p1, p2)
plt.xlabel('Cdf 1')
plt.ylabel('Cdf 2')
plt.title('P-P plot');
For comments or questions about this section, see this issue.
Cdf
overrides the statistics methods to compute mean
, median
, etc.
psource(Cdf.mean)
def mean(self): """Expected value. :return: float """ return self.make_pmf().mean()
d6.mean()
3.5
psource(Cdf.var)
def var(self): """Variance. :return: float """ return self.make_pmf().var()
d6.var()
2.916666666666667
psource(Cdf.std)
def std(self): """Standard deviation. :return: float """ return self.make_pmf().std()
d6.std()
1.7078251276599332
For comments or questions about this section, see this issue.
choice
chooses a random values from the Cdf, following the API of np.random.choice
psource(Cdf.choice)
def choice(self, *args, **kwargs): """Makes a random sample. Uses the probabilities as weights unless `p` is provided. args: same as np.random.choice options: same as np.random.choice :return: NumPy array """ # TODO: Make this more efficient by implementing the inverse CDF method. pmf = self.make_pmf() return pmf.choice(*args, **kwargs)
d6.choice(size=10)
array([1, 5, 4, 6, 3, 1, 2, 4, 2, 5])
sample
chooses a random values from the Cdf
, following the API of pd.Series.sample
psource(Cdf.sample)
def sample(self, *args, **kwargs): """Makes a random sample. Uses the probabilities as weights unless `weights` is provided. This function returns an array containing a sample of the quantities in this Pmf, which is different from Series.sample, which returns a Series with a sample of the rows in the original Series. args: same as Series.sample options: same as Series.sample :return: NumPy array """ # TODO: Make this more efficient by implementing the inverse CDF method. pmf = self.make_pmf() return pmf.sample(*args, **kwargs)
d6.sample(n=10, replace=True)
array([1, 1, 6, 4, 4, 5, 2, 2, 2, 4])
For comments or questions about this section, see this issue.
Cdf
provides add_dist
, which computes the distribution of the sum.
The implementation uses outer products to compute the convolution of the two distributions.
psource(Cdf.add_dist)
def add_dist(self, x): """Computes the distribution of the sum of values drawn from self and x. x: Distribution, scalar, or sequence :return: new Distribution, same subtype as self """ pmf = self.make_pmf() res = pmf.add_dist(x) return self.make_same(res)
psource(Cdf.make_same)
def make_same(self, dist): """Convert the given dist to Cdf :param dist: :return: Cdf """ return dist.make_cdf()
Here's the distribution of the sum of two dice.
d6 = Cdf.from_seq([1,2,3,4,5,6])
twice = d6.add_dist(d6)
twice
probs | |
---|---|
2 | 0.027778 |
3 | 0.083333 |
4 | 0.166667 |
5 | 0.277778 |
6 | 0.416667 |
7 | 0.583333 |
8 | 0.722222 |
9 | 0.833333 |
10 | 0.916667 |
11 | 0.972222 |
12 | 1.000000 |
twice.step()
decorate_dice('Two dice')
twice.mean()
7.000000000000002
To add a constant to a distribution, you could construct a deterministic Pmf
const = Cdf.from_seq([1])
d6.add_dist(const)
probs | |
---|---|
2 | 0.166667 |
3 | 0.333333 |
4 | 0.500000 |
5 | 0.666667 |
6 | 0.833333 |
7 | 1.000000 |
But add_dist
also handles constants as a special case:
d6.add_dist(1)
probs | |
---|---|
2 | 0.166667 |
3 | 0.333333 |
4 | 0.500000 |
5 | 0.666667 |
6 | 0.833333 |
7 | 1.000000 |
Other arithmetic operations are also implemented
d4 = Cdf.from_seq([1,2,3,4])
d6.sub_dist(d4)
probs | |
---|---|
-3 | 0.041667 |
-2 | 0.125000 |
-1 | 0.250000 |
0 | 0.416667 |
1 | 0.583333 |
2 | 0.750000 |
3 | 0.875000 |
4 | 0.958333 |
5 | 1.000000 |
d4.mul_dist(d4)
probs | |
---|---|
1 | 0.0625 |
2 | 0.1875 |
3 | 0.3125 |
4 | 0.5000 |
6 | 0.6250 |
8 | 0.7500 |
9 | 0.8125 |
12 | 0.9375 |
16 | 1.0000 |
d4.div_dist(d4)
probs | |
---|---|
0.250000 | 0.0625 |
0.333333 | 0.1250 |
0.500000 | 0.2500 |
0.666667 | 0.3125 |
0.750000 | 0.3750 |
1.000000 | 0.6250 |
1.333333 | 0.6875 |
1.500000 | 0.7500 |
2.000000 | 0.8750 |
3.000000 | 0.9375 |
4.000000 | 1.0000 |
Pmf
implements comparison operators that return probabilities.
You can compare a Pmf
to a scalar:
d6.lt_dist(3)
0.3333333333333333
d4.ge_dist(2)
0.75
Or compare Pmf
objects:
d4.gt_dist(d6)
0.25
d6.le_dist(d4)
0.41666666666666663
d4.eq_dist(d6)
0.16666666666666666
Interestingly, this way of comparing distributions is nontransitive.
A = Cdf.from_seq([2, 2, 4, 4, 9, 9])
B = Cdf.from_seq([1, 1, 6, 6, 8, 8])
C = Cdf.from_seq([3, 3, 5, 5, 7, 7])
A.gt_dist(B)
0.5555555555555556
B.gt_dist(C)
0.5555555555555556
C.gt_dist(A)
0.5555555555555556