A summary of "Probability and Statistics in Data Science using Python", offered from UCSD DSE210x
import numpy as np
import matplotlib.pyplot as plt
Why should you care about probability and statistics? They underlie, for example:
Navigation software
Search engines
Insurance companies
We all know that if one flips a fair coin then the outcome is "heads" or "tails" with equal probabilities. What does that mean?
It means that if we flip the coin $k$ times, for some large value of $k$, say $k = 10000$, then the number of "heads" is about $\frac{k}{2} = \frac{10000}{2} = 5000$.
We will use the pseudo-random number generators in numpy to simulate the coin flips. Instead of "heads" and "tails", we will use $x_i = +1$ or $x_i = -1$ and consider the sum $S_{10000} = x_1 + x_2 + \dots + x_{10000}$. We will also vary the number of coin flips, which we denote by $k$.
# Generate the sum of k coin flips, repeat that n times
def generate_counts(k=1000, n=100):
    # Generate a k x n matrix of +/-1 random numbers
    X = 2 * (np.random.rand(k, n) > 0.5) - 1
    S = np.sum(X, axis=0)
    return S
k = 1000
n = 1000
counts = generate_counts(k=k, n=n)
plt.figure(figsize=[10, 4]);
plt.hist(counts);
plt.xlim([-k, k]);
plt.xlabel("sum");
plt.ylabel('count');
plt.title('Histogram of coin flip sum when flipping a fair coin %d times' % k);
plt.grid();
Note that the sum $S_{1000}$ is not exactly 0, it is only close to 0.
Using probability theory, we can calculate how small $\vert S_k \vert$ typically is.
In a later lesson, we will show that the probability that
$$ \vert S_k \vert \ge 4 \sqrt{k} $$
is smaller than $2 \times 10^{-8}$, which is $0.000002\%$.
First, let's use our simulation to demonstrate that this is the case:
from math import sqrt

plt.figure(figsize=[13, 3.5]);
for j in range(2, 5):
    k = 10 ** j
    counts = generate_counts(k=k, n=100)
    plt.subplot(130 + j - 1);
    plt.hist(counts, bins=10);
    d = 4 * sqrt(k)
    plt.plot([-d, -d], [0, 30], 'r');
    plt.plot([+d, +d], [0, 30], 'r');
    plt.grid();
    plt.title(r'%d flips, bound=$\pm$%6.1f' % (k, d));
So far the x-axis limits adapted to the data. If we instead plot the same histograms on the full scale $[-k, k]$, we see how tight the bound really is:
plt.figure(figsize=[13, 3.5]);
for j in range(2, 5):
    k = 10 ** j
    counts = generate_counts(k=k, n=100)
    plt.subplot(130 + j - 1);
    plt.hist(counts, bins=10);
    plt.xlim([-k, k]);
    d = 4 * sqrt(k)
    plt.plot([-d, -d], [0, 30], 'r');
    plt.plot([+d, +d], [0, 30], 'r');
    plt.grid();
    plt.title(r'%d flips, bound=$\pm$%6.1f' % (k, d));
We did some experiments summing $k$ random numbers:
$$ S_k = x_1 + x_2 + \dots + x_k $$
where $x_i = -1$ with probability $\frac{1}{2}$ and $x_i = +1$ with probability $\frac{1}{2}$.
Our experiments show that the sum $S_k$ is (almost) always in the range $[-4\sqrt{k}, +4\sqrt{k}]$. Since
$$ \text{if } k \rightarrow \infty, \quad \frac{4 \sqrt{k}}{k} = \frac{4}{\sqrt{k}} \rightarrow 0, $$
it follows that
$$ \text{if } k \rightarrow \infty, \quad \frac{S_k}{k} \rightarrow 0. $$
Probability theory is the math involved in proving (a precise version of) the statements above.
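The convergence of $\frac{S_k}{k}$ to 0 is easy to see numerically. Here is a minimal sketch (the seed and the particular values of $k$ are arbitrary choices, not from the course):

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is reproducible (arbitrary choice)
for k in [100, 10_000, 1_000_000]:
    x = 2 * (np.random.rand(k) > 0.5) - 1   # k flips, each +1 or -1
    S = int(x.sum())
    # S_k stays within about 4*sqrt(k), so S_k/k shrinks as k grows
    print("k=%9d  S_k=%6d  S_k/k=%+.4f" % (k, S, S / k))
```

As $k$ grows by a factor of 100, $\vert S_k/k \vert$ shrinks by roughly a factor of 10, as the $\frac{4}{\sqrt{k}}$ bound predicts.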
In most cases, we can approximate probabilities using simulations (Monte Carlo simulations).
Calculating the probabilities is better because:
Probability theory computes probabilities of complex events given the underlying base probabilities.
Statistics takes us in the opposite direction: we are given data that was generated by a stochastic process, and we infer properties of the underlying base probabilities.
We discussed the distribution of the number of heads when flipping a fair coin many times. Let's turn the question around: we flip a coin 1000 times and get 570 heads. Can we conclude that the coin is biased (not fair)? What can we conclude if we got 507 heads?
The answer uses the following logic.
Recall the coin flip example from the previous section.
We used $x_i = -1$ for tails and $x_i = +1$ for heads.
We looked at the sum $S_k = \sum_{i=1}^{k} x_i$, here $k=1000$.
If the number of heads is 570, then $S_{1000} = 570 - 430 = 140$.
For a fair coin it is very unlikely that $\vert S_{1000} \vert \gt 4 \sqrt{k} \approx 126.5$. Since $140 > 126.5$, it is very unlikely that the coin is unbiased.
507 heads and 493 tails $\Rightarrow S_{1000} = 507 - 493 = 14$, and $14 \ll 126.5$.
So we cannot conclude that the coin is biased.
The probability that an unbiased coin would generate a sequence with 570 or more heads is extremely small. From which we can conclude, with high confidence, that the coin is biased.
On the other hand, $\vert S_{1000} \vert \gt 14$ is quite likely. So getting 507 heads does not provide evidence that the coin is biased.
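The decision rule above can be packaged as a small helper function (the name `looks_biased` is ours, not from the course):

```python
from math import sqrt

def looks_biased(heads, k=1000):
    """Apply the |S_k| > 4*sqrt(k) rule to a count of heads out of k flips."""
    S_k = heads - (k - heads)       # heads minus tails
    return abs(S_k) > 4 * sqrt(k)   # 4*sqrt(1000) is about 126.5

print(looks_biased(570))   # True: |140| > 126.5, strong evidence of bias
print(looks_biased(507))   # False: |14| <= 126.5, no evidence of bias
```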
You might ask "why should I care whether a coin is biased?"
A common practice when optimizing a web page is to perform A/B tests.
Statistics is about analyzing real-world data and drawing conclusions. Examples include:
Suppose we have three cards in a hat: one card is red on both sides, one is black on both sides, and one is red on one side and black on the other.
from random import random

red_bck = "\x1b[41m%s\x1b[0m"
blue_bck = "\x1b[44m%s\x1b[0m"
red = red_bck % 'R'
black = blue_bck % 'B'
Cards = [(red, black), (red, red), (black, black)]
counts = {'same': 0, 'different': 0}
for j in range(50):
    # Select a random card
    i = int(random() * 3.)
    side = int(random() * 2.)
    C = Cards[i]
    # Select which side to be "up"
    if side == 1:
        C = (C[1], C[0])
    same = 'same' if C[0] == C[1] else 'different'
    # Count the number of times the two sides are the same or different.
    counts[same] += 1
    print(''.join(C) + ' %-9s' % same, end='')
    if (j + 1) % 5 == 0:
        print()
print()
print(counts)
(Sample output: 50 draws printed 5 per row, e.g. "BB same", "RB different", "RR same", ...; terminal color codes omitted. Final counts:)
{'same': 40, 'different': 10}
In the simulation, the two sides have the same color about twice as often as they have different colors.
If you bet that the two sides differ, you are twice as likely to lose as you are to win.
On average, you lose about 33 cents per round (assuming a $1 even-money bet).
The reason: if we pick a card at random, 2/3 of the time we pick a card where the two sides have the same color, and only 1/3 of the time one where the colors are different.
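To check the 2/3 figure and the 33-cent average with less noise, one can replay the game for many more rounds; the $1 even-money stake and the seed are our assumptions for illustration:

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility
cards = [('R', 'B'), ('R', 'R'), ('B', 'B')]
n = 100_000
winnings = 0
same = 0
for _ in range(n):
    card = random.choice(cards)      # pick a card uniformly at random
    if random.random() < 0.5:        # pick which side faces up
        card = (card[1], card[0])
    if card[0] == card[1]:
        same += 1
        winnings -= 1                # lose $1 when the sides match
    else:
        winnings += 1                # win $1 when they differ
print('fraction same:', same / n)                  # close to 2/3
print('avg winnings per round: %.2f' % (winnings / n))  # close to -0.33
```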
In this exercise you will write code to estimate the probability that $n$ flips of a fair coin will result in a number of "heads" between $k_1$ and $k_2$.
You should write the body of two functions:

seq_sum(n): generates a random sequence of coin flips and counts the number of heads.

estimate_prob(n,k1,k2,m): using calls to seq_sum, estimates the probability of the number of heads being between $k_1$ and $k_2$.

First, a quick look at np.random.rand:

np.random.rand()
0.020644852872238384
np.random.rand(4)
array([0.95931166, 0.52807626, 0.08801368, 0.64187303])
Write a function, seq_sum(n), which generates $n$ random coin flips from a fair coin and then returns the number of heads. A fair coin is defined to be a coin where $P(\text{heads}) = \frac{1}{2}$.
The output type should be a numpy integer. Hint: use np.random.rand().
*Code:*
x = seq_sum(100)
print(x)
print([seq_sum(2) for x in range(20)])
*Output:*
49
[0, 1, 1, 1, 1, 2, 1, 2, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 1, 1]
def seq_sum(n):
    """ input: n, generate a sequence of n random coin flips
        output: return the number of heads
        Hint: For simplicity, use 1,0 to represent head,tails
    """
    seq_list = np.array([np.random.rand() for _ in range(n)])
    return np.sum(seq_list >= 0.5)
x = seq_sum(100)
print(x)
assert np.unique([seq_sum(2) for x in range(0,200)]).tolist() == [0, 1, 2]
47
Write a function, estimate_prob(n,k1,k2,m), that uses seq_sum(n) to estimate the probability that the number of heads is between $k_1$ and $k_2 - 1$.

The function should estimate the probability by running $m$ different trials of seq_sum(n), probably using a for loop.

In order to receive full credit, estimate_prob MUST call seq_sum (i.e., seq_sum must be called inside the estimate_prob function).
*Code:*
x = estimate_prob(100,45,55,1000)
print(x)
print(type(x))
*Output:*
0.686
<class 'float'>
def estimate_prob(n, k1, k2, m):
    """Estimate the probability that n flips of a fair coin result in k1 to k2 heads
       n: the number of coin flips (length of the sequence)
       k1,k2: the trial is successful if the number of heads is
              between k1 and k2-1
       m: the number of trials (number of sequences of length n)
       output: the estimated probability
    """
    success = 0
    for _ in range(m):
        heads = seq_sum(n)
        if k1 <= heads < k2:
            success += 1
    return success / m
x = estimate_prob(100,45,55,1000)
print(x)
assert 'float' in str(type(x))
0.688
We can now check how close these estimates are to the true probabilities.
These helper functions are used to calculate the actual probabilities. They are used to test your code.
It is not required that you understand how they work.
from math import erf, sqrt

def calc_prob(n, k1, k2):
    """Calculate the probability using a normal approximation"""
    n = float(n); k1 = float(k1); k2 = float(k2)
    z1 = (k1 - 0.5 * n) / (sqrt(n) / 2)
    z2 = (k2 - 0.5 * n) / (sqrt(n) / 2)
    return (erf(z2 / sqrt(2)) - erf(z1 / sqrt(2))) / 2
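As a sanity check, for moderate $n$ the same probability can also be computed exactly from the binomial distribution (using math.comb, available in Python 3.8+) rather than via the normal approximation; `exact_prob` is our name for this hypothetical helper:

```python
from math import comb, erf, sqrt

def exact_prob(n, k1, k2):
    """P(k1 <= #heads < k2) for n fair coin flips, summed exactly."""
    return sum(comb(n, k) for k in range(k1, k2)) / 2 ** n

def calc_prob(n, k1, k2):
    """Normal approximation, as defined above."""
    z1 = (k1 - 0.5 * n) / (sqrt(n) / 2)
    z2 = (k2 - 0.5 * n) / (sqrt(n) / 2)
    return (erf(z2 / sqrt(2)) - erf(z1 / sqrt(2))) / 2

print(exact_prob(100, 45, 55))   # exact binomial probability, about 0.68
print(calc_prob(100, 45, 55))    # normal approximation, close to the exact value
```

The two values agree to within a couple of percent, which is why the normal approximation is good enough for testing the estimator.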
def evaluate(n, q1, q2, m, r=100):
    """Run estimate_prob many times and test whether the estimates are consistent with calc_prob"""
    k1 = int(q1 * n)
    k2 = int(q2 * n)
    p = calc_prob(n, k1, k2)
    std = sqrt(p * (1 - p) / m)
    print('computed prob=%5.3f, std=%5.3f' % (p, std))
    L = [estimate_prob(n, k1, k2, m) for i in range(r)]
    med = np.median(L)
    print('ran estimator %d times, with parameters n=%d,k1=%d,k2=%d,m=%d' % (r, n, k1, k2, m))
    print('median of estimates=%5.3f, error of median estimator=%5.3f, std=%5.3f' % (med, med - p, std))
    return L, med, p, std, abs((med - p) / std)
def test_report_assert(n, q1, q2, m, r=100):
    k1 = int(q1 * n)
    k2 = int(q2 * n)
    L, med, p, std, norm_err = evaluate(n, q1, q2, m, r=r)
    plt.hist(L);
    plt.plot([p, p], plt.ylim(), 'r', label='true prob')
    plt.plot([med, med], plt.ylim(), 'k', label='median of %d estimates' % r)
    mid_y = np.mean(plt.ylim())
    plt.plot([p - std, p + std], [mid_y, mid_y], 'g', label='+-std')
    plt.legend();
    print('normalized error of median=', norm_err, 'should be <1.0')
    plt.title('r=%d,n=%d,k1=%d,k2=%d,m=%d,\nnorm_err=%4.3f' % (r, n, k1, k2, m, norm_err))
    assert norm_err < 1.0
m = 100
i = 1
plt.figure(figsize=[10, 12])
for n in [100, 1000]:
    for q1, q2 in [(0.4, 0.6), (0.55, 1.00), (0.47, 0.499)]:
        fig = plt.subplot(3, 2, i)
        print('#### test no.', i)
        i += 1
        test_report_assert(n, q1, q2, m, r=100)
plt.tight_layout()
#### test no. 1
computed prob=0.954, std=0.021
ran estimator 100 times, with parameters n=100,k1=40,k2=60,m=100
median of estimates=0.950, error of median estimator=-0.004, std=0.021
normalized error of median= 0.21591965634481614 should be <1.0
#### test no. 2
computed prob=0.159, std=0.037
ran estimator 100 times, with parameters n=100,k1=55,k2=100,m=100
median of estimates=0.185, error of median estimator=0.026, std=0.037
normalized error of median= 0.7210739298262482 should be <1.0
#### test no. 3
computed prob=0.146, std=0.035
ran estimator 100 times, with parameters n=100,k1=47,k2=49,m=100
median of estimates=0.130, error of median estimator=-0.016, std=0.035
normalized error of median= 0.46627417809069077 should be <1.0
#### test no. 4
computed prob=1.000, std=0.000
ran estimator 100 times, with parameters n=1000,k1=400,k2=600,m=100
median of estimates=1.000, error of median estimator=0.000, std=0.000
normalized error of median= 0.0001593621193426113 should be <1.0
#### test no. 5
computed prob=0.001, std=0.003
ran estimator 100 times, with parameters n=1000,k1=550,k2=1000,m=100
median of estimates=0.000, error of median estimator=-0.001, std=0.003
normalized error of median= 0.27987751426889984 should be <1.0
#### test no. 6
computed prob=0.446, std=0.050
ran estimator 100 times, with parameters n=1000,k1=470,k2=499,m=100
median of estimates=0.440, error of median estimator=-0.006, std=0.050
normalized error of median= 0.11861045590584546 should be <1.0