This Jupyter notebook is the Python equivalent of the R code in section 9.8 R, pp. 408 - 410, Introduction to Probability, 1st Edition, Blitzstein & Hwang.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
We can use simulation to show that in Example 9.1.7, the example of bidding on a mystery prize with unknown value, any bid will lead to a negative payout on average. First choose a bid b (we chose 0.6); then simulate a large number of hypothetical mystery prizes using rvs in scipy.stats.uniform, and store them in v:
# seed the random number generator
np.random.seed(5702887)
from scipy.stats import uniform
b = 0.6
nsim = 10**5
v = uniform.rvs(size=nsim)
The bid is accepted if b > (2/3)*v. To get the average profit conditional on an accepted bid, we use numpy.where to find the indices in v where the values satisfy the condition:
bid_accepted = v[np.where(b > (2/3)*v)]
avg_profit = np.mean(bid_accepted) - b
print('average profit conditional on an accepted bid: {}'.format(avg_profit))
average profit conditional on an accepted bid: -0.1499047321616897
This value is negative regardless of b, as you can check by experimenting with different values of b. Try changing the value of b in the code block above, and then hit SHIFT+ENTER to re-run the code blocks. You can also try changing the seed value for the random number generator.
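Rather than editing b by hand each time, the same experiment can be run over several bids in a loop. This is a minimal sketch (the particular bid values are our own choice for illustration):

```python
import numpy as np
from scipy.stats import uniform

# simulate mystery prizes once, then evaluate several bids against them
np.random.seed(5702887)
nsim = 10**5
v = uniform.rvs(size=nsim)

profits = {}
for b in [0.2, 0.4, 0.6, 0.8, 1.0]:
    # keep only the prizes for which the bid would be accepted
    accepted = v[b > (2/3)*v]
    profits[b] = accepted.mean() - b

print(profits)  # every conditional average profit comes out negative
```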
To verify the results of Example 9.1.9, we can start by generating a long sequence of fair coin tosses. This is done with the numpy.random.choice function, which chooses items with replacement from the specified list of H and T, returning a sequence whose length is specified by the size parameter. We use str.join, passing in an empty string '' to be interspersed between the elements of the given sequence. This results in a single string of H's and T's:
''.join(np.random.choice(['H', 'T'], size=100))
'HHHHTHHTTHTHTHTHTTHHHHTTHTTHHHHTHHHHHTHHHHHTTHHHHHHTTHHTHHHTTHTHHHHTTHHHTHTHTTHHTHTTHHHTTTHHTHTTTHHH'
A sequence of length 100 is enough to virtually guarantee that both HH and HT will have appeared at least once.
To determine how many tosses are required on average to see HH and HT, we need to generate many sequences of coin tosses. For this, we use our familiar friend the Python list comprehension to execute a large number of iterations:
np.random.seed(9227465)
r = [''.join(np.random.choice(['H', 'T'], size=100)) for _ in range(10**3)]
Now r contains a thousand sequences of coin tosses, each of length 100. To find the first appearance of HH in each of these sequences, it is easiest to use regular expressions via Python's re library. We load the re library with the import keyword, and use re.search to obtain the start and end indices of the first appearance of HH in each sequence ht_seq in r.
import re
# to learn more about re, un-comment the following line
#print(re.__doc__)
target = 'HH'
t = [re.search(target, ht_seq).span() for ht_seq in r]
t = np.array(t)
print('matrix t has shape: {}'.format(t.shape))
t[0:10, :]
matrix t has shape: (1000, 2)
array([[1, 3],
       [1, 3],
       [4, 6],
       [2, 4],
       [3, 5],
       [0, 2],
       [4, 6],
       [0, 2],
       [5, 7],
       [1, 3]])
The code above creates a two-column matrix t, whose columns contain the starting and ending positions of the first appearance of HH in each sequence of coin tosses. (Use t[0:10, :] to display the first 10 rows of the matrix and get an idea of what your results look like.) What we want are the ending positions, given by the second column. In particular, we want the average value of the second column, which is an approximation of the average waiting time for HH:
mean = t[:, 1].mean()
print('average waiting time for \'{}\': {}'.format(target, mean))
average waiting time for 'HH': 6.16
Is your answer around 6? Trying again with HT instead of HH, is your answer around 4? You can change the value of the target variable in the earlier code block, and then hit SHIFT+ENTER to execute the code blocks again.
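Both waiting times can also be estimated in one pass by looping over the two targets, instead of editing the target variable by hand. A minimal sketch, reusing the same seed and simulated sequences as above:

```python
import re
import numpy as np

# regenerate the same thousand sequences as in the text
np.random.seed(9227465)
r = [''.join(np.random.choice(['H', 'T'], size=100)) for _ in range(10**3)]

avg_wait = {}
for target in ['HH', 'HT']:
    # span()[1] is the index just past the match, i.e. the toss count
    ends = [re.search(target, ht_seq).span()[1] for ht_seq in r]
    avg_wait[target] = np.mean(ends)

print(avg_wait)  # theory: 6 tosses for 'HH', 4 for 'HT'
```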
In Example 9.3.10, we derived formulas for the slope and intercept of a linear regression model, which can be used to predict a response variable using an explanatory variable. Let's try to apply these formulas to a simulated dataset:
np.random.seed(14930352)
from scipy.stats import norm
x = norm.rvs(size=100)
y = 3 + 5*x + norm.rvs(size=100)
The array x contains 100 realizations of the random variable $X \sim N(0, 1)$ and the array y contains 100 realizations of the random variable $Y = a + bX + \epsilon$, where $\epsilon \sim N(0,1)$. As we can see, the true values of $a$ and $b$ for this dataset are 3 and 5, respectively. We can visualize the data as a scatterplot with matplotlib.pyplot.scatter(x,y).
# numpy.cov(x, y) returns a 2 x 2 covariance matrix
# cov(x,x) cov(x,y)
# cov(y,x) cov(y,y)
cov_xy = np.cov(x, y, ddof=1)[0][1]
var_x = np.var(x, ddof=1)
b = cov_xy / var_x
a = np.mean(y) - b*np.mean(x)
print('b = {}'.format(b))
print('a = {}'.format(a))
b = 4.869568946916432
a = 2.9953152340248232
Here numpy.cov(x, y, ddof=1)[0][1], numpy.var(x, ddof=1), and numpy.mean(x) provide the sample covariance, sample variance, and sample mean, estimating the quantities $Cov(X, Y), Var(X)$, and $\mathbb{E}(X)$, respectively. (We have discussed sample mean and sample variance in detail in earlier chapters. Sample covariance is defined analogously, and is a natural way to estimate the true covariance.)
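To see what numpy.cov is computing, the sample covariance can be written out by hand as $\frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})$ and compared to the library result. A quick sketch using the same simulated data:

```python
import numpy as np
from scipy.stats import norm

# same simulated dataset as in the text
np.random.seed(14930352)
x = norm.rvs(size=100)
y = 3 + 5*x + norm.rvs(size=100)

# sample covariance computed from its definition
n = len(x)
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# off-diagonal entry of the 2 x 2 covariance matrix
library_cov = np.cov(x, y, ddof=1)[0][1]

print(manual_cov, library_cov)  # the two values agree
```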
You should find that b is close to 5 and a is close to 3. These estimated values define the line of best fit. We use yet another list comprehension to calculate the y values on the best-fit line corresponding to each x, using a and b, and then matplotlib.pyplot.plot lets us render the line of best fit on top of our scatterplot:
plt.scatter(x, y, color='#91bfdb')
abline_values = [b * val + a for val in x]
plt.plot(x, abline_values, lw=2.2, alpha=0.8, color='#fc8d59')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
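As a sanity check, the hand-computed slope and intercept can be compared against scipy.stats.linregress, which implements the same least-squares formulas. A minimal sketch on the same simulated data:

```python
import numpy as np
from scipy.stats import norm, linregress

# same simulated dataset as in the text
np.random.seed(14930352)
x = norm.rvs(size=100)
y = 3 + 5*x + norm.rvs(size=100)

# slope and intercept from the formulas in Example 9.3.10
b = np.cov(x, y, ddof=1)[0][1] / np.var(x, ddof=1)
a = np.mean(y) - b * np.mean(x)

# the same fit via scipy
result = linregress(x, y)
print(result.slope, b)
print(result.intercept, a)
```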
© Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science).