#!/usr/bin/env python
# coding: utf-8
# In[1]:
from IPython.display import Image
Image('../../Python_probability_statistics_machine_learning_2E.png',width=200)
# In[2]:
import numpy as np
np.random.seed(1234)
# The estimation problem starts with the desire to infer something meaningful
# from
# data. For parametric estimation, the strategy is to postulate a model for
# the
# data and then use the data to fit model parameters. This leads to two
# fundamental questions: where to get the model and how to estimate the
# parameters? The first question is best answered by the maxim: *all models are
# wrong, some are useful*. In other words, choosing a model depends as much on
# the
# application as on the model itself. Think about models as building
# different
# telescopes to view the sky. No one would ever claim that the telescope
# generates the sky! It is the same with data models. Models give us multiple
# perspectives on the data, which are themselves proxies for some deeper
# underlying phenomenon.
# Some
# categories of data may be more commonly studied using certain types of
# models,
# but this is usually very domain-specific and ultimately depends on the
# aims of
# the analysis. In some cases, there may be strong physical reasons
# behind
# choosing a model. For example, one could postulate that the model is
# linear
# with some noise as in the following:
#
# $$
# Y = a X + \epsilon
# $$
#
# which basically
# says that you, as the experimenter, dial in some
# value for $X$
# and then read off
# something directly proportional to $X$ as the
# measurement,
# $Y$, plus some
# additive noise that you attribute to jitter in the
# apparatus.
# Then, the next
# step is to estimate the parameter $a$ in the model, given some postulated
# claim about the nature of $\epsilon$. How to compute the model parameters
# depends on the particular methodology. The two broad rubrics are parametric
# and non-parametric estimation. In the former, we assume we know
# the
# density function of
# the data and then try to derive the embedded parameters
# for
# it. In the latter,
# we claim only to know that the density function is a
# member
# of a broad class of
# density functions and then use the data
# to characterize a
# member of that class.
# Broadly speaking, the former consumes
# less data than the
# latter, because there
# are fewer unknowns to compute from
# the data.
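#
# To make this concrete, the following is a minimal sketch, assuming a
# hypothetical true slope `a_true=2` and Gaussian jitter for $\epsilon$; the
# one-line least-squares formula is just one possible way to estimate $a$,
# which we develop properly later.
# In[ ]:
import numpy as np
a_true = 2.0                           # hypothetical true slope
X = np.linspace(0,1,50)                # experimenter dials in X
Y = a_true*X + 0.1*np.random.randn(50) # proportional response plus jitter
a_hat = np.sum(X*Y)/np.sum(X**2)       # least-squares estimate of the slope
print(a_hat)                           # should be close to a_true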
#
# Let's
# concentrate on parametric
# estimation for now. The tradition is to denote
# the
# unknown parameter to be
# estimated as $\theta$ which is a member of a large
# space
# of alternates,
# $\Theta$. To judge between potential $\theta$ values, we
# need an
# objective
# function, known as a *risk* function,
# $L(\theta,\hat{\theta})$, where
# $\hat{\theta}(\mathbf{x})$ is an
# estimate for the unknown $\theta$ that is
# derived from the available
# data $\mathbf{x}$. The most common and useful risk
# function is the
# squared error loss,
#
# $$
# L(\theta,\hat{\theta}) =
# (\theta-\hat{\theta})^2
# $$
#
# Although neat, this is not practical because we need to know the unknown
# $\theta$ to compute it. The other problem is that, because $\hat{\theta}$ is
# a function of the observed data, it is also a random variable with its own
# probability density function. This leads to the notion of the
# *expected risk*
# function,
#
# $$
# R(\theta,\hat{\theta}) =
# \mathbb{E}_\theta(L(\theta,\hat{\theta})) = \int
# L(\theta,\hat{\theta}(\mathbf{x})) f(\mathbf{x};\theta) d \mathbf{x}
# $$
#
# In
# other words, given a fixed $\theta$, integrate over the
# probability density
# function of the data, $f(\mathbf{x})$, to compute the
# risk. Plugging in for the
# squared error loss, we compute the
# mean squared error,
#
# $$
# \mathbb{E}_\theta(\theta-\hat{\theta})^2 =\int (\theta-\hat{\theta})^2
# f(\mathbf{x};\theta) d \mathbf{x}
# $$
#
# This has the important factorization into
# the *bias*,
#
# $$
# \texttt{bias} = \mathbb{E}_\theta(\hat{\theta})-\theta
# $$
# with the corresponding variance, $\mathbb{V}_\theta(\hat{\theta})$ as
# in the
# following *mean squared error* (MSE):
#
# $$
# \mathbb{E}_\theta(\theta-\hat{\theta})^2=
# \texttt{bias}^2+\mathbb{V}_\theta(\hat{\theta})
# $$
#
# This is an important
# trade-off that we will return to repeatedly. The
# idea is
# the bias is nonzero
# when the estimator $\hat{\theta}$, integrated
# over all
# possible data,
# $f(\mathbf{x})$, does not equal the underlying target
# parameter
# $\theta$. In
# some sense, the estimator misses the target, no matter
# how much
# data is used.
# When the bias equals zero, the estimator is *unbiased*.
# For fixed
# MSE, low bias
# implies high variance and vice-versa. This trade-off
# was once not
# emphasized and
# instead much attention was paid to the smallest
# variance of
# unbiased estimators
# (see Cramér-Rao bounds). In practice,
# understanding and
# exploiting the trade-off
# between bias and variance and
# reducing the MSE is more
# important.
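#
# As a concrete check of this decomposition, the following sketch (assuming a
# Bernoulli model with a hypothetical $p=0.3$ and the sample mean as the
# estimator) verifies numerically that the empirical MSE equals the squared
# bias plus the variance.
# In[ ]:
import numpy as np
p = 0.3                                           # hypothetical true parameter
phat = (np.random.rand(1000,50) < p).mean(axis=1) # 1000 sample-mean estimates
bias = phat.mean() - p
mse = np.mean((phat - p)**2)
print(mse, bias**2 + phat.var())                  # the two should match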
#
# With all this
# set up, we can now ask how bad can bad get by
# examining *minimax* risk,
#
# $$
# R_{\texttt{mmx}} = \inf_{\hat{\theta}} \sup_\theta R(\theta,\hat{\theta})
# $$
# where the $\inf$ is taken over all estimators. Intuitively, this
# means if we
# found the worst possible $\theta$ and swept over all possible
# parameter
# estimators $\hat{\theta}$, and then took the smallest possible risk
# we could
# find, we would have the minimax risk. Thus, an estimator,
# $\hat{\theta}_{\texttt{mmx}}$, is a *minimax estimator* if it achieves this
# feat,
#
# $$
# \sup_\theta R(\theta,\hat{\theta}_{\texttt{mmx}})
# =\inf_{\hat{\theta}}
# \sup_\theta R(\theta,\hat{\theta})
# $$
#
# In other words,
# even in the face of the worst $\theta$ (i.e., the
# $\sup_\theta$),
# $\hat{\theta}_{\texttt{mmx}}$ still achieves the minimax
# risk. There is a
# greater theory that revolves around minimax estimators of
# various kinds, but
# this is far beyond our scope here. The main thing to focus
# on
# is that under
# certain technical but easily satisfiable conditions, the
# maximum
# likelihood
# estimator is approximately minimax. Maximum likelihood is
# the subject
# of the
# next section. Let's get started with the simplest
# application: coin-flipping.
# ## Setting up the Coin Flipping Experiment
#
# Suppose we have a coin and
# want to
# estimate the probability of heads ($p$) for
# it. We model the
# distribution of
# heads and tails as a Bernoulli distribution
# with the following
# probability mass
# function:
#
# $$
# \phi(x)= p^x (1-p)^{(1-x)}
# $$
#
# where $x$ is the outcome, *1* for
# heads and *0* for tails. Note that
# maximum
# likelihood is a parametric method
# that requires the specification of a
# particular model for which we will compute
# embedded parameters. For $n$
# independent flips, we have the joint density as the
# product of $n$ of
# these
# functions as in,
#
# $$
# \phi(\mathbf{x})=\prod_{i=1}^n
# p^{x_i} (1-p)^{(1-x_i)}
# $$
#
# The following is the *likelihood function*,
#
# $$
# \mathcal{L}(p ; \mathbf{x})= \prod_{i=1}^n p^{ x_i }(1-p)^{1-x_i}
# $$
#
# This is
# basically notation. We have just renamed the
# previous equation to
# emphasize the
# $p$ parameter, which is what
# we want to estimate.
#
# The principle
# of *maximum
# likelihood* is to maximize the likelihood as the
# function of $p$
# after plugging
# in all of the $x_i$ data. We then call this
# maximizer $\hat{p}$
# which is a
# function of the observed $x_i$ data, and as
# such, is a random
# variable with its
# own distribution. This method therefore
# ingests data and an
# assumed model for
# the probability density, and produces a
# function that
# estimates the embedded
# parameter in the assumed probability
# density. Thus,
# maximum likelihood
# generates the *functions* of data that we
# need in order to
# get at the underlying
# parameters of the model. Note that there
# is no limit to
# the ways we can
# functionally manipulate the data we have
# collected. The maximum
# likelihood
# principle gives us a systematic method for
# constructing these
# functions subject
# to the assumed model. This is a point
# worth emphasizing: the
# maximum likelihood
# principle yields functions as
# solutions the same way solving
# differential
# equations yields functions as
# solutions. It is very, very much
# harder to produce
# a function than to produce a
# value as a solution, even with
# the assumption of a
# convenient probability
# density. Thus, the power of the
# principle is that you
# can construct such
# functions subject to the model
# assumptions.
#
# ### Simulating the Experiment
#
# We need the following code to
# simulate coin flipping.
# In[3]:
from scipy.stats import bernoulli
p_true=1/2.0 # estimate this!
fp=bernoulli(p_true) # create bernoulli random variate
xs = fp.rvs(100) # generate some samples
print (xs[:30]) # see first 30 samples
# Now, we can write out the likelihood function using Sympy. Note
# that we give
# the Sympy variables the `positive=True` attribute upon
# construction because this
# eases Sympy's internal simplification algorithms.
# In[4]:
import sympy
x,p,z=sympy.symbols('x p z', positive=True)
phi=p**x*(1-p)**(1-x) # distribution function
L=np.prod([phi.subs(x,i) for i in xs]) # likelihood function
print (L) # the symbolic likelihood, a polynomial in p
# Note that, once we plug in the data, the likelihood function is
# solely a
# function of the unknown parameter ($p$ in this case). The following
# code uses
# calculus to find the extrema of the likelihood function. Note that
# taking the
# `log` of $L$ makes the maximization problem tractable but doesn't
# change the
# extrema.
# In[5]:
logL=sympy.expand_log(sympy.log(L))
sol,=sympy.solve(sympy.diff(logL,p),p)
print (sol)
# **Programming Tip.**
#
# Note that `sol,=sympy.solve` statement includes
# a comma
# after the `sol` variable. This is because the `solve`
# function returns a list
# containing a single element. Using
# this assignment unpacks that single element
# into the `sol` variable
# directly. This is another one of the many small
# elegancies of Python.
#
#
#
# The following code generates
# [Figure](#fig:Maximum_likelihood_10_2).
# In[6]:
from matplotlib.pylab import subplots
fig,ax=subplots()
x=np.linspace(0.001,1-.001,100)
ax.plot(x,list(map(sympy.lambdify(p,logL,'numpy'),x)),'k-',lw=3)
ax.plot(sol,logL.subs(p,sol),'o',
color='gray',ms=15,label='Estimated')
ax.plot(p_true,logL.subs(p,p_true),'s',
color='k',ms=15,label='Actual')
ax.set_xlabel('$p$',fontsize=18)
ax.set_ylabel('Likelihood',fontsize=18)
ax.set_title('Estimate not equal to true value',fontsize=18)
ax.legend(loc=0)
# **Programming Tip.**
#
# In the prior code, we use the `lambdify` function in
# `lambdify(p,logL,'numpy')` to
# take a Sympy expression and convert it into a
# Numpy version that is easier to
# compute. The `lambdify` function has an extra
# argument where you can specify
# the function space that it should use to convert
# the expression. In the above
# this is set to Numpy.
#
# <!-- fig:Maximum_likelihood_10_2 -->
# Maximum likelihood estimate vs. true parameter. Note that the estimate is
# slightly off from the true value. This is a consequence of the fact that the
# estimator is a function of the data and lacks knowledge of the true
# underlying value.
#
# [Figure](#fig:Maximum_likelihood_10_2) shows that our estimator $\hat{p}$
# (circle) is not equal to the true value of $p$ (square), despite being
# the
# maximum of the likelihood function. This may sound disturbing, but keep in
# mind
# this estimate is a function of the random data; and since that data can
# change,
# the ultimate estimate can likewise change.
# Remember that the estimator is a
# *function* of the data and is thus also a
# *random variable*, just like the data
# is. This means it has its own probability
# distribution with corresponding mean
# and variance. So, what we are observing is
# a consequence of that variance.
#
# <!-- fig:Maximum_likelihood_30_2 -->
# Histogram of maximum likelihood estimates. The title shows the estimated
# mean and standard deviation of the samples.
#
# [Figure](#fig:Maximum_likelihood_30_2) shows what happens when you run many
# thousands of coin experiments and compute the maximum likelihood
# estimate for
# each experiment, given a particular number of samples
# per experiment. This
# simulation gives us a histogram of the maximum likelihood
# estimates, which is an
# approximation of the probability distribution of the
# $\hat{p}$ estimator itself.
# This figure shows that the sample mean
# of the estimator ($\mu = \frac{1}{n}\sum
# \hat{p}_i$) is pretty close to the
# true value, but looks can be deceiving. The
# only way to know for sure is to
# check if the estimator is unbiased, namely, if
# $$
# \mathbb{E}(\hat{p}) = p
# $$
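#
# The code behind [Figure](#fig:Maximum_likelihood_30_2) is not shown above;
# the following is a minimal sketch of that simulation, assuming 5000
# experiments of 100 flips each using the `p_true` defined earlier.
# In[ ]:
from scipy.stats import bernoulli
from matplotlib.pylab import subplots
phat = bernoulli(p_true).rvs((100,5000)).mean(axis=0) # one estimate per column
fig,ax = subplots()
_=ax.hist(phat,bins=30,color='gray')
_=ax.set_title('mean=%3.3f, std=%3.3f'%(phat.mean(),phat.std()))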
#
# Because this problem is simple, we can solve for
# this in general
# noting that
# the terms above are either $p$, if $x_i=1$ or $1-p$
# if $x_i=0$.
# This means that
# we can write
#
# $$
# \mathcal{L}(p\vert \mathbf{x})=
# p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n
# x_i}
# $$
#
# with corresponding logarithm
# as
#
# $$
# J=\log(\mathcal{L}(p\vert \mathbf{x})) = \log(p) \sum_{i=1}^n x_i +
# \log(1-p) \left(n-\sum_{i=1}^n x_i\right)
# $$
#
# Taking the derivative of this
# gives:
#
# $$
# \frac{dJ}{dp} = \frac{1}{p}\sum_{i=1}^n x_i + \frac{(n-\sum_{i=1}^n
# x_i)}{p-1}
# $$
#
# and solving this for $p$ leads to
#
# $$
# \hat{p} = \frac{1}{ n}
# \sum_{i=1}^n x_i
# $$
#
# This is our *estimator* for $p$. Up until now, we have been
# using Sympy to
# solve
# for this based on the data $x_i$ but now that we have it
# analytically we
# don't
# have to solve for it each time. To check if this estimator
# is biased, we
# compute
# its expectation:
#
# $$
# \mathbb{E}\left(\hat{p}\right)
# =\frac{1}{n}\sum_{i=1}^n \mathbb{E}(x_i) =
# \frac{1}{n} n \mathbb{E}(x_i)
# $$
#
# by
# linearity of the expectation and where
#
# $$
# \mathbb{E}(x_i) = p
# $$
#
# Therefore,
# $$
# \mathbb{E}\left(\hat{p}\right) =p
# $$
#
# This means that the estimator is
# *unbiased*. Similarly,
#
# $$
# \mathbb{E}\left(\hat{p}^2\right) = \frac{1}{n^2}
# \mathbb{E}\left[\left(
# \sum_{i=1}^n x_i \right)^2 \right]
# $$
#
# and where
#
# $$
# \mathbb{E}\left(x_i^2\right) =p
# $$
#
# and by the independence assumption,
#
# $$
# \mathbb{E}\left(x_i x_j\right) =\mathbb{E}(x_i)\mathbb{E}(x_j) =p^2
# $$
#
# Thus,
# $$
# \mathbb{E}\left(\hat{p}^2\right) =\left(\frac{1}{n^2}\right) n \left[
# p+(n-1)p^2 \right]
# $$
#
# So, the variance of the estimator, $\hat{p}$, is the
# following:
#
# $$
# \mathbb{V}(\hat{p}) = \mathbb{E}\left(\hat{p}^2\right)-
# \mathbb{E}\left(\hat{p}\right)^2 = \frac{p(1-p)}{n}
# $$
#
# Note that the $n$ in
# the denominator means that the variance
# asymptotically
# goes to zero as $n$
# increases (i.e., we consider more and
# more samples). This is
# good news because
# it means that more and
# more coin flips lead to a better
# estimate of the
# underlying $p$.
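#
# Here is a quick numerical check of this $1/n$ scaling, assuming $p=1/2$ and
# 2000 replications per sample size.
# In[ ]:
from scipy.stats import bernoulli
for n in [10,100,1000]:
    phat = bernoulli(0.5).rvs((n,2000)).mean(axis=0) # 2000 estimates of p
    print(n, phat.var(), 0.25/n)                     # empirical vs p(1-p)/n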
#
# Unfortunately, this formula for the variance is
# practically
# useless because we
# need $p$ to compute it and $p$ is the parameter
# we are trying
# to estimate in
# the first place! However, this is where the *plug-in*
# principle [^invariance-property]
# saves the day. It turns out in this
# situation, you can
# simply substitute the maximum likelihood estimator,
# $\hat{p}$, for the $p$ in
# the above equation to obtain the asymptotic variance
# for $\mathbb{V}(\hat{p})$.
# The fact that this works is guaranteed by the
# asymptotic theory of maximum
# likelihood estimators.
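#
# As a sketch of the plug-in principle, using the `xs` samples and `p_true`
# from the earlier simulation,
# In[ ]:
phat = np.mean(xs)          # maximum likelihood estimate from the data
n = len(xs)
print(phat*(1-phat)/n)      # plug-in estimate of the variance
print(p_true*(1-p_true)/n)  # true variance, unavailable in practice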
#
# [^invariance-property]:
# This is also known as the
# *invariance property*
# of maximum likelihood
# estimators. It basically states that
# the
# maximum likelihood estimator of any
# function, say, $h(\theta)$, is
# the same
# $h$ with the maximum likelihood
# estimator for $\theta$ substituted
# in for
# $\theta$; namely, $h(\theta_{ML})$.
# Nevertheless, looking at
# $\mathbb{V}(\hat{p})$, we can immediately notice that
# if $p=0$, then there is
# no estimator variance because the outcomes are
# guaranteed to be tails. Also, for
# any $n$, the maximum of this variance
# happens
# at $p=1/2$. This is our worst
# case scenario and the only way to
# compensate is
# with larger $n$.
#
# All we have
# computed is the mean and variance of the
# estimator. In general,
# this is
# insufficient to characterize the underlying
# probability density of
# $\hat{p}$,
# except if we somehow knew that $\hat{p}$ were
# normally distributed.
# This is
# where the powerful *Central Limit Theorem* we
# discussed in the section
# [ch:stats:sec:limit](#ch:stats:sec:limit) comes in. The
# form of the estimator,
# which is just a
# sample mean, implies that we can apply
# this theorem and conclude
# that $\hat{p}$
# is asymptotically normally distributed.
# However, it doesn't
# quantify how many
# samples $n$ we need. In our simulation
# this is no problem
# because we can
# generate as much data as we like, but in the
# real world, with a
# costly
# experiment, each sample may be precious [^edgeworth].
# In the following,
# we won't apply the Central Limit Theorem and instead proceed
# analytically.
# [^edgeworth]: It turns out that the central limit theorem
# augmented with an
# Edgeworth expansion tells us that convergence is regulated by
# the skewness
# of
# the distribution
# [[feller1950introduction]](#feller1950introduction). In other
# words, the
# more
# symmetric the distribution, the faster it converges to the
# normal
# distribution
# according to the central limit theorem.
#
# ### Probability Density for the Estimator
#
# To write out the full density for $\hat{p}$, we first
# have to ask
# what is
# the probability that the estimator will equal a specific
# value, and then tally up
# all the ways that could happen with their corresponding
# probabilities.
# For
# example, what is the probability that
#
# $$
# \hat{p} =
# \frac{1}{n}\sum_{i=1}^n x_i = 0
# $$
#
# This can only happen one way: when $x_i=0
# \hspace{0.5em} \forall i$. The
# probability of this happening can be computed
# from the density
#
# $$
# f(\mathbf{x},p)= \prod_{i=1}^n \left(p^{x_i} (1-p)^{1-x_i}
# \right)
# $$
#
# $$
# f\left(\sum_{i=1}^n x_i = 0,p\right)= \left(1-p\right)^n
# $$
# Likewise, if $\lbrace x_i \rbrace$ has only one nonzero element, then
#
# $$
# f\left(\sum_{i=1}^n x_i = 1,p\right)= n p \left(1-p\right)^{n-1}
# $$
# where the $n$ comes from the $n$ ways to pick one element
# from the $n$ elements
# $x_i$. Continuing this way, we can construct the
# entire density as
#
# $$
# f\left(\sum_{i=1}^n x_i = k,p\right)= \binom{n}{k} p^k (1-p)^{n-k}
# $$
#
# where
# the first term on the right is the binomial coefficient of $n$ things
# taken $k$
# at a time. This is the binomial distribution and it's not the
# density
# for
# $\hat{p}$, but rather for $n\hat{p}$. We'll leave this as-is
# because it's
# easier
# to work with below. We just have to remember to keep
# track of the $n$
# factor.
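#
# To check that $n\hat{p}$ indeed follows this binomial distribution, here is
# a small sketch assuming $n=10$, $p=0.3$, and 5000 simulated experiments.
# In[ ]:
from scipy.stats import binom, bernoulli
n = 10
counts = bernoulli(0.3).rvs((n,5000)).sum(axis=0)  # n*phat for each experiment
k = np.arange(n+1)
print(np.round([(counts==i).mean() for i in k],3)) # empirical frequencies
print(np.round(binom(n,0.3).pmf(k),3))             # should be close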
# **Confidence Intervals**
#
# Now that we have the full density for
# $\hat{p}$, we
# are ready to ask some
# meaningful questions. For example, what is
# the probability
# the estimator is within
# $\epsilon$ fraction of the true value
# of $p$?
#
# $$
# \mathbb{P}\left( \vert \hat{p}-p \vert \le \epsilon p \right)
# $$
#
# More
# concretely, we want to know how often the
# estimated $\hat{p}$ is trapped
# within
# $\epsilon$ of the actual value. That is,
# suppose we ran the experiment
# 1000
# times to generate 1000 different estimates
# of $\hat{p}$. What percentage of
# the
# 1000 so-computed values are trapped within
# $\epsilon$ of the underlying
# value?
# Rewriting the above equation as the
# following,
#
# $$
# \mathbb{P}\left(p-\epsilon p
# < \hat{p} < p + \epsilon p \right) =
# \mathbb{P}\left( n p - n \epsilon p <
# \sum_{i=1}^n x_i < n p + n \epsilon p
# \right)
# $$
#
# Let's plug in some live
# numbers here for our worst case
# scenario (i.e., highest
# variance scenario) where
# $p=1/2$. Then, if
# $\epsilon = 1/100$, we have
#
# $$
# \mathbb{P}\left( \frac{99
# n}{100} < \sum_{i=1}^n x_i < \frac{101 n}{100}
# \right)
# $$
#
# Since the sum is
# integer-valued, we need $n> 100$ to even compute this.
# Thus,
# if $n=101$ we have,
# $$
# \begin{eqnarray*}
# \mathbb{P}\left(\frac{9999}{200} < \sum_{i=1}^{101} x_i < \frac{10201}{200} \right)
# &=& f\left(\sum_{i=1}^{101} x_i = 50,p\right) \\\
# &=& \binom{101}{50} (1/2)^{50} (1-1/2)^{101-50} \approx 0.079
# \end{eqnarray*}
# $$
#
# This means that in the worst-case scenario for $p=1/2$, given $n=101$
# trials,
# we will only get within 1\% of the actual $p=1/2$ about 8\% of the
# time.
# If you
# feel disappointed, it is because you've been paying attention.
# What if
# the coin
# was really heavy and it was hard work to repeat this 101 times?
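#
# As a quick check of the probability computed above,
# In[ ]:
from scipy.stats import binom
print(binom(101,0.5).pmf(50)) # approx 0.079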
#
# Let's
# come at
# this another way: given I could only flip the coin 100
# times, how close
# could I
# come to the true underlying value with high
# probability (say, 95\%)? In
# this
# case, instead of picking a value for
# $\epsilon$, we are solving for
# $\epsilon$.
# Plugging in gives,
#
# $$
# \mathbb{P}\left(50 - 50\epsilon <
# \sum_{i=1}^{100} x_i < 50 + 50 \epsilon
# \right) = 0.95
# $$
#
# which we have to
# solve for $\epsilon$. Fortunately, all the tools we
# need to
# solve for this are
# already in Scipy.
# In[7]:
from scipy.stats import binom
# n=100, p = 0.5, distribution of the estimator phat
b=binom(100,.5)
# sum the probability mass in a window around the mean
g = lambda i:b.pmf(np.arange(-i,i)+50).sum()
print (g(10)) # approx 0.95
# In[8]:
get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.pylab import subplots, arange
fig,ax= subplots()
fig.set_size_inches((10,5))
# here is the density of the sum of x_i
_=ax.stem(arange(0,101),b.pmf(arange(0,101)),
linefmt='k-', markerfmt='ko',use_line_collection=True)
_=ax.vlines( [50+10,50-10],0 ,ax.get_ylim()[1] ,color='k',lw=3.)
_=ax.axis(xmin=30,xmax=70)
_=ax.tick_params(labelsize=18)
fig.savefig('fig-statistics/Maximum_likelihood_20_2.png')
fig.tight_layout()
#
# <!-- fig:Maximum_likelihood_20_2 -->
# Probability mass function for $\hat{p}$. The two vertical lines form the
# confidence interval.
#
#
# The two vertical lines in
# [Figure](#fig:Maximum_likelihood_20_2)
# show how far out from the mean we have to
# go to accumulate 95\% of the
# probability. Now, we can solve this as
#
# $$
# 50+50\epsilon=60
# $$
#
# which makes $\epsilon=1/5$ or 20\%. So, flipping 100 times
# means I can
# only get
# within 20\% of the real $p$ 95\% of the time in the worst
# case
# scenario (i.e.,
# $p=1/2$). The following code verifies the situation.
# In[9]:
from scipy.stats import bernoulli
b=bernoulli(0.5) # coin distribution
xs = b.rvs(100) # flip it 100 times
phat = np.mean(xs) # estimated p
print (abs(phat-0.5) < 0.5*0.20) # is phat within the interval?
# Let's keep doing this and see if we can get within this interval 95\% of
# the
# time.
# In[10]:
out=[]
b=bernoulli(0.5) # coin distribution
for i in range(500): # number of tries
    xs = b.rvs(100) # flip it 100 times
    phat = np.mean(xs) # estimated p
    out.append(abs(phat-0.5) < 0.5*0.20 ) # within 20% ?
# percentage of tries w/in 20% interval
print (100*np.mean(out))
# Well, that seems to work! Now we have a way to get at the quality of
# the
# estimator, $\hat{p}$.
#
# **Maximum Likelihood Estimator Without Calculus**
#
# The
# prior example showed how we can use calculus to compute the maximum
# likelihood
# estimator. It's important to emphasize that the maximum likelihood
# principle
# does *not* depend on calculus and extends to more general situations
# where
# calculus is impossible. For example, let $X$ be uniformly distributed in
# the
# interval $[0,\theta]$. Given $n$ measurements of $X$, the likelihood
# function
# is the following:
#
# $$
# L(\theta) = \prod_{i=1}^n \frac{1}{\theta} =
# \frac{1}{\theta^n}
# $$
#
# where each $x_i \in [0,\theta]$. Note that the slope of
# this function
# is not
# zero anywhere so the usual calculus approach is not going
# to work here.
# Because
# the likelihood is the product of the individual uniform
# densities, if
# any of the
# $x_i$ values were outside of the proposed $[0,\theta]$
# interval,
# then the
# likelihood would go to zero, because the uniform density is
# zero
# outside of the
# $[0,\theta]$ interval. This is no good for maximization. Thus,
# observing
# that the
# likelihood function is strictly decreasing with increasing
# $\theta$,
# we conclude
# that the value for $\theta$ that maximizes the likelihood
# is the
# maximum of the
# $x_i$ values. To summarize, the maximum likelihood
# estimator is
# the following:
#
# $$
# \theta_{ML} = \max_i x_i
# $$
#
# As always, we want
# the distribution of this estimator to judge its
# performance.
# In this case, this
# is pretty straightforward. The cumulative
# distribution function
# for the $\max$
# function is the following:
#
# $$
# \mathbb{P} \left( \hat{\theta}_{ML} < v \right) =
# \mathbb{P}( x_1 \leq v
# \wedge x_2 \leq v \ldots \wedge x_n \leq v)
# $$
#
# and
# since all the $x_i$ are uniformly distributed in $[0,\theta]$, we have
#
# $$
# \mathbb{P} \left( \hat{\theta}_{ML} < v \right) =
# \left(\frac{v}{\theta}\right)^n
# $$
#
# So, the probability density function is
# then,
#
# $$
# f_{\hat{\theta}_{ML}}(\theta_{ML}) = n \theta_{ML}^{ n-1 } \theta^{
# -n }
# $$
#
# Then, we can compute $\mathbb{E}(\theta_{ML}) = (\theta n)/(n+1)$
# with
# corresponding variance as $\mathbb{V}(\theta_{ML}) = (\theta^2
# n)/(n+1)^2/(n+2)$.
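#
# We can verify these moments symbolically with Sympy, as in the following
# sketch.
# In[ ]:
import sympy
v,theta,n=sympy.symbols('v theta n',positive=True)
f = n*v**(n-1)/theta**n                  # density derived above
E = sympy.integrate(v*f,(v,0,theta))     # mean: n*theta/(n+1)
E2 = sympy.integrate(v**2*f,(v,0,theta)) # second moment
print(E)
print(sympy.simplify(E2-E**2))           # variance: n*theta**2/((n+1)**2*(n+2))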
#
# For a quick sanity check, we can write the following
# simulation for $\theta =1$
# as in the following:
# In[11]:
from scipy import stats
rv = stats.uniform(0,1) # define uniform random variable
mle=rv.rvs((100,500)).max(0) # max along axis 0 (the 100-sample dimension)
print (np.mean(mle)) # approx n/(n+1) = 100/101 ~= 0.99
print (np.var(mle)) # approx n/(n+1)**2/(n+2) ~= 9.61E-5
# **Programming Tip.**
#
# The `max(0)` suffix on the `mle` computation takes the
# maximum of the so-computed array along the row (`axis=0`)
# dimension.
#
#
#
# You can
# also plot `hist(mle)` to see the histogram of the simulated
# maximum likelihood
# estimates and match it up against the probability density
# function we derived
# above.
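#
# Here is a sketch of that comparison, assuming $\theta=1$ and $n=100$ as in
# the previous simulation.
# In[ ]:
from matplotlib.pylab import subplots
fig,ax = subplots()
_=ax.hist(mle,bins=20,density=True,color='gray') # simulated estimates
v = np.linspace(0.9,1,100)
_=ax.plot(v,100*v**99,'k-',lw=3) # derived density n*v**(n-1) for n=100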
#
#
# In this section, we explored the concept of maximum
# likelihood
# estimation using a coin flipping experiment both analytically and
# numerically
# with the scientific Python stack. We also explored the case when
# calculus is not
# workable for maximum likelihood estimation. There are two key
# points to
# remember. First, maximum likelihood estimation produces a function of
# the data
# that is itself a random variable, with its own probability
# distribution. We can
# get at the quality of the so-derived estimators by
# examining the confidence
# intervals around the estimated values using the
# probability distributions
# associated with the estimators themselves.
# Second, maximum likelihood
# estimation applies even in situations
# where using basic calculus is not
# applicable [[wasserman2004all]](#wasserman2004all).
#
#
# ## Delta Method
#
#
# Sometimes we want to characterize the distribution
# of a *function* of a random
# variable. In order to extend and generalize the
# Central Limit Theorem in this
# way, we need the Taylor series expansion. Recall
# that the Taylor series
# expansion is an approximation of a function of the
# following form,
#
# $$
# T_r(x) =\sum_{i=0}^r \frac{g^{(i)}(a)}{i!}(x-a)^i
# $$
#
# this
# basically says that a function $g$ can be adequately
# approximated about a
# point
# $a$ using a polynomial based on its derivatives
# evaluated at $a$. Before
# we
# state the general theorem, let's examine
# an example to understand how the
# mechanics work.
#
# **Example.** Suppose that $X$ is a random variable with
# $\mathbb{E}(X)=\mu\neq 0$. Furthermore, suppose we have a suitable
# function $g$
# and we want the distribution of $g(X)$. Applying the
# Taylor series expansion, we
# obtain the following,
#
# $$
# g(X) \approx g(\mu)+ g^{\prime}(\mu)(X-\mu)
# $$
#
# If we
# use $g(X)$ as an estimator for $g(\mu)$, then we can say that
# we
# approximately
# have the following
#
# $$
# \begin{align*}
# \mathbb{E}(g(X)) &=g(\mu) \\\
# \mathbb{V}(g(X))
# &=(g^{\prime}(\mu))^2 \mathbb{V}(X) \\\
# \end{align*}
# $$
# Concretely, suppose we want to estimate the odds, $\frac{p}{1-p}$.
# For example,
# if $p=2/3$, then we say that the odds is `2:1`, meaning that the one
# outcome is twice as likely as the other outcome.
# Thus, we have
# $g(p)=\frac{p}{1-p}$ and we want to find
# $\mathbb{V}(g(\hat{p}))$. In our
# coin-flipping problem, we have the
# estimator $\hat{p}=\frac{1}{n}\sum X_k$ from the
# Bernoulli-distributed data
# $X_k$ individual coin-flips. Thus,
#
# $$
# \begin{align*}
# \mathbb{E}(\hat{p}) &= p \\\
# \mathbb{V}(\hat{p}) &=
# \frac{p(1-p)}{n} \\\
# \end{align*}
# $$
#
# Now, $g^\prime(p)=1/(1-p)^2$, so we have,
#
# $$
# \begin{align*}
# \mathbb{V}(g(\hat{p}))&=(g^\prime(p))^2 \mathbb{V}(\hat{p})
# \\\
# &=\left(\frac{1}{(1-p)^2}\right)^2 \frac{p(1-p)}{n}
# \\\
# &=
# \frac{p}{n(1-p)^3} \\\
# \end{align*}
# $$
#
# which is an approximation of the
# variance of the estimator
# $g(\hat{p})$. Let's
# simulate this and see how it
# agrees.
# In[12]:
from scipy import stats
# compute MLE estimates
d=stats.bernoulli(0.1).rvs((10,5000)).mean(0)
# avoid divide-by-zero
d=d[np.logical_not(np.isclose(d,1))]
# compute odds ratio
odds = d/(1-d)
print ('odds ratio=',np.mean(odds),'var=',np.var(odds))
# The first number above is the mean of the simulated odds
# ratio and the second
# is
# the variance of the estimate. According to
# the variance estimate above, we
# have
# $\mathbb{V}(g(1/10))\approx
# 0.0137$, which is not too bad for this
# approximation. Recall we want
# to estimate the odds from $\hat{p}$. The code
# above takes $5000$
# estimates of $\hat{p}$ to estimate $\mathbb{V}(g)$. The
# odds ratio
# for $p=1/10$ is $1/9\approx 0.111$.
#
# **Programming Tip.**
#
# The code
# above uses the `np.isclose` function to identify the ones from
# the simulation
# and the `np.logical_not` removes these elements from the
# data because the odds
# ratio has a zero in the denominator
# for these values.
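#
# For reference, the variance quoted above comes from plugging $p=1/10$ and
# $n=10$ into the formula derived earlier.
# In[ ]:
print(0.1/(10*(1-0.1)**3)) # approx 0.0137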
#
#
#
# Let's try this again
# with a probability of heads of `0.5` instead of
# `0.1`.
# In[13]:
from scipy import stats
d=stats.bernoulli(.5).rvs((10,5000)).mean(0)
d=d[np.logical_not(np.isclose(d,1))]
odds = d/(1-d) # compute odds ratio
print( 'odds ratio=',np.mean(odds),'var=',np.var(odds))
# The true odds ratio in this case is equal to one, which
# is not close to what was
# reported.
# approximation, we should have $\mathbb{V}(g)=0.4$,
# which does not
# look like what our simulation just reported. This is
# because the
# approximation is best when the odds ratio is
# nearly linear and worse otherwise
# (see [Figure](#fig:Maximum_likelihood_0001)).
#
# <!-- fig:Maximum_likelihood_0001 -->
# The odds ratio is close to linear for small values but becomes unbounded as
# $p$ approaches one. The delta method is more effective for small underlying
# values of $p$, where the linear approximation is better.
#
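#
# The code for [Figure](#fig:Maximum_likelihood_0001) is not included above;
# the following sketch reproduces its content by comparing the odds ratio to
# its linear approximation near $p=0$.
# In[ ]:
from matplotlib.pylab import subplots
fig,ax = subplots()
pl = np.linspace(0,0.9,100)
_=ax.plot(pl,pl/(1-pl),'k-',lw=3,label='$p/(1-p)$')
_=ax.plot(pl,pl,'k--',label='linear near $p=0$')
_=ax.legend(loc=0)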