Some or most of you have probably taken undergraduate- or graduate-level statistics courses. Unfortunately, the curricula of most introductory statistics courses focus on conducting statistical hypothesis tests as the primary means of inference: t-tests, chi-squared tests, analysis of variance, etc. Such tests seek to determine whether groups are "significantly different" or effects are "statistically significant", a concept that is poorly understood, and hence, often misused by practitioners. Even when interpreted correctly, statistical significance (as characterized by a small p-value) is a questionable goal for statistical inference, as it is not a measure of evidence in any statistical sense.
A far more powerful approach to statistical analysis involves building flexible models with the overarching aim of estimating quantities of interest. This section of the tutorial illustrates how to use Python to build statistical models of low to moderate difficulty from scratch, and use them to extract estimates and associated measures of uncertainty. These estimates can then be passed on to individuals with domain expertise who can then appraise them for "real-world" significance.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)
A recurring statistical problem is finding estimates of the relevant parameters of the distribution that best represents our data.
In parametric inference, we specify a priori a suitable distribution, then choose the parameters that best fit the data.
x = np.array([ 1.00201077, 1.58251956, 0.94515919, 6.48778002, 1.47764604,
5.18847071, 4.21988095, 2.85971522, 3.40044437, 3.74907745,
1.18065796, 3.74748775, 3.27328568, 3.19374927, 8.0726155 ,
0.90326139, 2.34460034, 2.14199217, 3.27446744, 3.58872357,
1.20611533, 2.16594393, 5.56610242, 4.66479977, 2.3573932 ])
_ = plt.hist(x, bins=10)
We start with the problem of finding values for the parameters that provide the best fit between the model and the data, called point estimates. First, we need to define what we mean by ‘best fit’. There are two commonly used criteria:

* *Method of moments* chooses the parameters so that the sample moments (for example, the sample mean and variance) match the theoretical moments of the chosen distribution.
* *Maximum likelihood* chooses the parameters to maximize the likelihood, which measures how probable our observed sample is under the model.

Either way, we describe the candidate distribution by its probability mass or density function.
Probability Mass Function:
For discrete $X$,
$$Pr(X=x) = f(x|\theta)$$

*e.g. Poisson distribution*
The Poisson distribution models unbounded counts:

$$Pr(X=x) = \frac{e^{-\lambda}\lambda^x}{x!}$$
Probability Density Function:
For continuous $X$,
$$Pr(x \le X \le x + dx) = f(x|\theta)dx \, \text{ as } \, dx \rightarrow 0$$

*e.g. normal distribution*
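For instance (a quick sketch that is not part of the original notebook), scipy.stats exposes both kinds of function directly: poisson.pmf returns point probabilities for a discrete variable, while norm.pdf returns densities for a continuous one:

from scipy.stats import poisson, norm
# Pr(X=3) for a Poisson with rate 2, and the density of a standard normal at x=0.5
poisson.pmf(3, mu=2), norm.pdf(0.5, loc=0, scale=1)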
The dataset nashville_precip.txt
contains NOAA precipitation data for Nashville measured since 1871.
precip = pd.read_table("data/nashville_precip.txt", index_col=0,
na_values='NA', delim_whitespace=True)
precip.head()
       Jan   Feb   Mar    Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Year
1871  2.76  4.58  5.01   4.13  3.30  2.98  1.58  2.36  0.95  1.31  2.13  1.65
1872  2.32  2.11  3.14   5.91  3.09  5.17  6.10  1.65  4.50  1.58  2.25  2.38
1873  2.96  7.14  4.11   3.59  6.31  4.20  4.63  2.36  1.81  4.28  4.36  5.94
1874  5.22  9.23  5.36  11.84  1.49  2.87  2.65  3.52  3.12  2.63  6.12  4.19
1875  6.15  3.06  8.14   4.22  1.73  5.63  8.12  1.60  3.79  1.25  5.46  4.30

[5 rows x 12 columns]
_ = precip.hist(sharex=True, sharey=True, grid=False, figsize=(10,6))
plt.tight_layout()
The first step is recognizing what sort of distribution to fit our data to. A couple of observations: the data are continuous and strictly positive, and each month's histogram is skewed to the right.
The gamma distribution is often a good fit to aggregated rainfall data, and will be our candidate distribution in this case.
The *method of moments* simply assigns the empirical mean and variance to their theoretical counterparts, so that we can solve for the parameters.
So, for the gamma distribution, the mean and variance are:

$$\hat{\mu} = \bar{X} = \alpha\beta$$
$$\hat{\sigma}^2 = S^2 = \alpha\beta^2$$
from IPython.display import IFrame
IFrame('http://en.wikipedia.org/wiki/Gamma_distribution?useformat=mobile', width=600, height=350)
So, if we solve for these parameters, we can use a gamma distribution to describe our data:

$$\alpha = \frac{\bar{X}^2}{S^2}, \quad \beta = \frac{S^2}{\bar{X}}$$
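As a quick check (a sketch that is not part of the original notebook), we can apply these method-of-moments formulas to the small sample x defined above:

# Method of moments estimates for the small sample x defined earlier
x_mean = x.mean()
x_var = x.var(ddof=1)   # sample variance, matching pandas' default
alpha_hat = x_mean ** 2 / x_var
beta_hat = x_var / x_mean
alpha_hat, beta_hat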
Let's deal with the missing value in the October data.
precip.Oct.ix[1960:1970]
Year
1960    1.38
1961    1.12
1962    2.29
1963     NaN
1964    1.83
1965    0.57
1966    2.50
1967    1.57
1968    3.92
1969    2.01
1970    2.94
Name: Oct, dtype: float64
Given what we are trying to do, it is most sensible to fill in the missing value with the average of the available values.
precip.fillna(value={'Oct': precip.Oct.mean()}, inplace=True)
Now, let's calculate the sample moments of interest, the means and variances by month:
precip_mean = precip.mean()
precip_mean
Jan    4.523688
Feb    4.097801
Mar    4.977589
Apr    4.204468
May    4.325674
Jun    3.873475
Jul    3.895461
Aug    3.367305
Sep    3.377660
Oct    2.610500
Nov    3.685887
Dec    4.176241
dtype: float64
precip_var = precip.var()
precip_var
Jan    6.928862
Feb    5.516660
Mar    5.365444
Apr    4.117096
May    5.306409
Jun    5.033206
Jul    3.777012
Aug    3.779876
Sep    4.940099
Oct    2.741659
Nov    3.679274
Dec    5.418022
dtype: float64
We then use these moments to estimate $\alpha$ and $\beta$ for each month:
alpha_mom = precip_mean ** 2 / precip_var
beta_mom = precip_var / precip_mean
estimates = pd.DataFrame({'alpha_mom': alpha_mom, 'beta_mom': beta_mom})
estimates
     alpha_mom  beta_mom
Jan   2.953407  1.531684
Feb   3.043866  1.346249
Mar   4.617770  1.077920
Apr   4.293694  0.979219
May   3.526199  1.226724
Jun   2.980965  1.299403
Jul   4.017624  0.969593
Aug   2.999766  1.122522
Sep   2.309383  1.462581
Oct   2.485616  1.050243
Nov   3.692511  0.998206
Dec   3.219070  1.297344

[12 rows x 2 columns]
We can use the gamma.pdf function in scipy.stats.distributions to plot the distributions implied by the calculated alphas and betas. For example, here is January:
from scipy.stats import gamma
xvals = np.linspace(0, 10)
# beta is a scale parameter, so pass it as `scale` (the third positional argument of gamma.pdf is `loc`)
yvals = gamma.pdf(xvals, alpha_mom[0], scale=beta_mom[0])
precip.Jan.hist(normed=True, bins=20)
plt.plot(xvals, yvals)
[<matplotlib.lines.Line2D at 0x105d98fd0>]
Looping over all months, we can create a grid of plots for the distribution of rainfall, using the gamma distribution:
axs = precip.hist(normed=True, figsize=(12, 8), sharex=True, sharey=True, bins=15, grid=False)
for ax in axs.ravel():
# Get month
m = ax.get_title()
# Plot fitted distribution
x = np.linspace(*ax.get_xlim())
ax.plot(x, gamma.pdf(x, alpha_mom[m], scale=beta_mom[m]))
# Annotate with parameter estimates
label = 'alpha = {0:.2f}\nbeta = {1:.2f}'.format(alpha_mom[m], beta_mom[m])
ax.annotate(label, xy=(10, 0.2))
plt.tight_layout()
Maximum likelihood (ML) fitting is usually more work than the method of moments, but it is preferred as the resulting estimator is known to have good theoretical properties.
There is a ton of theory regarding ML. We will restrict ourselves to the mechanics here.
Say we have some data $y = y_1,y_2,\ldots,y_n$ that is distributed according to some distribution:

$$Pr(y_i | \theta)$$
Here, for example, is a Poisson distribution that describes the distribution of some discrete variables, typically counts:
y = np.random.poisson(5, size=100)
plt.hist(y, bins=int(np.sqrt(len(y))), normed=True)
plt.xlabel('y'); plt.ylabel('Pr(y)')
<matplotlib.text.Text at 0x1080809d0>
The product $\prod_{i=1}^n Pr(y_i | \theta)$ gives us a measure of how likely it is to observe the set of values $y_1,\ldots,y_n$ given the parameters $\theta$. Maximum likelihood fitting consists of choosing the appropriate function $l= Pr(Y|\theta)$ to maximize for a given set of observations. We call this function the likelihood function, because it is a measure of how likely the observations are if the model is true.
Given these data, how likely is this model?
In the above model, the data were drawn from a Poisson distribution with parameter $\lambda =5$.
$$L(y|\lambda=5) = \frac{e^{-5} 5^y}{y!}$$

So, for any given value of $y$, we can calculate its likelihood:
poisson_like = lambda x, lam: np.exp(-lam) * (lam**x) / (np.arange(x)+1).prod()
lam = 6
value = 10
poisson_like(value, lam)
0.041303093412337726
np.sum(poisson_like(yi, lam) for yi in y)
12.681664407577044
lam = 8
np.sum(poisson_like(yi, lam) for yi in y)
8.058131431782682
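Summing the raw likelihoods is a crude comparison; it is more common to compare the log-likelihood of the whole sample, which we could compute like this (a small sketch, not part of the original):

# Log-likelihood of the full sample under a candidate rate lam
log_like = lambda lam: np.sum(np.log([poisson_like(yi, lam) for yi in y]))
log_like(6), log_like(8)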
We can plot the likelihood function for any value of the parameter(s):
lambdas = np.linspace(0,15)
x = 5
plt.plot(lambdas, [poisson_like(x, l) for l in lambdas])
plt.xlabel('$\lambda$')
plt.ylabel('L($\lambda$|x={0})'.format(x));
How is the likelihood function different than the probability distribution (or mass) function? The likelihood is a function of the parameter(s) given the data, whereas the PDF returns the probability of data given a particular parameter value. Here is the PMF of the Poisson for $\lambda=5$.
lam = 5
xvals = np.arange(15)
yvals = [poisson_like(x, lam) for x in xvals]
plt.bar(xvals, yvals)
plt.xlabel('x')
plt.ylabel('Pr(X|$\lambda$=5)');
Why are we interested in the likelihood function?
A reasonable estimate of the true, unknown value for the parameter is one which maximizes the likelihood function. So, inference is reduced to an optimization problem.
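To make this concrete before returning to the rainfall data, here is a minimal sketch (not part of the original notebook) that maximizes the Poisson log-likelihood for the simulated counts y numerically; the result should be very close to the sample mean, which is the analytical MLE of $\lambda$:

from scipy import optimize

# Negative Poisson log-likelihood (terms that do not involve lam are dropped)
neg_log_like = lambda lam: -np.sum(y * np.log(lam) - lam)

opt = optimize.minimize_scalar(neg_log_like, bounds=(0.1, 20), method='bounded')
opt.x, y.mean()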
Going back to the rainfall data, if we are using a gamma distribution we need to maximize:
$$\begin{align}l(\alpha,\beta) &= \sum_{i=1}^n \log\left[\frac{\beta^{\alpha}}{\Gamma(\alpha)} x_i^{\alpha-1} e^{-\beta x_i}\right] \cr &= n\left[(\alpha-1)\overline{\log(x)} - \bar{x}\beta + \alpha\log(\beta) - \log\Gamma(\alpha)\right]\end{align}$$

(It's usually easier to work on the log scale.)
where $n = 2012 − 1871 = 141$ and the bar indicates an average over all i. We choose $\alpha$ and $\beta$ to maximize $l(\alpha,\beta)$.
Notice that $l$ is not finite if any $x$ is zero, because the log-likelihood involves $\log(x)$. We do not have any zeros in the data, but we did have a missing (NA) October value, which we dealt with above.
To find the maximum of any function, we typically take the derivative with respect to the variable to be maximized, set it to zero and solve for that variable.
$$\frac{\partial l(\alpha,\beta)}{\partial \beta} = n\left(\frac{\alpha}{\beta} - \bar{x}\right) = 0$$

This can be solved as $\beta = \alpha/\bar{x}$. However, plugging this into the derivative with respect to $\alpha$ yields:
$$\frac{\partial l(\alpha,\beta)}{\partial \alpha} = \log(\alpha) + \overline{\log(x)} - \log(\bar{x}) - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} = 0$$

This has no closed-form solution. We must use *numerical optimization*!
Numerical optimization algorithms take an initial "guess" at the solution, and iteratively improve the guess until it gets "close enough" to the answer.
Here, we will use the Newton-Raphson algorithm:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

which is available to us via SciPy:
from scipy.optimize import newton
Here is a graphical example of how Newton-Raphson converges on a solution, using an arbitrary function:
# some function
func = lambda x: 3./(1 + 400*np.exp(-2*x)) - 1
xvals = np.linspace(0, 6)
plt.plot(xvals, func(xvals))
plt.text(5.3, 2.1, '$f(x)$', fontsize=16)
# zero line
plt.plot([0,6], [0,0], 'k-')
# value at step n
plt.plot([4,4], [0,func(4)], 'k:')
plt.text(4, -.2, '$x_n$', fontsize=16)
# tangent line
tanline = lambda x: -0.858 + 0.626*x
plt.plot(xvals, tanline(xvals), 'r--')
# point at step n+1
xprime = 0.858/0.626
plt.plot([xprime, xprime], [tanline(xprime), func(xprime)], 'k:')
plt.text(xprime+.1, -.2, '$x_{n+1}$', fontsize=16)
<matplotlib.text.Text at 0x103db3a10>
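As a sanity check (not part of the original), we can hand this same function to newton and confirm that it converges to the analytical root, which solves $400 e^{-2x} = 2$, i.e. $x = \log(200)/2 \approx 2.65$:

# Newton-Raphson root of the example function, starting from x = 4;
# newton falls back to the secant method when fprime is not supplied
newton(func, 4)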
To apply the Newton-Raphson algorithm, we need the function whose root we are seeking (here, the first derivative of the log-likelihood with respect to $\alpha$) together with its derivative (the second derivative of the log-likelihood). In our case, these are:
from scipy.special import psi, polygamma
dlgamma = lambda m, log_mean, mean_log: np.log(m) - psi(m) - log_mean + mean_log
dl2gamma = lambda m, *args: 1./m - polygamma(1, m)
where log_mean and mean_log are $\log{\bar{x}}$ and $\overline{\log(x)}$, respectively. psi is the digamma function (the logarithmic derivative of the gamma function) and polygamma(1, m) is its derivative, the trigamma function; these arise when taking the first and second derivatives of $\log\Gamma(\alpha)$.
# Calculate statistics
log_mean = precip.mean().apply(np.log)
mean_log = precip.apply(np.log).mean()
Time to optimize!
# Alpha MLE for December
alpha_mle = newton(dlgamma, 2, dl2gamma, args=(log_mean[-1], mean_log[-1]))
alpha_mle
3.5189679152399647
And now plug this back into the solution for beta:
beta_mle = alpha_mle/precip.mean()[-1]
beta_mle
0.84261607548413797
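The same machinery extends to all twelve months; here is a minimal sketch (not part of the original) that loops the Newton-Raphson solver over the columns:

# MLE of alpha for each month, with beta (the rate) derived as alpha / mean
alpha_mle_all = pd.Series(
    {month: newton(dlgamma, 2, dl2gamma, args=(log_mean[month], mean_log[month]))
     for month in precip.columns}).reindex(precip.columns)
beta_mle_all = alpha_mle_all / precip.mean()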
We can compare the fit of the estimates derived from MLE to those from the method of moments:
dec = precip.Dec
dec.hist(normed=True, bins=10, grid=False)
x = np.linspace(0, dec.max())
# method of moments fit (beta_mom is a scale parameter)
plt.plot(x, gamma.pdf(x, alpha_mom[-1], scale=beta_mom[-1]), 'm-')
# MLE fit (beta_mle is a rate, so invert it to get SciPy's scale)
plt.plot(x, gamma.pdf(x, alpha_mle, scale=1./beta_mle), 'r--')
[<matplotlib.lines.Line2D at 0x1082c0e90>]
For some common distributions, SciPy includes methods for fitting via MLE:
from scipy.stats import gamma
ahat, _, bhat = gamma.fit(precip.Dec, floc=0)
ahat, 1./bhat
(3.5189679152399753, 0.84261607548414119)
Note that SciPy's gamma.fit method uses a shape, location, and scale parameterization of the gamma distribution, so the fitted scale has to be inverted (1/scale) to match the rate parameter $\beta$ that we estimated above.
In some instances, we may not be interested in the parameters of a particular distribution of data, but just a smoothed representation of the data at hand. In this case, we can estimate the distribution non-parametrically (i.e. making no assumptions about the form of the underlying distribution) using kernel density estimation.
# Some random data
y = np.random.random(15) * 10
y
array([ 3.72458625, 8.02504424, 2.5152199 , 1.15249687, 3.69592197, 4.81974995, 7.21932052, 4.78104845, 2.67680421, 9.90455954, 8.162251 , 7.12880674, 6.79736262, 5.98943549, 7.99250998])
from scipy.stats import norm
# Create an array of x-values
x = np.linspace(y.min()-1, y.max()+1, 100)
# Smoothing parameter
s = 0.4
# Calculate the kernels
kernels = np.transpose([norm.pdf(x, yi, s) for yi in y])
plt.plot(x, kernels, 'k:')
plt.plot(x, kernels.sum(1))
plt.plot(y, np.zeros(len(y)), 'ro', ms=10)
[<matplotlib.lines.Line2D at 0x1088d5150>]
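Note that the solid curve above is the sum of the kernels; to turn it into a properly normalized density estimate we would average them instead (a small sketch, assuming the kernels array defined above):

# Each kernel is itself a density, so their mean integrates to one
plt.plot(x, kernels.mean(1), 'b-')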
SciPy implements a Gaussian KDE that automatically chooses an appropriate bandwidth. Let's create a bi-modal distribution of data that is not easily summarized by a parametric distribution:
# Create a bi-modal distribution with a mixture of Normals.
x1 = np.random.normal(0, 2, 50)
x2 = np.random.normal(4, 1, 50)
# Append by row
x = np.r_[x1, x2]
plt.hist(x, bins=10, normed=True)
(array([ 0.00940678, 0.02822033, 0.05644066, 0.13169488, 0.09406777, 0.1222881 , 0.1222881 , 0.23516943, 0.08466099, 0.05644066]), array([-4.36396175, -3.30089839, -2.23783504, -1.17477168, -0.11170832, 0.95135503, 2.01441839, 3.07748174, 4.1405451 , 5.20360846, 6.26667181]), <a list of 10 Patch objects>)
from scipy.stats import kde
density = kde.gaussian_kde(x)
xgrid = np.linspace(x.min(), x.max(), 100)
plt.hist(x, bins=10, normed=True)
plt.plot(xgrid, density(xgrid), 'r-')
[<matplotlib.lines.Line2D at 0x1088eb9d0>]
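gaussian_kde chooses its bandwidth automatically (Scott's rule by default), but we can also set the bandwidth factor manually via the bw_method argument; a quick sketch (not part of the original, assuming a reasonably recent SciPy):

# Smaller bandwidth factors give a wigglier estimate; larger factors oversmooth
narrow = kde.gaussian_kde(x, bw_method=0.2)
wide = kde.gaussian_kde(x, bw_method=0.8)
plt.plot(xgrid, narrow(xgrid), 'g--')
plt.plot(xgrid, wide(xgrid), 'b:')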
A general, primary goal of many statistical data analysis tasks is to relate the influence of one variable on another. For example, we may wish to know how different medical interventions influence the incidence or duration of disease, or perhaps how a baseball player's performance varies as a function of age.
x = np.array([2.2, 4.3, 5.1, 5.8, 6.4, 8.0])
y = np.array([0.4, 10.1, 14.0, 10.9, 15.4, 18.5])
plt.plot(x,y,'ro')
[<matplotlib.lines.Line2D at 0x108a6b1d0>]
We can build a model to characterize the relationship between $X$ and $Y$, recognizing that additional factors other than $X$ (the ones we have measured or are interested in) may influence the response variable $Y$:

$$y_i = f(x_i) + \epsilon_i$$

where $f$ is some function, for example a linear function:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
and $\epsilon_i$ accounts for the difference between the observed response $y_i$ and its prediction from the model $\hat{y_i} = \beta_0 + \beta_1 x_i$. This is sometimes referred to as process uncertainty.
We would like to select $\beta_0, \beta_1$ so that the difference between the predictions and the observations is zero, but this is not usually possible. Instead, we choose a reasonable criterion: *the smallest sum of the squared differences between $\hat{y}$ and $y$*.
Squaring serves two purposes: (1) to prevent positive and negative values from cancelling each other out and (2) to strongly penalize large deviations. Whether the latter is a good thing or not depends on the goals of the analysis.
In other words, we will select the parameters that minimize the squared error of the model.
ss = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x) ** 2)
ss([0,1],x,y)
333.35000000000002
from scipy.optimize import fmin
b0,b1 = fmin(ss, [0,1], args=(x,y))
b0,b1
Optimization terminated successfully. Current function value: 21.375000 Iterations: 79 Function evaluations: 153
(-4.3500136038870876, 3.0000002915386412)
plt.plot(x, y, 'ro')
plt.plot([0,10], [b0, b0+b1*10])
[<matplotlib.lines.Line2D at 0x108a6ff50>]
plt.plot(x, y, 'ro')
plt.plot([0,10], [b0, b0+b1*10])
for xi, yi in zip(x,y):
plt.plot([xi]*2, [yi, b0+b1*xi], 'k:')
plt.xlim(2, 9); plt.ylim(0, 20)
(0, 20)
Minimizing the sum of squares is not the only criterion we can use; it is just a very popular (and successful) one. For example, we can try to minimize the sum of absolute differences:
sabs = lambda theta, x, y: np.sum(np.abs(y - theta[0] - theta[1]*x))
b0,b1 = fmin(sabs, [0,1], args=(x,y))
print b0,b1
plt.plot(x, y, 'ro')
plt.plot([0,10], [b0, b0+b1*10])
Optimization terminated successfully. Current function value: 10.162463 Iterations: 39 Function evaluations: 77 0.00157170444494 2.31231743181
[<matplotlib.lines.Line2D at 0x108a7a990>]
We are not restricted to a straight-line regression model; we can represent a curved relationship between our variables by introducing polynomial terms. For example, a quadratic model:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$$
ss2 = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x - theta[2]*(x**2)) ** 2)
b0,b1,b2 = fmin(ss2, [1,1,-1], args=(x,y))
print b0,b1,b2
plt.plot(x, y, 'ro')
xvals = np.linspace(0, 10, 100)
plt.plot(xvals, b0 + b1*xvals + b2*(xvals**2))
Optimization terminated successfully. Current function value: 14.001110 Iterations: 198 Function evaluations: 372 -11.0748186039 6.0576975948 -0.302681057088
[<matplotlib.lines.Line2D at 0x108a3c290>]
Although a polynomial model characterizes a nonlinear relationship between the variables, it is a linear problem in terms of estimation. That is, the regression model $f(y | x)$ is linear in the parameters.
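Because the problem is linear in the parameters, we could get the same quadratic fit from an ordinary linear least-squares routine; for example (a sketch, not part of the original), np.polyfit returns the coefficients from highest degree to lowest:

# Quadratic fit by linear least squares; compare with b0, b1, b2 from fmin above
np.polyfit(x, y, deg=2)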
For some data, it may be reasonable to consider polynomials of order > 2. For example, consider the relationship between the number of home runs a baseball player hits and the number of runs batted in (RBI) they accumulate; clearly, the relationship is positive, but we may not expect a linear relationship.
ss3 = lambda theta, x, y: np.sum((y - theta[0] - theta[1]*x - theta[2]*(x**2) - theta[3]*(x**3)) ** 2)
bb = pd.read_csv("data/baseball.csv", index_col=0)
plt.plot(bb.hr, bb.rbi, 'r.')
b0,b1,b2,b3 = fmin(ss3, [0,1,-1,0], args=(bb.hr, bb.rbi))
xvals = np.arange(40)
plt.plot(xvals, b0 + b1*xvals + b2*(xvals**2) + b3*(xvals**3))
Optimization terminated successfully. Current function value: 4274.128398 Iterations: 230 Function evaluations: 407
[<matplotlib.lines.Line2D at 0x108b77690>]
Of course, we need not fit least squares models by hand. The statsmodels
package implements least squares models that allow for model fitting in a single line:
import statsmodels.api as sm
straight_line = sm.OLS(y, sm.add_constant(x)).fit()
straight_line.summary()
|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.891 |
| Model: | OLS | Adj. R-squared: | 0.864 |
| Method: | Least Squares | F-statistic: | 32.67 |
| Date: | Sun, 23 Feb 2014 | Prob (F-statistic): | 0.00463 |
| Time: | 10:40:26 | Log-Likelihood: | -12.325 |
| No. Observations: | 6 | AIC: | 28.65 |
| Df Residuals: | 4 | BIC: | 28.23 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | -4.3500 | 2.937 | -1.481 | 0.213 | -12.505 3.805 |
| x1 | 3.0000 | 0.525 | 5.716 | 0.005 | 1.543 4.457 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | nan | Durbin-Watson: | 2.387 |
| Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.570 |
| Skew: | 0.359 | Prob(JB): | 0.752 |
| Kurtosis: | 1.671 | Cond. No. | 17.9 |
from statsmodels.formula.api import ols as OLS
data = pd.DataFrame(dict(x=x, y=y))
cubic_fit = OLS('y ~ x + I(x**2)', data).fit()
cubic_fit.summary()
|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.929 |
| Model: | OLS | Adj. R-squared: | 0.881 |
| Method: | Least Squares | F-statistic: | 19.50 |
| Date: | Sun, 23 Feb 2014 | Prob (F-statistic): | 0.0191 |
| Time: | 10:40:27 | Log-Likelihood: | -11.056 |
| No. Observations: | 6 | AIC: | 28.11 |
| Df Residuals: | 3 | BIC: | 27.49 |
| Df Model: | 2 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -11.0748 | 6.013 | -1.842 | 0.163 | -30.211 8.062 |
| x | 6.0577 | 2.482 | 2.441 | 0.092 | -1.840 13.955 |
| I(x ** 2) | -0.3027 | 0.241 | -1.257 | 0.298 | -1.069 0.464 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | nan | Durbin-Watson: | 2.711 |
| Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.655 |
| Skew: | -0.809 | Prob(JB): | 0.721 |
| Kurtosis: | 2.961 | Cond. No. | 270. |
How do we choose among competing models for a given dataset? More parameters are not necessarily better, from the standpoint of model fit. For example, fitting a 9-th order polynomial to the sample data from the above example certainly results in an overfit.
def calc_poly(params, data):
x = np.c_[[data**i for i in range(len(params))]]
return np.dot(params, x)
ssp = lambda theta, x, y: np.sum((y - calc_poly(theta, x)) ** 2)
betas = fmin(ssp, np.zeros(10), args=(x,y), maxiter=1e6)
plt.plot(x, y, 'ro')
xvals = np.linspace(0, max(x), 100)
plt.plot(xvals, calc_poly(betas, xvals))
Optimization terminated successfully. Current function value: 7.015262 Iterations: 663 Function evaluations: 983
[<matplotlib.lines.Line2D at 0x108caa450>]
One approach is to use an information-theoretic criterion to select the most appropriate model. For example Akaike's Information Criterion (AIC) balances the fit of the model (in terms of the likelihood) with the number of parameters required to achieve that fit. We can easily calculate AIC as:
$$AIC = n \log(\hat{\sigma}^2) + 2p$$

where $p$ is the number of parameters in the model and $\hat{\sigma}^2 = RSS/(n-p-1)$.
Notice that as the number of parameters increase, the residual sum of squares goes down, but the second term (a penalty) increases.
To apply AIC to model selection, we choose the model that has the lowest AIC value.
n = len(x)
aic = lambda rss, p, n: n * np.log(rss/(n-p-1)) + 2*p
RSS1 = ss(fmin(ss, [0,1], args=(x,y)), x, y)
RSS2 = ss2(fmin(ss2, [1,1,-1], args=(x,y)), x, y)
print aic(RSS1, 2, n), aic(RSS2, 3, n)
Optimization terminated successfully. Current function value: 21.375000 Iterations: 79 Function evaluations: 153 Optimization terminated successfully. Current function value: 14.001110 Iterations: 198 Function evaluations: 372 15.7816583572 17.6759368019
Hence, we would select the 2-parameter (linear) model.
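We could extend the comparison to the cubic model from the baseball example in the same way (a sketch, not part of the original); with $p = 4$ parameters, the penalty term grows while the residual degrees of freedom shrink to one:

# AIC for a cubic fit to the same six (x, y) points, reusing ss3 from the baseball example
RSS3 = ss3(fmin(ss3, [0, 1, -1, 0], args=(x, y)), x, y)
aic(RSS3, 4, n)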
Fitting a line to the relationship between two variables using the least squares approach is sensible when the variable we are trying to predict is continuous, but what about when the data are dichotomous?
Let's consider the problem of predicting survival in the Titanic disaster, based on our available information. For example, let's say that we want to predict survival as a function of the fare paid for the journey.
titanic = pd.read_excel("data/titanic.xls", "titanic")
titanic.name
0                            Allen, Miss. Elisabeth Walton
1                           Allison, Master. Hudson Trevor
2                             Allison, Miss. Helen Loraine
3                     Allison, Mr. Hudson Joshua Creighton
4          Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5                                      Anderson, Mr. Harry
6                        Andrews, Miss. Kornelia Theodosia
7                                   Andrews, Mr. Thomas Jr
8            Appleton, Mrs. Edward Dale (Charlotte Lamson)
9                                  Artagaveytia, Mr. Ramon
...
1298                             Wittevrongel, Mr. Camille
1299                                   Yasbeck, Mr. Antoni
1300               Yasbeck, Mrs. Antoni (Selini Alexander)
1301                                  Youseff, Mr. Gerious
1302                                     Yousif, Mr. Wazli
1303                                 Yousseff, Mr. Gerious
1304                                  Zabour, Miss. Hileni
1305                                 Zabour, Miss. Thamine
1306                             Zakarian, Mr. Mapriededer
1307                                   Zakarian, Mr. Ortin
1308                                    Zimmerman, Mr. Leo
Name: name, Length: 1309, dtype: object
jitter = np.random.normal(scale=0.02, size=len(titanic))
plt.scatter(np.log(titanic.fare), titanic.survived + jitter, alpha=0.3)
plt.yticks([0,1])
plt.ylabel("survived")
plt.xlabel("log(fare)")
<matplotlib.text.Text at 0x108f06050>