pandas has quickly become almost a standard for working with structured data in Python. It often makes code much clearer to read, and it also offers powerful tools for simple import and export of data.
statsmodels is an advanced package for statistical modeling with Python. Here we will only touch the surface of its extensive functionality. A more extensive introduction is available under http://nbviewer.ipython.org/gist/vincentarelbundock/3485014
Author : Thomas Haslwanter, Feb-2017
# Here I import numpy explicitly, so as to make clear where each function comes from
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as sm
import sys
# "urlopen" is in a different locations in Python2 and Python3
if sys.version_info[0] == 3:
from urllib.request import urlopen
else:
from urllib import urlopen
# Show plots inline
%matplotlib inline
# To get reproducable values, I provide a seed value
np.random.seed(987654321)
# Generate a noisy line
x = np.arange(100)
y = 0.5*x - 20 + np.random.randn(len(x))
df = pd.DataFrame({'x':x, 'y':y})
# Fit a linear model ...
# Note the formula language used to denote the relationship between "x" and "y"
model = sm.ols('y~x', data=df).fit()
# ... and print an extensive summary of the fit results and model properties
print((model.summary()))
OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.995 Model: OLS Adj. R-squared: 0.995 Method: Least Squares F-statistic: 1.819e+04 Date: Sat, 04 Feb 2017 Prob (F-statistic): 4.31e-113 Time: 14:21:44 Log-Likelihood: -147.60 No. Observations: 100 AIC: 299.2 Df Residuals: 98 BIC: 304.4 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -19.9742 0.212 -94.083 0.000 -20.396 -19.553 x 0.4996 0.004 134.857 0.000 0.492 0.507 ============================================================================== Omnibus: 1.158 Durbin-Watson: 2.037 Prob(Omnibus): 0.560 Jarque-Bera (JB): 1.038 Skew: -0.030 Prob(JB): 0.595 Kurtosis: 2.504 Cond. No. 114. ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Here I just show how to bring data into the pandas format, and how to use its object oriented notation:
# Get the data from the web
inFile = 'altman_94.txt'
url_base = 'https://raw.githubusercontent.com/thomas-haslwanter/statsintro_python/master/ipynb/Data/data_altman/'
url = url_base + inFile
data = np.genfromtxt(urlopen(url), delimiter=',')
# Bring them into pandas format
lean = pd.Series(data[data[:,1]==1,0])
obese = pd.Series(data[data[:,1]==0,0])
df = pd.DataFrame({'lean':lean, 'obese':obese})
# Calculate the mean, ...
print(df.mean())
# ..., show a boxplot, ...
# Note that by using data-frames, the display can get automatically labelled correctly
# - also for the next plot
df.boxplot(return_type='axes')
# ... and find the p-value:
t, pVal = stats.ttest_ind(lean, obese)
if pVal < 0.05:
print('There is a significant difference: p = {0}'.format(pVal))
lean 8.066154 obese 10.297778 dtype: float64 There is a significant difference: p = 0.0007989982111700593