# Analysis of multivariate data¶

• Regression line
• Correlation

Author: Thomas Haslwanter, Date: Jun-2017

In :
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
from numpy.linalg import lstsq
from urllib.request import urlopen
import statsmodels.api as sm


## Regression Line¶

Fit a line, using the powerful "ordinary least squares" method of statsmodels.

Data from 24 type 1 diabetic patients, relating Fasting blood glucose (mmol/l) to mean circumferential shortening velocity (%/sec), derived from echocardiography.

In :
# Download the raw data file from the statsintro_python repository
url_base = ('https://raw.githubusercontent.com/thomas-haslwanter/'
            'statsintro_python/master/ipynb/Data/data_altman/')
inFile = 'altman_11_6.txt'
url = url_base + inFile

# Parse the comma-separated values straight from the HTTP response
data = np.genfromtxt(urlopen(url), delimiter=',')


### Solve equations "by hand" ...¶

In :
# First remove every row that contains a "nan"
data = data[~np.isnan(data).any(axis=1)]

x, y = data[:, 0], data[:, 1]
plt.plot(x, y, '*')

# Create the design matrix: a column of ones (intercept) next to x (slope),
# so the model is y = params[0] + params[1]*x.
# NOTE: this line was missing in the original cell, which made the
# lstsq call below fail with a NameError on "Xmat".
Xmat = np.column_stack((np.ones_like(x), x))

# Calculate the parameters with an ordinary least-squares solve.
# rcond=None selects numpy's current default cutoff (avoids the
# FutureWarning emitted when rcond is left unspecified).
params = lstsq(Xmat, y, rcond=None)
np.set_printoptions(precision=3)
print(params)

(array([ 1.098,  0.022]), array([ 0.986]), 2, array([ 54.079,   1.838])) ### ... then solve them with pandas and statsmodels¶

pandas handles "nan" gracefully, and statsmodels also provides more information about the fit. So let's combine pandas with statsmodels, and compare the results.

In :
import statsmodels.formula.api as smf

# Convert them into a pandas DataFrame
df = pd.DataFrame(data, columns=['glucose', 'Vcf'])

model_fit = smf.ols('Vcf~glucose', df).fit()

print(model_fit.summary2())

                 Results: Ordinary least squares
=================================================================
Dependent Variable: Vcf              AIC:                -3.1672
Date:               2017-06-07 22:00 BIC:                -0.8962
No. Observations:   23               Log-Likelihood:     3.5836
Df Model:           1                F-statistic:        4.414
Df Residuals:       21               Prob (F-statistic): 0.0479
R-squared:          0.174            Scale:              0.046957
-------------------------------------------------------------------
Coef.    Std.Err.     t      P>|t|    [0.025   0.975]
-------------------------------------------------------------------
Intercept     1.0978     0.1175   9.3446   0.0000   0.8535   1.3421
glucose       0.0220     0.0105   2.1010   0.0479   0.0002   0.0437
-----------------------------------------------------------------
Omnibus:              1.717        Durbin-Watson:           1.802
Prob(Omnibus):        0.424        Jarque-Bera (JB):        1.270
Skew:                 0.562        Prob(JB):                0.530
Kurtosis:             2.756        Condition No.:           29
=================================================================



## Correlation¶

Pearson correlation, and two types of rank correlation (Spearman, Kendall)

Comparing age and percentage of body-fat (measured by dual-photon absorptiometry) for 18 normal adults.

In :
# Download the age / body-fat data set (same repository as above)
inFile = 'altman_11_1.txt'
url = url_base + inFile
data = np.genfromtxt(urlopen(url), delimiter=',')

# First column: age, second column: percent body-fat
x, y = data[:, 0], data[:, 1]

In :
# Compute the three correlation coefficients, keeping only the
# coefficient itself ([0]) and discarding the accompanying p-value.
corr = {
    'pearson': stats.pearsonr(x, y)[0],
    'spearman': stats.spearmanr(x, y)[0],
    'kendall': stats.kendalltau(x, y)[0],
}

print(corr)

{'pearson': 0.79208623217849128, 'spearman': 0.75387958553761558, 'kendall': 0.57620948508912284}

In :
# Show that Spearman's rho is just the Pearson's R of the rank-ordered data.
# stats.pearsonr returns a (r, p-value) tuple; keep only r here — passing
# the whole tuple to the "{1:5.3f}" format specifier below would raise
# "TypeError: unsupported format string passed to tuple.__format__".
r_rankordered, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print("Spearman's rho = {0:5.3f}, and Pearson's r (rankordered) = {1:5.3f}".format(corr['spearman'], r_rankordered))

Spearman's rho = 0.754, and Pearson's r (rankordered) = 0.754

In [ ]: