In this notebook we will use scipy.stats.linregress to run a linear regression. We will delve into the mathematics involved in defining the linear regression line in another notebook.
# HIDDEN
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import scipy.stats as stats
import scipy.stats
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=np.VisibleDeprecationWarning)
Throughout this class we have routinely added the regression line to our scatterplots using the fit_line = True option in our scatter calls.
But how do we determine the equation for that line? The method for finding it is known as linear regression.
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014.scatter("Critical Reading", "Math", fit_line = True)
In a separate notebook we'll go through the theory behind linear regression. For now, let's cut to the end of that discussion and give the formula for the regression line.
If the following equation is used to predict values of $y$ from values of $x$:
$$\hat{y} = b_1x + b_0$$
then the slope is
$$b_1 = r \frac{S_y}{S_x}$$
Also this line always passes through the point $(\overline{x}, \overline{y})$.
From the two facts above we can determine that:
$$ \begin{array}{rl}b_0&= \overline{y} - b_1\overline{x} \\ &\displaystyle = \overline{y} - r \frac{S_y}{S_x}\overline{x} \end{array}$$
Naturally, there is a command that gives us all of that plus more information: linregress in the scipy.stats module.
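Before calling it, here is a minimal sketch of those formulas as a helper function (a hypothetical regression_coefficients, not part of scipy; it assumes x and y are numpy arrays of equal length and uses the sample standard deviation, as stats.tstd does):
# Hypothetical helper applying the formulas above; x and y are numpy arrays.
def regression_coefficients(x, y):
    r = stats.pearsonr(x, y)[0]                    # Pearson correlation
    slope = r * stats.tstd(y) / stats.tstd(x)      # b1 = r * Sy / Sx
    intercept = np.mean(y) - slope * np.mean(x)    # b0 = ybar - b1 * xbar
    return slope, intercept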
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014.show(5)
scipy.stats.linregress(sat2014.column("Critical Reading"), sat2014.column("Math"))
Does this match the formula above?
print(np.mean(sat2014.column("Critical Reading")))
print(stats.tstd(sat2014.column("Critical Reading")))
print(np.mean(sat2014.column("Math")))
print(stats.tstd(sat2014.column("Math")))
print(stats.pearsonr(sat2014.column("Math"), sat2014.column("Critical Reading")))
0.984755*48.26207/47.9030966
slope = stats.pearsonr(sat2014.column("Math"), sat2014.column("Critical Reading"))[0] * stats.tstd(sat2014.column("Math"))/stats.tstd(sat2014.column("Critical Reading"))
slope
intercept = np.mean(sat2014.column("Math")) - slope * np.mean(sat2014.column("Critical Reading"))
intercept
scipy.stats.pearsonr(sat2014.column("Critical Reading"), sat2014.column("Math"))
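As a quick check (a sketch, assuming the slope and intercept variables computed above are still in memory), the hand-computed values should agree with the linregress output up to rounding:
# Compare the hand-computed slope and intercept with linregress (sketch).
result = scipy.stats.linregress(sat2014.column("Critical Reading"), sat2014.column("Math"))
print(np.isclose(result.slope, slope))        # expected to print True
print(np.isclose(result.intercept, intercept))  # expected to print True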
The equation is $\hat{y} = 0.99 x + 7.36$
Or $$ \hat{Math} = 0.99 \cdot Reading + 7.36$$
def reg_line(x):
    return 0.99*x + 7.36
sat2014.scatter("Critical Reading", "Math")
plots.plot([440, 612],[reg_line(440),reg_line(612)])
plots.title("SATs in 2014");
Just about every part of these graphs can be modified. You can look up how to change things in the matplotlib documentation.
In this example, we change a number of features of the graph, just to show you how. The result is too garish to be considered professional, though it is nicely patriotic.
sat2014.scatter("Critical Reading", "Math", edgecolor="blue", facecolor="white", fit_line=True, color="red", marker = "*", s=250)
plots.ylabel("Math", color = "red")
plots.xlabel("Critical Reading", color = "blue")
plots.title("SATs in 2014", c ="gold", size=50);
Recall that scipy.stats.pearsonr
includes a p-value for a hypothesis test with hypotheses
$H_o: \rho = 0$
$H_a: \rho \not = 0$
This is equivalent to a test with these hypotheses
$H_o: \beta_1 = 0$
$H_a: \beta_1 \not = 0$
Furthermore, the scipy.stats.linregress
command runs this equivalent hypothesis test; that is the p-value reported in the linregress output.
Remember that the slope of the regression line is $\displaystyle b_1 = r \frac{S_y}{S_x}$. So if we assume $r= 0$ then $b_1 = 0$, too.
Conversely, if we assume that $b_1 = 0$, then either $r = 0$ or $S_y = 0$. The only way $S_y = 0$ is if the data form a horizontal line, which we would see in the scatterplot, and we already discussed that this makes $r$ undefined.
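As a sketch of that equivalence, using the sat2014 columns already loaded, we can compare the two p-values directly:
# The p-value from the correlation test and from the slope test agree (sketch).
reading_scores = sat2014.column("Critical Reading")
math_scores = sat2014.column("Math")
print(stats.pearsonr(reading_scores, math_scores)[1])   # p-value from the correlation test
print(stats.linregress(reading_scores, math_scores).pvalue)  # p-value from the slope test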
sat2014.scatter("Critical Reading", "Math", fit_line=True)
plots.title("SAT Section Scores")
plots.text(450, 575, "$\hat{Math} = 0.99\cdot Reading + 7.36$\n$r = 0.985$, $p <0.001$")
plots.text(525, 390, "Figure 1", ha='center');
The data is found here.
If $\beta_1$ refers to the slope of the regression line then the null and alternative hypotheses are:
$H_o: \beta_1 = 0$
$H_a: \beta_1 > 0$
Equivalently, we could state these referring to the correlation, $\rho$.
$H_o: \rho = 0$
$H_a: \rho > 0$
Recall that the linregress
function will run this hypothesis test, but by default it runs a two-tailed test. By a two-tailed test we mean the alternative hypothesis is $\beta_1 \not= 0$.
mask = Table().read_table("mask_data.csv")
mask.show(10)
mask.drop("Country", "Cases").scatter("Compliance", fit_line=True)
results = stats.linregress(mask.column("Compliance"), mask.column("Deaths"))
results
Since our alternative hypothesis had a direction, it is appropriate to halve our p-value.
p_value = results[3]/2
p_value
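Strictly speaking, halving is only valid when the estimated slope falls on the side predicted by the alternative hypothesis. Here is a sketch of that check, using the results object from above:
# Halve the two-tailed p-value only if the observed slope is in the
# direction stated by the alternative (here, positive). Sketch only.
if results[0] > 0:               # results[0] is the estimated slope
    p_value = results[3] / 2
else:
    p_value = 1 - results[3] / 2
p_value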
mask.drop("Country", "Cases").scatter("Compliance", fit_line=True)
plots.title("Deaths vs Percent Mask \nCompliance")
plots.text(0.1, 2100, "Deaths = 838 Comp. + 541")
plots.text(.5, -550, "Figure 1", ha='center');
Just because two things are correlated does not necessarily mean that one causes the other (in either direction). It would be irresponsible to state that wearing masks caused the death rates to increase. But what we can responsibly state is that, in Europe, compliance with the mask mandates did not lower the death rates (since the correlation implies that as compliance increased, so did the death rates).
This video gives some interesting examples. In the video, she states that before a correlation can be considered to imply causation, it is important that the researchers know how and why one thing caused the other. I'd add that causation can also be implied by the results of a carefully designed randomized controlled experiment; but those are not the types of studies we hear about most often. They are rare because they are expensive and difficult to run.
from IPython.display import YouTubeVideo
YouTubeVideo("8B271L3NtAw")
births = Table.read_table('baby.csv')
births.relabel("Maternal Pregnancy Weight", "Mother Weight")
births.show(5)
births.scatter("Mother Weight", "Birth Weight", fit_line=True)
stats.linregress(births.column("Mother Weight"), births.column("Birth Weight"))
births.scatter("Mother Weight", "Birth Weight", fit_line=True)
plots.text(140, 55, "Baby = 0.14 Mother + 101.8", color = "red");
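Using the rounded coefficients from the annotation above, here is a sketch of a prediction, say for a mother weighing 150 pounds:
# Hypothetical prediction helper built from the rounded coefficients above.
def predict_birth_weight(mother_weight):
    return 0.14 * mother_weight + 101.8

predict_birth_weight(150)   # predicted birth weight, in the dataset's units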
galton = Table.read_table('galton.csv')
heights = Table().with_columns(
'MidParent', galton.column('midparentHeight'),
'Child', galton.column('childHeight'))
heights
#galton.show(5)
stats.linregress(heights.column('MidParent'), heights.column('Child'))
heights.scatter('MidParent', fit_line=True, color="blue", facecolor="black")
plots.text(65, 75, "$\hat{Child} = 0.64\cdot MidParent + 22.64$\n$r=0.32$, $p <0.001$", color="blue")
plots.title("Galton Heights Data");
#plots.text(69,50,"Figure 1");
Suppose we are trying to predict horsepower from mpg for cars.
auto = Table.read_table("Auto_Data_Set_963_49.csv")
auto = auto.select("name", "year", "mpg", "horsepower")
auto.show(4)
scipy.stats.linregress(auto.column("mpg"), auto.column("horsepower"))
print("Mean mpg is", round(np.mean(auto.column("mpg"))))
print("St. dev. of mpg is", round(stats.tstd(auto.column("mpg")),1))
print("Mean hp is", round(np.mean(auto.column("horsepower"))))
print("St. dev. of hp is", round(stats.tstd(auto.column("horsepower"))))
print("The correlation is", round(stats.pearsonr(auto.column("mpg"), auto.column("horsepower"))[0],1))
slope = -0.8*38/7.8
slope
print("Mean mpg is", round(np.mean(auto.column("mpg")), 6))
print("St. dev. of mpg is", round(stats.tstd(auto.column("mpg")),6))
print("Mean hp is", round(np.mean(auto.column("horsepower")),6))
print("St. dev. of hp is", round(stats.tstd(auto.column("horsepower")),6))
print("The correlation is", round(stats.pearsonr(auto.column("mpg"), auto.column("horsepower"))[0],6))
slope = -0.778427 * 38.49116 / 7.805007
slope
auto.scatter("mpg", "horsepower")
This relationship is clearly not linear! We shouldn't have performed a linear regression, which is why we should always check the scatterplots first. It was only a waste of our time if we haven't learned our lesson about checking the scatterplots.
Anyway, it turns out that if we "transform" the mpg variable by taking its reciprocal, the resulting variable has a linear relationship with horsepower.
def recip(x):
    return x**-1
auto = auto.with_column("mpg_inv", auto.apply(recip, "mpg"))
auto.scatter("mpg_inv", "horsepower")
scipy.stats.linregress(auto.column("mpg_inv"), auto.column("horsepower"))
print("Mean mpg_inv is", round(np.mean(auto.column("mpg_inv")),5))
print("St. dev. of mpg is", round(stats.tstd(auto.column("mpg_inv")),5))
print("Mean hp is", round(np.mean(auto.column("horsepower")),5))
print("St. dev. of hp is", round(stats.tstd(auto.column("horsepower")),5))
print("The correlation is", round(stats.pearsonr(auto.column("mpg_inv"), auto.column("horsepower"))[0],5))
Slope = 0.85481*38.49116/0.01664
Slope
Intercept = 104.46939 - Slope * 0.04782
Intercept
Reporting Out
It turns out that the horsepower is linearly related to the reciprocal of the MPG. That linear relationship is $$\hat{hp} = 1977.38 \cdot \frac{1}{mpg} + 9.91$$
The correlation between horsepower and the reciprocal of MPG is 0.855, with a p-value less than 0.001.
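As a quick sketch of using this model, the predicted horsepower for a car that gets 25 miles per gallon would be:
# Predicted horsepower for a car getting 25 mpg, using the fitted equation above.
1977.38 / 25 + 9.91   # roughly 89 horsepower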
auto.scatter("mpg", "horsepower")
plots.text(25, 200, "$\hat{hp} = 1977.38 \cdot mpg^{-1} + 9.91$" )
plots.text(25, 180, "$r = 0.855, p<0.001$")
plots.title("Relationship Between\nHorsepower & MPG");