In this notebook we will use scipy.stats.linregress to run a linear regression. We will delve into the mathematics involved in defining the linear regression line in another notebook.
# HIDDEN
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import scipy.stats as stats
import scipy.stats
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=np.VisibleDeprecationWarning)
Throughout this class we have routinely added the regression line to our scatterplots using the fit_line = True option in our scatter calls.
But how do we determine the equation for that line? The method for finding it is known as linear regression.
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014.scatter("Critical Reading", "Math", fit_line = True)
In a separate notebook we'll go through the theory behind linear regression. For now, let's cut to the end of that discussion and give the formula for the regression line.
If the following equation is used to predict values of $y$ from values of $x$:
$$\hat{y} = b_1x + b_0$$
then the slope is
$$b_1 = r \frac{S_y}{S_x}$$
Also this line always passes through the point $(\overline{x}, \overline{y})$.
From the two facts above we can determine that:
$$ \begin{array}{rl}b_0&= \overline{y} - b_1\overline{x} \\ &\displaystyle = \overline{y} - r \frac{S_y}{S_x}\overline{x} \end{array}$$
Naturally, there is a command that gives us all of that plus more information: linregress in the scipy.stats module.
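Before calling it, here is a minimal sketch of those formulas as a helper function (a hypothetical regression_coefficients, not part of scipy; it assumes x and y are numpy arrays of equal length and uses the sample standard deviation, as stats.tstd does):
# Hypothetical helper applying the formulas above; x and y are numpy arrays.
def regression_coefficients(x, y):
    r = stats.pearsonr(x, y)[0]                    # Pearson correlation
    slope = r * stats.tstd(y) / stats.tstd(x)      # b1 = r * Sy / Sx
    intercept = np.mean(y) - slope * np.mean(x)    # b0 = ybar - b1 * xbar
    return slope, intercept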
sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014.show(5)
scipy.stats.linregress(sat2014.column("Critical Reading"), sat2014.column("Math"))
Does this match the formula above?
print(np.mean(sat2014.column("Critical Reading")))
print(stats.tstd(sat2014.column("Critical Reading")))
print(np.mean(sat2014.column("Math")))
print(stats.tstd(sat2014.column("Math")))
print(stats.pearsonr(sat2014.column("Math"), sat2014.column("Critical Reading")))
0.984755*48.26207/47.9030966
slope = stats.pearsonr(sat2014.column("Math"), sat2014.column("Critical Reading"))[0] * stats.tstd(sat2014.column("Math"))/stats.tstd(sat2014.column("Critical Reading"))
slope
intercept = np.mean(sat2014.column("Math")) - slope * np.mean(sat2014.column("Critical Reading"))
intercept
scipy.stats.pearsonr(sat2014.column("Critical Reading"), sat2014.column("Math"))
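As a quick check (a sketch, assuming the slope and intercept variables computed above are still in memory), the hand-computed values should agree with the linregress output up to rounding:
# Compare the hand-computed slope and intercept with linregress (sketch).
result = scipy.stats.linregress(sat2014.column("Critical Reading"), sat2014.column("Math"))
print(np.isclose(result.slope, slope))        # expected to print True
print(np.isclose(result.intercept, intercept))  # expected to print True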
The equation is $\hat{y} = 0.99 x + 7.36$
Or $$ \hat{Math} = 0.99 \cdot Reading + 7.36$$
def reg_line(x):
    return 0.99*x + 7.36
sat2014.scatter("Critical Reading", "Math")
plots.plot([440, 612],[reg_line(440),reg_line(612)])
plots.title("SATs in 2014");
Just about every part of these graphs can be modified. You can look up how to change things in the matplotlib documentation.
In this example, we change a number of features of the graph, just to show you how. The result is too garish to be considered professional, though it is nicely patriotic.
sat2014.scatter("Critical Reading", "Math", edgecolor="blue", facecolor="white", fit_line=True, color="red", marker = "*", s=250)
plots.ylabel("Math", color = "red")
plots.xlabel("Critical Reading", color = "blue")
plots.title("SATs in 2014", c ="gold", size=50);
Recall that scipy.stats.pearsonr
includes a p-value for a hypothesis test with hypotheses
$H_o: \rho = 0$
$H_a: \rho \not = 0$
This is equivalent to a test with these hypotheses
$H_o: \beta_1 = 0$
$H_a: \beta_1 \not = 0$
Furthermore, the scipy.stats.linregress
command runs this equivalent hypothesis test; that is the p-value reported in the linregress output.
Remember that the slope of the regression line is $\displaystyle b_1 = r \frac{S_y}{S_x}$. So if we assume $r= 0$ then $b_1 = 0$, too.
Conversely, if we assume that $b_1 = 0$, then either $r = 0$ or $S_y = 0$. The only way $S_y = 0$ is if the data form a horizontal line, which we would see in the scatterplot, and we already discussed that this makes $r$ undefined.
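As a sketch of that equivalence, using the sat2014 columns already loaded, we can compare the two p-values directly:
# The p-value from the correlation test and from the slope test agree (sketch).
reading_scores = sat2014.column("Critical Reading")
math_scores = sat2014.column("Math")
print(stats.pearsonr(reading_scores, math_scores)[1])   # p-value from the correlation test
print(stats.linregress(reading_scores, math_scores).pvalue)  # p-value from the slope test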
sat2014.scatter("Critical Reading", "Math", fit_line=True)
plots.title("SAT Section Scores")
plots.text(450, 575, "$\hat{Math} = 0.99\cdot Reading + 7.36$\n$r = 0.985$, $p <0.001$")
plots.text(525, 390, "Figure 1", ha='center');
The data is found here.
If $\beta_1$ refers to the slope of the regression line then the null and alternative hypotheses are:
$H_o: \beta_1 = 0$
$H_a: \beta_1 > 0$
Equivalently, we could state these referring to the correlation, $\rho$.
$H_o: \rho = 0$
$H_a: \rho > 0$
Recall that the linregress
function will run this hypothesis test, but by default it runs a two-tailed test. By a two-tailed test we mean the alternative hypothesis is $\beta_1 \not= 0$.
mask = Table().read_table("mask_data.csv")
mask.show(10)
mask.drop("Country", "Cases").scatter("Compliance", fit_line=True)
results = stats.linregress(mask.column("Compliance"), mask.column("Deaths"))
results
Since our alternative hypothesis had a direction, it is appropriate to halve our p-value.
p_value = results[3]/2
p_value
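Strictly speaking, halving is only valid when the estimated slope falls on the side predicted by the alternative hypothesis. Here is a sketch of that check, using the results object from above:
# Halve the two-tailed p-value only if the observed slope is in the
# direction stated by the alternative (here, positive). Sketch only.
if results[0] > 0:               # results[0] is the estimated slope
    p_value = results[3] / 2
else:
    p_value = 1 - results[3] / 2
p_value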
mask.drop("Country", "Cases").scatter("Compliance", fit_line=True)
plots.title("Deaths vs Percent Mask \nCompliance")
plots.text(0.1, 2100, "Deaths = 838 Comp. + 541")
plots.text(.5, -550, "Figure 1", ha='center');
Just because two things are correlated does not necessarily mean that one causes the other (in either direction). It would be irresponsible to state that wearing masks caused the death rates to increase. But what we can responsibly state is that, in Europe, compliance with the mask mandates did not lower the death rates (since the correlation implies that as compliance increased, so did the death rates).
This video gives some interesting examples. In the video, she states that before a correlation can be considered to imply causation, it is important that the researchers know how and why one thing caused the other. I'd add that causation can also be implied by the results of a carefully designed randomized controlled experiment; but those are not the types of studies we hear about most often. They are rare because they are expensive and difficult to run.
from IPython.display import YouTubeVideo
YouTubeVideo("8B271L3NtAw")
births = Table.read_table('baby.csv')
births.relabel("Maternal Pregnancy Weight", "Mother Weight")
births.show(5)
births.scatter("Mother Weight", "Birth Weight", fit_line=True)
stats.linregress(births.column("Mother Weight"), births.column("Birth Weight"))
births.scatter("Mother Weight", "Birth Weight", fit_line=True)
plots.text(140, 55, "Baby = 0.14 Mother + 101.8", color = "red");
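Using the rounded coefficients from the annotation above, here is a sketch of a prediction, say for a mother weighing 150 pounds:
# Hypothetical prediction helper built from the rounded coefficients above.
def predict_birth_weight(mother_weight):
    return 0.14 * mother_weight + 101.8

predict_birth_weight(150)   # predicted birth weight, in the dataset's units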
galton = Table.read_table('galton.csv')
heights = Table().with_columns(
'MidParent', galton.column('midparentHeight'),
'Child', galton.column('childHeight'))
heights
#galton.show(5)
stats.linregress(heights.column('MidParent'), heights.column('Child'))
heights.scatter('MidParent', fit_line=True, color="blue", facecolor="black")
plots.text(65, 75, "$\hat{Child} = 0.64\cdot MidParent + 22.64$\n$r=0.32$, $p <0.001$", color="blue")
plots.title("Galton Heights Data");
#plots.text(69,50,"Figure 1");
Suppose we are trying to predict horsepower from mpg for cars.
auto = Table.read_table("Auto_Data_Set_963_49.csv")
auto = auto.select("name", "year", "mpg", "horsepower")
auto.show(4)
scipy.stats.linregress(auto.column("mpg"), auto.column("horsepower"))
print("Mean mpg is", round(np.mean(auto.column("mpg"))))
print("St. dev. of mpg is", round(stats.tstd(auto.column("mpg")),1))
print("Mean hp is", round(np.mean(auto.column("horsepower"))))
print("St. dev. of hp is", round(stats.tstd(auto.column("horsepower"))))
print("The correlation is", round(stats.pearsonr(auto.column("mpg"), auto.column("horsepower"))[0],1))
slope = -0.8*38/7.8
slope
print("Mean mpg is", round(np.mean(auto.column("mpg")), 6))
print("St. dev. of mpg is", round(stats.tstd(auto.column("mpg")),6))
print("Mean hp is", round(np.mean(auto.column("horsepower")),6))
print("St. dev. of hp is", round(stats.tstd(auto.column("horsepower")),6))
print("The correlation is", round(stats.pearsonr(auto.column("mpg"), auto.column("horsepower"))[0],6))
slope = -0.778427 * 38.49116 / 7.805007
slope
auto.scatter("mpg", "horsepower")
This relationship is clearly not linear! We shouldn't have performed a linear regression, which is why we should always check the scatterplots first. It was only a waste of our time if we haven't learned our lesson about checking the scatterplots.
Anyway, it turns out that if we "transform" the mpg variable by taking its reciprocal, the resulting variable has a linear relationship with horsepower.
def recip(x):
    return x**-1
auto = auto.with_column("mpg_inv", auto.apply(recip, "mpg"))
auto.scatter("mpg_inv", "horsepower")
scipy.stats.linregress(auto.column("mpg_inv"), auto.column("horsepower"))
print("Mean mpg_inv is", round(np.mean(auto.column("mpg_inv")),5))
print("St. dev. of mpg is", round(stats.tstd(auto.column("mpg_inv")),5))
print("Mean hp is", round(np.mean(auto.column("horsepower")),5))
print("St. dev. of hp is", round(stats.tstd(auto.column("horsepower")),5))
print("The correlation is", round(stats.pearsonr(auto.column("mpg_inv"), auto.column("horsepower"))[0],5))
Slope = 0.85481*38.49116/0.01664
Slope
Intercept = 104.46939 - Slope * 0.04782
Intercept
Reporting Out
It turns out that the horsepower is linearly related to the reciprocal of the MPG. That linear relationship is $$\hat{hp} = 1977.38 \cdot \frac{1}{mpg} + 9.91$$
The correlation between horsepower and the reciprocal of MPG is 0.855, with a p-value less than 0.001.
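As a quick sketch of using this model, the predicted horsepower for a car that gets 25 miles per gallon would be:
# Predicted horsepower for a car getting 25 mpg, using the fitted equation above.
1977.38 / 25 + 9.91   # roughly 89 horsepower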
auto.scatter("mpg", "horsepower")
plots.text(25, 200, "$\hat{hp} = 1977.38 \cdot mpg^{-1} + 9.91$" )
plots.text(25, 180, "$r = 0.855, p<0.001$")
plots.title("Relationship Between\nHorsepower & MPG");