# Run this first so it's ready by the time you need it
install.packages("dplyr")
install.packages("supernova")
install.packages("ggformula")
library(dplyr)
library(supernova)
library(ggformula)
In the previous chapter we figured out how to add an explanatory variable to a model. We explained thumb length using sex, and noticed how doing so reduced the error in our model and made our predictions more accurate.
In that example, Sex is a categorical variable. It has two easily distinguishable groups and separate means for each of the groups that we can model with $b_0$ and $b_1$ in the general linear model equation. But what about a different sort of variable, like height? How do we take into account interval variables like this that don't have clearly distinguishable group means?
One option is to create categorical groups out of this data. We could say anyone who is shorter than the mean height is short, and anyone taller than the mean height is tall. That way we now have a categorical variable with two levels, "short" and "tall", and we can use it in a model the same way we used Sex.
There are some problems with this approach, however. First, remember that the definition of a categorical variable is one where there is no quantitative relationship between the different categories. That holds true for a variable like Sex - it doesn't make sense to say the "female" level is any more or less than the "male" level. But for categories like "short" and "tall", there is an inherent quantitative relationship between them. "Short" means less than "tall" by definition. So forcing Height to be a categorical variable is misspecifying the meaning of Height.
The other problem is that by forcing all values in Height to be in one category or another, we are throwing away information that the variable can give us for the purposes of modeling the data generation process. Less information means less flexible models and worse predictions. We'll explore this in greater detail later in the chapter.
So instead of using height in inches to divide students into groups (e.g., short or tall), let's just model height as a continuous interval variable. Earlier in the course we learned how to visualize interval data in a scatterplot. In this chapter we will figure out how to extend our models to accommodate quantitative explanatory variables.
The models we develop in this chapter are a special type usually called regression models. Before we start, though, note that the core ideas behind these new models are exactly the same as those we have developed for group-based models. A model still yields a single predicted score for each observation, based on some mathematical function that includes the explanatory variable. Further, in regression models, we still use residuals (the difference between the predicted and observed score) to measure error around the model. We also still use the sum of squared deviations from the model predictions as a measure of model fit. And we still use PRE to indicate the proportion reduction in error of the regression model compared with the null model. It's almost all the same!
Let's return to our `tiny_data` data set, but this time add `Height` as an explanatory variable for predicting `Thumb` lengths.
student_ID <- c(1, 2, 3, 4, 5, 6)
Thumb <- c(56, 60, 61, 63, 64, 68)
Height <- c(62, 66, 67, 63, 68, 71)
tiny_data <- data.frame(student_ID, Thumb, Height)
tiny_data
Now, let's make a new variable that recodes `Height` as groups, `HeightGroup`.
# Make a logical variable: TRUE if height is greater than or equal to the average, FALSE otherwise
# Then recode the logical values to meaningful labels
tiny_data <- mutate(tiny_data, HeightGroup = Height >= mean(Height)) %>%
  mutate(., HeightGroup = case_match(HeightGroup, FALSE ~ "short",
                                     TRUE ~ "tall"))
tiny_data
In the graph below, we illustrate the two-group approach to modeling this data. Just as a reminder, we would specify this model — the HeightGroup model — as a two-parameter model, with one parameter being the mean thumb length for short people, the other, the difference in group means for short and tall people.
We can fit a model to this data and compare it to the null model, which is represented in the graph above with the blue horizontal line. The predicted score for everyone in the null model is the Grand Mean.
We can see just by inspecting the graph above, without even making an ANOVA table, that the two-group model will yield better predictions than the null model. Knowing whether someone is short or tall would help us make a more accurate prediction of their thumb length than if we didn’t know which group they were in and made predictions with the Grand Mean only.
But we can also generate an ANOVA table for the HeightGroup model to compare errors between it and the null model. The $SS_{total}$ is the sum of squared deviations in the null model. The $SS_{error}$ is the sum of squared deviations of each score from its group mean (`short` or `tall`).
height_group_model <- lm(Thumb ~ HeightGroup, data=tiny_data)
height_group_model
supernova(height_group_model)
Now let's compare what happens when we plot `Height` as a quantitative variable instead of as a grouping variable.
The scatterplot (right) shows `Height` on the x-axis measured in inches (the explanatory variable) and thumb length measured in millimeters on the y-axis (the outcome variable). We have included the `HeightGroup` plot on the left for comparison. The same six data points are represented in the graph on the left (which uses `HeightGroup` as the explanatory variable) as in the graph on the right (which uses `Height` in inches). Whereas the points on the left are categorized into two groups, the points on the right are spread out over the x-axis, with each person's height and thumb length represented as a point in two-dimensional space.
The null model can be represented the same way in both graphs, by a horizontal line through the Grand Mean.
The null model of the outcome variable is exactly the same no matter how you code the explanatory variable: it is just the mean of the six data points in the outcome variable. It is the baseline model performance we start with before taking any explanatory variables into account.
However, we get a sense from the scatterplot that knowing someone’s exact height would help us make a better prediction of their thumb length than just knowing whether they are short or tall. But where, in the scatterplot, is the model? What are the groups to make means for?
When the explanatory variable is categorical, as with `HeightGroup`, we can use the group means as our model (as reviewed above). But when the explanatory variable is quantitative, as with `Height` in inches, we can't use the group means to predict new scores because there are no groups.
When both the outcome and explanatory variables are quantitative, we need to use the model differently. We can’t use the group mean as a model, because we don’t have groups. Instead we use a line, the regression line.
The regression line has a lot more in common with the mean as a model than you might at first think. Recall that last chapter we learned how the general linear model is a lot like the equation for a line, $y = mx + b$. For a group model, it describes the intercept and slope of a line that connects the mean of one group to the mean of the other group. Where the line overlaps with the group values is the prediction we would make for a new data point with that group label (the sketch below checks this against the model we just fit).
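To see this connection concretely, here is a minimal sketch (using the `tiny_data` and `height_group_model` objects created above) that computes the two group means by hand and compares them to the fitted coefficients:

# b_0 should match the mean Thumb length of the "short" group,
# and b_1 the difference between the "tall" and "short" group means
short_mean <- mean(tiny_data$Thumb[tiny_data$HeightGroup == "short"])
tall_mean <- mean(tiny_data$Thumb[tiny_data$HeightGroup == "tall"])
short_mean              # compare to the intercept estimate
tall_mean - short_mean  # compare to the HeightGroup slope estimate

If the two printed values match the estimates from `height_group_model`, then the group model really is just a line through the two group means.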
When our explanatory variable is quantitative, we can still draw a line through all the points. The line, just as before, represents the prediction $\hat{Y}_i$ that we would make for a new data point with some explanatory variable value $X_i$.
Both the Grand Mean and the regression line, when used as models, generate predicted scores on the outcome variable, both for existing data points and for new observations that might be created in the future. The two graphs below illustrate how these predictions work for the null model (left) and the `Height` regression model (right).
In the left panel of the figure above, the blue horizontal line is drawn at the Grand Mean of the outcome variable, `Thumb`, representing the null model. We have indicated the model prediction for each data point, represented as blue dots at the Grand Mean. Each blue dot represents a person's height and predicted thumb length according to the null model. For example, one student has a height of 66 inches. Using the null model, their predicted thumb length (where the blue line intersects X=66) is the mean, or 62 millimeters. As you can see, in the null model, every person has the same predicted thumb length: 62 millimeters.
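You can check this single predicted value directly; it is just the Grand Mean of `Thumb` in `tiny_data`:

# the null model makes one prediction for everyone: the Grand Mean of Thumb
mean(tiny_data$Thumb)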
On the right in orange, we have drawn a regression line, which represents the `Height` model, over the same `tiny_data` data points. In the `Height` model, we use information about each person's height to predict their thumb length. The orange dots on the regression line represent the predicted `Thumb` values for each of the six data points in the `tiny_data` dataset. Notice that the orange dots differ depending on height. Thus, we predict a longer thumb for someone who is 68 inches than for someone 66 inches tall.
Using the regression line, we can even predict thumb lengths for values of height that do not exist in our data. To predict the thumb length of someone 64 inches tall, find 64 inches on the x-axis. Then go up vertically to the regression line to find the model’s prediction for that X value. Even though no one in this dataset is 64 inches tall, the regression line lets us predict a thumb length for someone like this: 60 mm.
As we have said all along, all models are wrong. What we are looking for is a model that is better than nothing, or, as you might by now suggest, better than the null model. Comparing error between the `Height` model and the null model lets us see which model is less wrong.
Error, in both the null model and the regression model, is defined as the residual, or the observed score minus the score predicted by the model for each data point. Error in the null model is the deviation of an observed score from the Grand Mean. Error in the regression model is the deviation of the observed score from the regression line, measured in vertical distance.
If the regression model is a better predictor than the null model, then the sum of these squared residuals should be smaller in the regression model than in the null model. In fact, that's how we find the exact intercept and slope of the regression line to use for predictions: we choose the line that minimizes the error between its predictions and the actual data points.
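For the curious, this minimization has a well-known closed-form solution; the best-fitting slope and intercept can be written as:

$$ b_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1\bar{X} $$

The formula for $b_0$ also guarantees that the regression line passes through the point $(\bar{X}, \bar{Y})$, a fact we will come back to shortly.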
Recall that the mean is the middle of a univariate distribution (a distribution of one variable), equally balancing the residuals above and below it. In a similar way, the regression line is the middle of a bivariate distribution, the pattern of variation in two quantitative variables. Just as the residuals around the mean add up to 0, so too do the residuals around the regression line add up to 0.
Here's another cool relationship between the mean and the regression line: if someone had an average height, intuitively we would predict that their thumb length might also be average. And it turns out that the regression line passes through the point whose coordinates are the mean of the explanatory variable and the mean of the outcome variable.
Let's look at how we specify the model in the case where we have a single quantitative explanatory variable (such as `Height`). We will write the model like this:

$$ Y_i = b_0 + b_1X_i + e_i $$
It might be useful to compare this notation to that used in the previous chapter to specify the group model (e.g., using `Sex` or `HeightGroup` as a predictor):

$$ Y_i = b_0 + b_1X_i + e_i $$
It’s not a typo: both of the model specifications are identical. This, in fact, is what is so beautiful about the General Linear Model. It is simple and elegant and can be applied across a wide variety of situations, including situations with categorical or quantitative explanatory variables.
Although both models are specified using the same notation, the interpretation of the equation components varies from situation to situation. It is always important to think, first, about what each component of the model means for your use case.
As before in our DATA = MODEL + ERROR framework, $Y_i$ is the DATA, $e_i$ is the ERROR, and $b_0 + b_1X_i$ represents the MODEL. Let's now think about what each of these elements means in the context of our `tiny_data` dataset. Let's also consider what might be different about the interpretation in the current case, with height as a quantitative variable, compared with the previous case in which height was coded as categories (short vs. tall).
Both $Y_i$ and $e_i$ have the same interpretation in the regression model as in the group model. The outcome variable in both cases is the thumb length of each person, measured as a quantitative variable. And the error term is each person’s deviation from their predicted thumb length.
The explanatory variable $X_i$, however, has a different meaning in the two models. This may be clearer when we write the two models like this:
Group Model: $Thumb_i = b_0 + b_1HeightGroup_i + e_i$
Regression Model: $Thumb_i = b_0 + b_1Height_i + e_i$
In the group model, the variable $X_i$ (`HeightGroup`) was coded as "tall" or "short". Because it divided people into one of two categories, the model could only make the simple prediction of one mean thumb length for short people, and another for tall people. In the regression model, in contrast, $X_i$ (`Height`) is the actual measurement of person i's height in inches. This model can make a different prediction of thumb length for every possible value of height.
Coding $X_i$ in these different ways leads to different, but related, interpretations of the $b_1$ coefficient. In the group model, you will recall, the coefficient represents the increment in millimeters that must be added to the mean thumb length for short people to get the mean thumb length for tall people. This increment is added to the prediction only when $X_i$ is equal to 1 in the dummy variable version of `HeightGroup`.
Because $X_i$ in the regression model represents the measured height of each individual in inches, $b_1$ represents the increment that must be added to the predicted thumb length of a person for each one-unit increment in height. If that sounds familiar to you, it may be because it sounds exactly like the definition of the slope of a line (how much “rise” for each one unit of “run”). $b_1$ is, in fact, the slope of the best-fitting regression line.
If $b_1$ is the slope of the line, it stands to reason that $b_0$ will be the y-intercept of the line; in other words, the predicted value of Y when X is equal to 0. Because the regression line is a line, it needs to be defined by both a slope and an intercept.
Does it make sense for someone to have a height of 0? In the case of a height model trying to predict thumb length, the intercept is purely theoretical: it's impossible for someone to be 0 inches tall! So it's pretty weird to predict thumb length for a height of 0, but technically the model can make that prediction anyway. Knowing the intercept, regardless of whether this prediction is ever actually made, is important simply for defining the line that helps us make predictions for realistic values.
We have summarized these differences between categorical and regression models in the table below.
Since the statistical equation for the categorical model and the regression model is the same, fitting (that is, estimating the parameters of) the regression model is accomplished the same way as fitting the group model. It's all done using the `lm()` function in R.
The `lm()` function is smart enough to know that if the explanatory variable is quantitative, it should estimate the regression model. If the explanatory variable is categorical (e.g., defined as a character or factor in R), `lm()` will fit a group model.
In fact, under the hood `lm()` actually treats the group model as a regression model. This is why it converts categorical labels to a dummy version of the variable, with 0 for one level and 1 for the other. This way, calculating the slope of a regression line (the change in Y for a one-unit increase in the explanatory variable) amounts to finding the change in Y when going from X=0 to X=1; in other words, going from one level to the other.
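If you want to see this for yourself, `model.matrix()` will show the design matrix that `lm()` builds from a formula and data; here is a quick sketch using the `tiny_data` from above, in which the `HeightGroup` labels have been converted into a dummy-coded column:

# the HeightGrouptall column is the dummy-coded version of HeightGroup:
# 0 for "short" and 1 for "tall"
model.matrix(Thumb ~ HeightGroup, data = tiny_data)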
Modify the code below to fit the regression model using `Height` as the explanatory variable and predicting `Thumb` in the `tiny_data` data.
# modify this to fit Thumb as a function of Height
tiny_Height_model <- lm()
# this prints the best-fitting estimates
tiny_Height_model
Although R is pretty smart about knowing which model to fit, it won't always think of your data values in the same way you do. If you code the grouping variable with the character strings "short" and "tall," R will make a dummy variable out of them with 0's and 1's as levels. But if you code the same grouping variable as 1's and 2's yourself, and you forget to make it a factor, R will treat those values as quantitative and fit the model as though the levels really are the numbers 1 and 2, which means the y-intercept (the prediction at X=0) will not be what we expect.
For example, we've added a new variable to `tiny_data` called `GroupNum`. Here is what the data look like.
tiny_data$GroupNum <- c(1,1,2,1,2,2)
tiny_data
If you take a look at the variables `HeightGroup` and `GroupNum`, they have the same information. Students 1, 2, and 4 are in one group and students 3, 5, and 6 are in another group. If we fit a model with `HeightGroup` (and called it the `HeightGroup_model`) or `GroupNum` (and called it the `GroupNum_model`), we would want the two models to have the same coefficient estimates. Let's try it.
# fit a model of Thumb length based on HeightGroup
HeightGroup_model <- lm()
# fit a model of Thumb length based on GroupNum
GroupNum_model <- lm()
# this prints the parameter estimates from the two models
HeightGroup_model
GroupNum_model
$b_0$ is the y-intercept of the line, where X=0. In a group model, we can interpret it as the mean of the reference group, but only if that reference group is coded as 0. If we code it as 1 and the other group as 2, the model still makes its intercept prediction at X=0, a value that doesn't correspond to either group. In that case $b_0$ represents the prediction at X=0 (the mean of group 1 minus one slope's worth, $b_1$) rather than the mean of our reference group.
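One way to avoid this pitfall is to explicitly convert the numeric codes into a factor before fitting, so that `lm()` treats them as category labels rather than quantities. Here is a minimal sketch (the names `GroupNum_factor` and `GroupNum_factor_model` are just illustrative):

# converting the 1/2 codes to a factor makes lm() dummy-code them like any categorical variable
tiny_data$GroupNum_factor <- factor(tiny_data$GroupNum)
GroupNum_factor_model <- lm(Thumb ~ GroupNum_factor, data = tiny_data)
# b_0 is once again the mean of the first group, and b_1 the difference between the group means
GroupNum_factor_model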
Now that you have looked in detail at the tiny set of data, fit the height model to the full `studentdata` data frame, and save the model in an R object called `Height_model`.
studentdata <- read.csv("https://raw.githubusercontent.com/smburns47/Psyc158/main/studentdata.csv")
# modify this to fit the Height model of Thumb for the studentdata data
Height_model <- lm()
# this prints best estimates
Height_model
Here is the code to make a scatterplot showing the relationship between `Height` (on the x-axis) and `Thumb` (on the y-axis). Note that the code also overlays the best-fitting regression line on the scatterplot.
gf_point(Thumb ~ Height, data = studentdata, size = 3) %>%
gf_lm(color = "orange")
The specific regression line, defined by its slope and intercept, is the one that fits our data best. By this we mean that this model reduces leftover error to the smallest level possible given our variables. Specifically, the sum of squared deviations around this line is the lowest of any possible line we could have used instead. Like the null and group models, error around the regression line is also balanced. You can almost imagine the data points each pulling on the regression line; the best-fitting regression line balances the "pulls" above and below it.
This regression model also is our best estimate of the relationship between height and thumb length in the population. As with other estimates of the population, we can use the regression model to predict future observations. To do so we must turn it into an equation, one that will predict thumb length based on height.
Here is the fitted model for using `Height` to predict `Thumb` based on the complete `studentdata` dataset, where $Y_i$ is each observation of `Thumb` and $X_i$ is each observation of `Height`:

$$ Y_i = 2.84 + 0.87X_i + e_i $$
Remember, an equation takes in some input and spits out a prediction based on a model. Here is the equation we can use to predict a thumb length based on a person’s height:
$$ \hat{Y}_i = 2.84 + 0.87X_i $$

With the group model it was easy to make predictions from the model: no calculation was required to see that if the person was female, the prediction would be the mean for female people; and if the person was male, the prediction would be the mean for male people. But with the regression model it's harder to do the calculation in your head.
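For example, plugging a height of 60 inches into the rounded estimates:

$$ \hat{Y}_i = 2.84 + 0.87 \times 60 \approx 55.0 $$

That is, the model predicts a thumb length of about 55 millimeters. Happily, R can do this arithmetic for us.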
Remember, we can use the `$coefficients` element of a model object to pull out the coefficient estimates $b_0$ and $b_1$. We can pull out the values of specific ones with indexing. So if we wanted to generate a predicted thumb length using the `Height_model` for someone who is 60 inches tall, we could write:
#using double-bracket indexing ([[]]) instead of single bracket removes the names of the coefficients
#try using single-bracket indexing to see this difference
b0 <- Height_model$coefficients[[1]]
b1 <- Height_model$coefficients[[2]]
b0 + b1*60
#how would you output a prediction for someone who is 73.5 inches tall?
This code works fine for making individual predictions, but to check our model quality overall, we would want to generate predictions and residuals for each student in the `studentdata` data frame. As we've said before, we really don't need predictions when we already know their actual thumb lengths. But this is a way to see how well (or how poorly) the model would have predicted the thumb lengths for the students in our data set.
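Rather than typing the equation out for each student, we can let R generate all of the predictions at once. Here is a minimal sketch using the base R `predict()` function (the column name `Height_predict` is just an illustrative choice):

# generate a predicted Thumb length for every row of studentdata from the fitted Height_model
studentdata$Height_predict <- predict(Height_model, newdata = studentdata)
head(select(studentdata, Thumb, Height, Height_predict))

Each prediction is just $b_0 + b_1X_i$ evaluated at that student's height, the same calculation we did by hand above.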
You probably remember from the previous chapters how to save the residuals from a model as well. We can do the same thing with a regression model: whenever we fit a model, we can generate both predictions and residuals from it. Try to generate the residuals from the `Height_model` that you fit to the full `studentdata` dataset.
# modify to save the residuals from Height_model
studentdata$Height_resid <- resid()
# modify to make a histogram of Height_resid
gf_histogram()
Finally, let's examine the fit of our regression model by running the `supernova()` function on our model. At the same time, let's compare the table we get from the regression model (`Height_model`) with the one produced for a `HeightGroup_model`.
studentdata$HeightGroup <- studentdata$Height >= mean(studentdata$Height, na.rm=TRUE)
HeightGroup_model <- lm(Thumb ~ HeightGroup, data=studentdata)
print("HeightGroup model")
supernova(HeightGroup_model)
print("Height model")
supernova(Height_model)
Remember, the total sum of squares is the sum of squared residuals in the null model. Total sum of squares is all about the outcome variable, and isn’t affected by the explanatory variable.
For any model with an explanatory variable (what we have been calling “full models”), the $SS_{total}$ can be partitioned into the $SS_{error}$ and the $SS_{model}$. The $SS_{model}$ is the amount by which the error is reduced by the full model (e.g., the Height model) compared with the null model.
As we developed previously for a group model, $SS_{model}$ is easily calculated by subtracting $SS_{error}$ from $SS_{total}$. This is the same, regardless of whether you are fitting a group model or a regression model. Error from the model is defined in the former case as residuals from the group means, and in the latter, residuals from the regression line.
PRE has the same interpretation in the context of regression models as it does for group models. As we have pointed out, the total sum of squares is the same for both models. And the PRE is obtained in both cases by dividing $SS_{model}$ by $SS_{total}$.
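Putting the relationships from the last few paragraphs together in symbols:

$$ SS_{total} = SS_{model} + SS_{error}, \qquad PRE = \frac{SS_{model}}{SS_{total}} = 1 - \frac{SS_{error}}{SS_{total}} $$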
Many statistics textbooks emphasize the difference between group-based models (using t-tests) and regression models. They try to get students to think of them as separate things. But in fact, the two types of models are fundamentally the same and easily incorporated into the General Linear Model framework.
As we can see by comparing the "Error" and "Total" lines, both the HeightGroup model and the Height model reduce error compared to the null model. They both are better models than using the Grand Mean to make predictions on `Thumb`. But how do they compare to each other?
We learned last chapter that PRE stands for the proportion of total variance that a model accounts for. So, if we want to know which model is better for us to use here, we can simply see which gave us a bigger PRE score! Using this standard, we can decide that the better model is the `Height` model, as it accounts for 11.79% of the variance in `Thumb` whereas the `HeightGroup` model only accounts for 7.8%.
Actually, we're being a little too simplistic right now. You can't always directly compare PRE scores between models to decide which is better. You can only do that in the case of models that have the same number of parameters (which these models do). This is because PRE can only go up as you add more predictors, whether or not those predictors are actually helpful. When you start making more complex models with more parameters and comparing them to simpler models, we have to use a different statistic to see if that difference is meaningful. But more on that next chapter.
The reason the Height model did a better job is that, as we mentioned before, shoving all the values of `Height` into one of two categories removed information that this variable was providing. There are now only two unique values instead of many. With less information, we have less predictive ability.
You might have heard of Pearson's r, more commonly referred to as a correlation coefficient. It's another historical tool that is usually taught separately from other concepts, but it is actually just a special case of regression in which both the outcome and explanatory variables are standardized as z-scores prior to analysis.
Let's see what happens when we transform the two variables we have been working with: `Thumb` and `Height`. Because both variables are transformed into z-scores, the mean of each distribution will be 0, and the standard deviation will be 1. The function `scale()` will convert all the values in a variable to z-scores.
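As a reminder, a z-score expresses each value as its distance from the mean in standard deviation units, which is what `scale()` computes for every value in the variable:

$$ Z_i = \frac{Y_i - \bar{Y}}{s_Y} $$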
# this transforms all Thumb lengths into zscores
studentdata$Thumb_z <- scale(studentdata$Thumb)
# modify this to do the same for Height
studentdata$Height_z <-
Let's make a scatterplot of `Thumb_z` and `Height_z`, and compare it to a scatterplot of `Thumb` and `Height`.
# this makes a scatterplot of the raw scores
# size makes the points bigger or smaller
gf_point(Thumb ~ Height, data = studentdata, size = 4, color = "black")
# modify this to make a scatterplot of the zscores
# feel free to change the colors
gf_point( #FORMULA HERE, data = studentdata, size = 4, color = "firebrick")
Compare the scatterplot of `Thumb` by `Height` (in black) with the scatterplot of `Thumb_z` by `Height_z` (in firebrick). How are they similar? How are they different?
Z-scoring doesn't change the position of datapoints relative to each other. It just changes the unit on which they are measured.
Now fit a model using the z-score version of each variable:
#fit a regression model Thumb_z ~ Height_z
Height_z_model <- lm(Thumb_z ~ Height_z, data = studentdata)
# compare ANOVA tables for each model
supernova(Height_model)
supernova(Height_z_model)
Looking at the PRE scores for both models, the fit of the models is identical. This is because all we have changed is the unit in which we measure the outcome and explanatory variables. Unlike PRE, which is a proportion of the total, the SS are expressed in the units of measurement. So if we converted the mm (for `Thumb` length) and inches (for `Height`) into cm, feet, etc., the SS would change to reflect those new units.
Transforming both outcome and explanatory variables can help us assess the strength of a relationship between two quantitative variables another way, independent of the units on which each variable is measured. The slope of the regression line between the standardized variables is also called the correlation coefficient, or Pearson’s r. Thus we can calculate Pearson’s r by transforming the variables into z-scores, fitting a regression line, and pulling out $b_1$:
Height_z_model$coefficients[2]
It seems like a lot of work to transform variables into z-scores and then fit a regression line. Fortunately, R provides an easy way to directly calculate the correlation coefficient r: the `cor()` function.
# this calculates the correlation of Thumb and Height
cor(studentdata$Thumb, studentdata$Height)
`cor()` is one of those functions that is sensitive to NAs in the data, although it doesn't handle them in the same way as `mean()` or `sd()` does. To ignore NAs, you need to add the `use="pairwise.complete.obs"` name-value argument to tell R to skip over any observations that are missing a value in at least one of the variables. This flag tells R to only use observations that are "pairwise complete", having values (complete) on both variables (pairwise). Note that it's okay if there are NAs in other variables of a dataset for a particular observation, so long as they aren't in the variables we're correlating right now.
height_w_NA <- studentdata$Height
height_w_NA[10] <- NA
#This will output NA, since the 10th item of height_w_NA is missing
cor(studentdata$Thumb, height_w_NA)
#This fixes it by leaving out that observation on both variables
cor(studentdata$Thumb, height_w_NA, use="pairwise.complete.obs")
Here, r = 0.34, meaning the slope of the regression line is 0.34 when both variables are standardized. In other words, an observation that is one standard deviation above the mean on the explanatory variable (x-axis) is predicted to be 0.34 standard deviations above the mean on the outcome variable (y-axis).
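In equation form, because both standardized variables have a mean of 0 (so the standardized regression line passes through the origin), the standardized model's prediction is simply:

$$ \hat{Z}_{Y_i} = r \cdot Z_{X_i} = 0.34 \cdot Z_{X_i} $$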
Remember, the best-fitting regression model for `Thumb` by `Height` was $Y_i = 2.84 + 0.87X_i$. The slope of 0.87 indicates an increment of 0.87 mm of `Thumb` length for every one inch increase in `Height`. We will call this slope the unstandardized slope.
While the slope of the standardized regression line is r, the slope of the unstandardized regression line is not directly related to r. However, there is a way to see the strength of the relationship in an unstandardized regression plot; it’s just not the slope coefficient!
In an unstandardized scatterplot we can “see” the strength of a relationship between two quantitative variables by looking at the closeness of the points to the regression line. Correlation r is directly related to how closely the points adhere to the regression line. If they cluster tightly, it indicates a strong relationship; if they don’t, it indicates a weak or nonexistent one.
Just as z-scores let us compare scores from different distributions, the correlation coefficient gives us a way to compare the strength of a relationship between two variables with that of other relationships between pairs of variables measured in different units.
For example, compare the scatterplots below for `Height` predicting `Thumb` (on the left) and `Pinkie` predicting `Thumb` (on the right).
Correlation measures the tightness of the data around the line, that is, the strength of the linear relationship. Judging from the scatterplots, we would say that the relationship between `Thumb` and `Pinkie` is stronger because the linear Pinkie model would probably produce better predictions. Another way to say that is this: there is less error around the Pinkie regression line.
Try calculating r for each relationship. Are our intuitions confirmed by the correlation coefficients?
# this calculates the correlation of Thumb and Height
cor(studentdata$Thumb, studentdata$Height)
# calculate the correlation of Thumb and Pinkie
The correlation coefficients confirm that pinkie finger length has a stronger relationship with thumb length than height does. This makes sense because the points in the scatterplot of `Thumb` by `Pinkie` are more tightly clustered around the regression line than in the scatterplot of `Thumb` by `Height`.
Just as the slope of a regression line can be positive or negative, so can a correlation coefficient. Correlation r can range from +1 to -1. A correlation coefficient of +1 means that an observation's score on the outcome variable can be perfectly predicted from its score on the explanatory variable, and that the higher the score on one, the higher the score on the other.
A correlation coefficient of -1 means the outcome score is just as predictable, but in the opposite direction. With a negative correlation, the higher the score on the explanatory variable, the lower the score on the outcome variable. The image below shows what scatterplots would look like for a perfect correlation of +1, no correlation, and a perfect correlation of -1.
As you get more familiar with statistics you will be able to look at a scatterplot and guess the correlation between the two variables, without even knowing what the variables are! Here’s a game that lets you practice your guessing. Try it, and see what your best score is!
http://www.rossmanchance.com/applets/GuessCorrelation.html
The unstandardized slope represents steepness of the best-fitting line. Correlation represents the strength of the linear relationship, which is indicated by how close the data points are to the regression line, regardless of slope.
Let's think this through more carefully with the two scatterplots and best-fitting regression lines presented in the table below. For a brief moment, let's give up on `Thumb` and try instead to predict the length of middle and pinkie fingers. On the left, we have tried to explain variation in `Middle` (outcome variable) with `Pinkie` (explanatory variable). On the right, we have simply reversed the two variables, explaining variation in `Pinkie` (outcome) with `Middle` (explanatory variable).
The slopes of these two lines are different. The one on the left is steeper (0.92), the one on the right more gradual in its ascent (0.61). Despite this difference, however, we know that the strength of the relationship, as captured by r, is the same in both cases: they are the same two variables.
The slope difference in this case is due to the fact that middle fingers are a bit longer than pinkies. So, a millimeter difference in `Pinkie` is a bigger deal than a millimeter difference in `Middle`. When the pinkie is a millimeter longer, the middle finger is 0.92 millimeters longer, on average; but when the middle finger is a millimeter longer, the pinkie is only up by 0.61 millimeters on average.
The unstandardized effect estimate (the $b_1$ coefficient) must always be interpreted in context. It will depend on the units of measurement of the two variables, as well as on the meaning that a change in each unit has in the scheme of things. There’s a lot going on in a slope! The advantage of unstandardized slopes is that they keep your theorizing grounded in context, and the predictions you make are in the actual units on which the outcome variable is measured. If you want to stay more grounded in your measures and what they actually mean, then the unstandardized coefficient estimate is useful. Regression models give you predictions you can understand.
However, the advantage of the standardized effect estimate r is that it gives you a sense of the strength of the relationship regardless of the units of the variables. If you want to be able to compare effect estimates to those generated by other models in other studies, use the standardized coefficient or correlation value.
Regression and correlation are powerful tools for modeling relationships between variables. But each must be used thoughtfully, always interpreting the findings in context and using all the other knowledge you have about the context.
Most important to bear in mind is that correlation does not imply causation, something you no doubt have heard before. Just the fact that two variables are correlated does not necessarily mean we understand something about what created either of them. In this sense, regression is no different from correlation.
There are many examples of this: children’s shoe size is correlated with scores on an achievement test; the tendency to wear less clothing is correlated with higher temperatures. In the case of shoe size, we can see that the correlation is spurious; age is a confounding variable, causing increases in both shoe size and achievement. In the case of skimpy clothing, the relationship is real, but the causal direction must be sensibly interpreted. Hiking up the temperature might indeed cause people to shed their clothing. But taking off clothes is not going to cause the temperatures to go up, no matter how many times you try.
Disambiguating causal relationships and controlling for possible confounds is not achievable through statistical analysis alone. Statistics can help, and correlation can certainly provide evidence in support of a causal theory. But research design is a necessary tool. Random assignment or information about the temporal order of effect and outcome are required to figure out whether a particular relationship is causal or not.
Another thing to point out is that the models we have considered in this chapter are linear models. We fit a straight line to a scatter of points, and then look to see how well it fits by measuring residuals around the regression line.
But sometimes a straight line is just not going to be a very good model of the relationship between two variables.
Take this graph from a study of the relationship of body weight to risk of death (from McGee DL, 2005, Ann Epidemiol 15:87 and Adams KF, 2006, N Engl J Med 355:763). Being underweight and being overweight both increase the risk of death, whereas being in the middle reduces that risk.
If you ignored the shape of the relationship and overlaid a straight regression line, the line would probably be close to flat, indicating no relationship. But if you did that you would be missing an important curvilinear relationship. This is like the opposite of "correlation does not imply causation" - causation doesn't necessarily imply linear correlation!
Before fitting a linear regression model, look at the scatterplot and see if a linear association makes sense. If it doesn’t, well... take an advanced stats class that teaches you about different models! Statisticians have lots of models to offer beyond just the simple straight line.
Finally, there is the problem of extrapolation. We have already pointed out from our regression of `Thumb` on `Height` that, according to the model, someone who is 0 inches tall would have a thumb length of 2.84 millimeters. It doesn't make any sense for someone's height to be 0, though. And if we fed the model an even more nonsensical (negative) height, it would return a negative thumb length, which also doesn't make sense. Obviously, the regression model only works within a certain range, and it is risky to extend that range beyond where you have recorded data. In general, common sense and a careful understanding of research methods must be applied to the interpretation of any statistical model.
After reading this chapter, you should be able to: