# Run this first so it's ready by the time you need it
install.packages("dplyr")
install.packages("supernova")
install.packages("ggformula")
library(dplyr)
library(supernova)
library(ggformula)
GSS <- read.csv("https://raw.githubusercontent.com/smburns47/Psyc158/main/GSS.csv")
You have now learned to specify models with single explanatory variables, either categorical or quantitative. We saw how adding variables like this can improve prediction accuracy beyond that of the null model, getting us closer to understanding the underlying data generation process.
It stands to reason that if one explanatory variable can help us, maybe more than one can help us even more. After all, most real-world processes aren't so simple that a single input creates them, and in the models from the previous chapter there was still plenty of unexplained variance to account for. So let's start building even more complex models with multiple variables - multivariable models, also known as multiple regression.
First, we'll introduce a new dataset, GSS. This is a subset of the General Social Survey, a US-wide yearly survey administered by the University of Chicago to assess national demographics and attitudes. We'll check out the variables that are included in this data file:
str(GSS)
Quite a lot! These variables and data values are named descriptively so we can tell what they represent. We can see that there are a number of details about a person (their education, home situation, etc.) as well as what they think about various policy issues (marijuana, transportation, etc.).
Let's say we want to predict how many years of education someone is likely to have attained (highest_year_of_school_completed). Stop and think for a second: what variables are likely to explain variance in the number of years of schooling someone completed?
One factor might be how many years of schooling the person's father had. After all, people's value of and readiness for higher education can be influenced by how much their parents valued or emphasized education while they were growing up.
Another factor might be the number of brothers and sisters someone has. Children in large families sometimes need to spend time helping care for their siblings, which takes time away from school.
Both of these are potential explanatory variables for number of years of schooling. If we were to make a conceptual model for years of school someone completed, we might draw both of these as inputs to the outcome variable highest_year_of_school_completed.
Complete the code below to make two different models, one with each of these variables. For ease of visualizing, we will make a smaller subset of this data with dplyr's sample_n() function, and we will also remove datapoints with NA on any of the three variables.
set.seed(10)
GSS_subset <- sample_n(GSS, 100) #args are dataframe to sample from, and number of observations to sample
#removing NA observations
#does it make sense what this filtering code is doing? Check out chapter 4 for a refresher
GSS_subset <- filter(GSS_subset, !is.na(highest_year_of_school_completed) &
!is.na(highest_year_school_completed_father) &
!is.na(number_of_brothers_and_sisters))
#Make a model that predicts highest years of schooling from father's years of education.
#Look at the variable list above to get exact variable names and use GSS_subset as the data.
father_model <- lm(#YOUR CODE HERE, data = GSS_subset)
#Make a model that predicts highest years of schooling from number of siblings
sibling_model <- lm(#YOUR CODE HERE, data = GSS_subset)
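If you'd like to check your work, one possible completion of the exercise above (using the variable names from GSS_subset) might look like this:
#one possible completion: fit each single-predictor model
father_model <- lm(highest_year_of_school_completed ~ highest_year_school_completed_father, data = GSS_subset)
sibling_model <- lm(highest_year_of_school_completed ~ number_of_brothers_and_sisters, data = GSS_subset)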
#Plotting father model
gf_jitter(highest_year_of_school_completed ~ highest_year_school_completed_father, data=GSS_subset) %>%
gf_lm()
#Plotting sibling_model
gf_jitter(highest_year_of_school_completed ~ number_of_brothers_and_sisters, data=GSS_subset) %>%
gf_lm()
Both single-predictor models appear to explain some of the variation in education; knowing how many years someone's father was in school helps us to make a better prediction of their own time in school, as does knowing how many siblings they have. The best-fitting regression line in these plots is not flat. Neither model, however, explains all the variation in education. There is still plenty of unexplained error.
We could just choose the single-predictor model that works best. Write some code to make the ANOVA tables from these two models to see which one explains more variation in years of education.
# Generate the ANOVA tables for both father_model and sibling_model
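A sketch of that code, assuming father_model and sibling_model were fit as above:
#ANOVA tables for each single-predictor model
supernova(father_model)
supernova(sibling_model)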
The better model would be the one based on father's education. Compared with the empty model, father_model resulted in a PRE (Proportional Reduction in Error) of 0.2897, compared with a PRE of 0.1209 for sibling_model. More error is reduced because the predictions from father_model are more accurate.
But is it possible that we could get an even higher PRE by including both predictors in the model? Another way of asking this question is: could some of the error left over after fitting father_model be further reduced by adding number_of_brothers_and_sisters into the same model? If we knew both how much schooling someone's father had and how many siblings they have, could we make a better prediction of their own education than if we only knew one or the other explanatory variable?
We could represent this idea like this:
$${Education} = C_1{FatherEducation} + C_2{Siblings} + {error}$$
Let's explore this idea with some visualizations. We will start with a graph of father_model, plotting highest_year_of_school_completed by highest_year_school_completed_father. We will then explore some ways we could visualize the effect that number_of_brothers_and_sisters has above and beyond that of highest_year_school_completed_father.
#Plotting father model
gf_jitter(highest_year_of_school_completed ~ highest_year_school_completed_father, data=GSS_subset,
size=4)
Adding more variables to this graph can be tricky. It's two-dimensional, so it only has axes for two variables - one predictor and one outcome. One approach to adding a third variable is to assign different colors to points representing people with different numbers of siblings. You can do this by adding color = ~number_of_brothers_and_sisters to the scatter plot. (Remember that the ~ tilde tells R that color should vary by number_of_brothers_and_sisters.) Try it in the code block below.
#Add in the color argument to the gf_jitter() function
gf_jitter(highest_year_of_school_completed ~ highest_year_school_completed_father, data=GSS_subset,
size = 4) %>% gf_lm()
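For reference, the completed call (adding the color argument described above) would look something like:
#father model scatter plot, with points colored by number of siblings
gf_jitter(highest_year_of_school_completed ~ highest_year_school_completed_father, data=GSS_subset,
          color = ~number_of_brothers_and_sisters, size = 4) %>% gf_lm()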
Adding the regression line makes it easier to see the error (or residuals) left over from father_model. Notice that the lighter-colored dots (people with a lot of siblings) are mostly below the regression line (i.e., they have negative residuals from father_model), while the darker dots (people with fewer siblings) are mostly above the line (positive residuals). This indicates that people with fewer siblings generally have more years of education than what father_model would predict, while people with more siblings generally have fewer.
If that's a little tough to see, another way of plotting this is to calculate the residuals from father_model and make a separate scatter plot with those values to see how they relate to number_of_brothers_and_sisters.
#Save the residuals of father_model to a residual variable
GSS_subset$father_resid <- #YOUR CODE HERE
#plotting these residuals against number of siblings
gf_jitter(father_resid ~ number_of_brothers_and_sisters, data=GSS_subset,
size = 4)
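One way to fill in that residual calculation, assuming father_model was fit as above, is with the resid() function:
#save the residuals of father_model as a new variable in GSS_subset
GSS_subset$father_resid <- resid(father_model)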
In this plot, we can see that father_model tends to undershoot the prediction of education years (positive residual values) when someone has fewer siblings, and to overshoot the prediction (negative residual values) when someone has many siblings.
These patterns are a clue that adding number_of_brothers_and_sisters into father_model will explain additional variation in highest_year_of_school_completed above and beyond that explained by their father's education alone.
We can see from visualizations of the data that a model that includes both highest_year_school_completed_father and number_of_brothers_and_sisters might help us make better predictions of highest_year_of_school_completed than would a model including only one of these variables.
In the case of a one-variable model, we learned how to write an equation that represents this statistical model:
$$ Y_i = b_0 + b_1X_i + e_i $$
This says that there is some parameter, the intercept, to which we can add the value of the predictor variable (multiplied by a coefficient to represent the size of its effect) in order to make predictions about the value of Y. These components of the model are combined linearly - they're added together. That's where the name "general linear model" comes from.
If we want to add in another variable, we essentially want to add in the effect it has for predicting Y. That's what makes the general linear model framework really powerful - we can just add this new variable as a new component in the model! Building on the notation we used for the one-variable model, we will specify the two-variable model like this:
$$ Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i $$
Although it may look more complicated, on closer examination you can see that it is similar to the single-variable model in most ways. $Y_i$ still represents the outcome variable, and $e_i$ at the end still represents each data point's error from the model prediction. And, it still follows the basic structure: DATA = MODEL + ERROR.
Let's unpack the MODEL part of the equation. Whereas previously we had only one X in the model, we now have two ($X_{1i}$ and $X_{2i}$). Each X represents a predictor variable. Because it varies across observations, it has the subscript i. To distinguish one X from the other, we label one with the subscript 1, the other with 2. In our case the first of these will represent highest_year_school_completed_father, the second, number_of_brothers_and_sisters, though which X we assign to which variable doesn't really matter (just make sure you remember which you assigned to which).
Notice, also, that with the additional $X_{2i}$ we also add a new coefficient or parameter estimate: $b_2$. We said before that the empty model is a one-parameter model because we are estimating only one parameter, $b_0$. A single-predictor model (e.g., father_model) is a two-parameter model: it has both $b_0$ and $b_1$. This multivariable model is a three-parameter model: $b_0$, $b_1$, and $b_2$.
Having specified the skeletal structure of the model, we next want to fit the model, which means finding the best fitting parameter estimates (i.e., the values of $b_0$, $b_1$, and $b_2$). By “best fitting” we mean the parameter estimates that reduce error as much as possible around the model predictions.
Although there are several mathematical ways to do this, you can imagine the computer trying every possible combination of three numbers to find the set that results in the lowest Sum of Squares (SS) Error.
To specify the formula to give to R, remember that the formula dictates the kind of relationship all the variables have. Since we're thinking highest_year_of_school_completed varies as a function of both highest_year_school_completed_father and number_of_brothers_and_sisters, our formula looks like:
highest_year_of_school_completed ~ highest_year_school_completed_father + number_of_brothers_and_sisters
Use this formula to fit the model with lm():
# use lm() to find the best fitting coefficients for our multivariate model and output the coefficient values
full_model <- #YOUR CODE HERE
full_model
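If you need it, one possible completion, using the formula written out above:
#fit the multivariable model with both predictors
full_model <- lm(highest_year_of_school_completed ~ highest_year_school_completed_father + number_of_brothers_and_sisters, data = GSS_subset)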
In some ways, this output looks familiar to us. Let's try to figure out what these parameter estimates mean. First off, there are now three coefficients. This makes sense, because we already know to expect a value for $b_0$, $b_1$, and $b_2$. We can also see which variable each coefficient corresponds to, since R labels them for us. Using this output, we can write our best fitting model in GLM notation as:
$$ \hat{Y_i} = 10.3742 + 0.3424X_{1i} - 0.1980X_{2i} $$
We use the parameter estimates to make predictions in the same way as we did before, but this time we adjust our prediction based on two variables: how much education their father received, and the number of siblings they have. Now let's try to understand how the variables in the multivariable model are coded in order to generate predictions.
The equation form of the statistical model generates a prediction of someone's years of education by starting with the intercept ($b_0$, which is 10.3742), then adding 0.3424 years for each year of their father's education, and subtracting 0.1980 years for each sibling they have. This means that for someone whose father received no formal schooling and who has no siblings ($X_{1i}$ = 0 and $X_{2i}$ = 0), we'd predict that they'd get to about the 10th grade. For every year of schooling their father had, we'd predict an additional third of a year; for every additional sibling they had, we'd predict a fifth of a year less.
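To make this concrete, here is a small worked example with hypothetical input values (a father with 12 years of schooling and 2 siblings), assuming full_model was fit with the formula above:
#prediction by hand from the rounded coefficients: roughly 14.1 years
10.3742 + 0.3424*12 - 0.1980*2
#essentially the same prediction, generated directly from the fitted model
predict(full_model, newdata = data.frame(highest_year_school_completed_father = 12,
                                         number_of_brothers_and_sisters = 2))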
How should we talk about the meaning of these parameters, then, to describe these predictions? To build this understanding, let's do some funky visualizations for a second. First, we will save the predictions of full_model to be stored in GSS_subset.
GSS_subset$education_predicted <- predict(full_model)
And now the funky part. We will plot these predictions against highest_year_school_completed_father, but ONLY for datapoints that had 0 or 7 siblings:
subsetsubset <- filter(GSS_subset, number_of_brothers_and_sisters == 0 |
number_of_brothers_and_sisters == 7)
gf_point(education_predicted ~ highest_year_school_completed_father,
data = subsetsubset, color = "red", size = 5)
There's not a lot of data left here, but from what there is, it kind of looks like these points are arranged on two separate diagonal lines. The picture below has these lines superimposed on the graph:
This implies that we can actually write two separate equation models: one for people who have no siblings, and one for people who have 7 siblings. If we start with the fitted full_model:
For someone with no siblings, we'd use 0 as the value of $X_{2i}$, which would look like:
$$ \hat{Y_i} = 10.3742 + 0.3424X_{1i} - 0.1980*0 $$
Multiplying the $b_2$ coefficient by 0 makes it drop out of the equation, so the model for someone with no siblings is:
$$ \hat{Y_i} = 10.3742 + 0.3424X_{1i}$$
For people with 7 siblings, however, the second model term does not drop out:
$$ \hat{Y_i} = 10.3742 + 0.3424X_{1i} - 0.1980*7 $$
-0.1980 times 7 is -1.386, so this equation could also be written as:
$$ \hat{Y_i} = 10.3742 + 0.3424X_{1i} - 1.386 $$
We could combine this value with the intercept (10.3742 - 1.386) to make a model for someone with 7 siblings:
$$ \hat{Y_i} = 8.9882 + 0.3424X_{1i}$$
Both of these equations – one for 0 siblings and the other for 7 siblings – represent straight lines. Both have a slope and an intercept. These two lines have the same slopes (which is why they appear parallel) but different y-intercepts (10.3742 versus 8.9882). These different intercepts were calculated by $b_0 + b_2X_{2i}$, inserting either 0 or 7 as the value of $X_{2i}$. Thus, even though the multivariable model just looks like one long equation, it contains within it separate regression equations for each value of sibling number.
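If you want to verify those two intercepts from the fitted model itself, a quick check (assuming full_model was fit as above) could look like:
#extract the fitted coefficients by name
b <- coef(full_model)
#intercept of the line for 0 siblings, then for 7 siblings
b["(Intercept)"] + b["number_of_brothers_and_sisters"] * 0
b["(Intercept)"] + b["number_of_brothers_and_sisters"] * 7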
How should we interpret $b_1$, then? It is the effect of $X_1$ when $X_2$ is held constant.
Because the order of variables in the multivariable equation doesn't actually matter, we can interpret $b_2$ the same way. It is the effect of $X_2$ when $X_1$ is held constant.
In statistics we often use the phrase "over and above" to describe this. $b_1$ is the effect of $X_{1i}$ over and above the effect of $X_{2i}$. $b_2$ is the effect of $X_2$ over and above the effect of $X_1$.
$b_0$ is still the intercept (predicted outcome value when both $X_1$ and $X_2$ are 0).
Because of the partial prediction nature of these parameters - $b_1$ representing the effect of $X_1$ when $X_2$ is held constant - we call these partial regression coefficients. It only makes sense to interpret them in the context of the other coefficients. Notably, if we were to create a model with a different predictor for $X_2$, or even add a third predictor, these coefficients would be different.
In most respects, concepts developed for the single-predictor models will apply to the multi-predictor models. In all cases, the model generates a predicted value on the outcome variable for each observation in the data frame. Subtracting the model prediction from the observed value will give us a residual, which tells us how far off the model prediction is (positive or negative) for each observation.
If we square and then total up all the residuals we will get the $SS_{error}$ for the model, which gives us a sense of how well the model fits the data. Using this $SS_{error}$, we can then compare the multivariable model to other models, starting with the null model. To assess how well a model fits the data we will continually ask: How much does one model reduce error over another?
To begin to answer this question, let’s start by comparing the sum of squared error from our new model to the error from the null model.
We previously used the supernova() function to generate ANOVA tables that contain the SSs useful for comparing models. In the code block below, add code to generate the supernova() output for full_model.
# generate the ANOVA table for full_model
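One line of code should do it, assuming full_model was fit as above:
supernova(full_model)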
You may notice right away that this ANOVA table has more rows than the one for either father_model or sibling_model that we saw earlier. Don't worry about these new rows for now – just look for SS Total, SS Error, and SS Model; these have the same meaning as in the single-predictor models.
As before, $SS_{total} = SS_{model} + SS_{error}$:
182.376 + 388.286
$SS_{total}$ (the bottom row of the ANOVA table) tells us how much total variation, measured in sum of squares, there is in the outcome variable. You can see that $SS_{total}$ is 570.662. $SS_{total}$ is all about the outcome variable, in this case highest_year_of_school_completed. It is based on squaring and then summing residuals from the null model, where the prediction of everyone's score is the mean of highest_year_of_school_completed. No matter which predictor variables you add to your model, $SS_{total}$, the last row in the ANOVA table, is always the same as long as the outcome variable is the same. The null model of an outcome variable does not depend on any predictor variables.
$SS_{error}$ is the generic name we give to the sum of the squared residuals left over after fitting a complex model (by "complex" we just mean a model that is more complex than the null model). Because $SS_{total} = SS_{model} + SS_{error}$, the lower $SS_{error}$ is, the higher $SS_{model}$ will be, meaning that more of the variation has been explained by the model, which is the same as saying that more of the error has been reduced by the model. We can apply the concepts of $SS_{model}$ and $SS_{error}$ to any model, from those with just a single predictor all the way to those with many predictors.
Looking at PRE in the Model line, it looks like this multivariable model explains about 32% of the variance in highest_year_of_school_completed.
That's more than either of the single-predictor models explained, so it looks like it was a good idea to include both predictors in one model together!
Earlier we fit the coefficients of the full multivariable model as:
$$ \hat{Y_i} = 10.3742 + 0.3424X_{1i} - 0.1980X_{2i} $$
How do the actual values of these coefficients compare to those generated by the one-predictor models?
father_model
sibling_model
This does not give us the same coefficient values as the full multivariable model. father_model has an intercept $b_0$ value of 9.2011, sibling_model has an intercept of 15.095, and the multivariable model calculated 10.3742. There are also differences in the coefficients for the effect of father's education and number of siblings. Why is that, when we're using the same variables to build the model?
This phenomenon happens because of covariance between the two predictor variables. These two variables themselves are related to each other, as we can see in this scatter plot of father's education and number of siblings:
gf_jitter(highest_year_school_completed_father ~ number_of_brothers_and_sisters, data=GSS_subset,
size = 4)
We can also see this by investigating the correlation between the variables:
cor(GSS_subset$highest_year_school_completed_father, GSS_subset$number_of_brothers_and_sisters)
According to this, fathers with less education tend to have more children.
When two predictor variables are related to each other, there is less unique information expressed by these separate variables. If you already know how many years of education someone's father has, you're likely to make better guesses about the number of siblings they have by using that information than if you just guessed randomly.
This means that adding the second variable number_of_brothers_and_sisters to the multivariable model doesn't explain as much additional variance in highest_year_of_school_completed as we might initially think. Given that highest_year_school_completed_father and number_of_brothers_and_sisters have overlapping information, some of the variance in highest_year_of_school_completed that number_of_brothers_and_sisters can explain is already explained by highest_year_school_completed_father. We can see this fact when we create an ANOVA table for the multivariable model:
supernova(full_model)
PRE on the Model line is 0.3196, indicating that the full model explains 31.96% of the variance in highest_year_of_school_completed. Before, we calculated the PRE of father_model alone as 0.2897 and the PRE of sibling_model alone as 0.1209. While 0.3196 is bigger than either of those, it is less than 0.2897 + 0.1209. The two variables don't explain completely separate amounts of variance in the outcome variable.
Venn diagrams are a useful way to help us understand sums of squares for multivariable models. The null model of highest_year_of_school_completed can be represented by a single circle, shown at the left in the figure below. This is the $SS_{total}$. When we add a predictor variable into the model (e.g., highest_year_school_completed_father, as shown on the right), it reduces some of that total error. This reduction in error (i.e., $SS_{model}$) is represented by the overlap between the two circles, shown with horizontal stripes.
Now let's visualize what happens when we add number_of_brothers_and_sisters to the model. Adding this variable reduces more error, beyond that already reduced by highest_year_school_completed_father.
As $SS_{model}$ gets larger, $SS_{error}$ gets smaller. $SS_{model}$, the amount of variation explained by the model (represented as the three regions with stripes), is larger for the multivariable model than for the single-predictor model and $SS_{error}$ is smaller.
In the Venn diagram below, we have labeled the three regions of the area with horizontal stripes as A, B, and C. $SS_{model}$, the error reduced by the multivariable model, is represented by the combined area of regions A, B, and C. Some of the error is uniquely reduced by number_of_brothers_and_sisters (region A), some uniquely reduced by highest_year_school_completed_father (region C), and some reduced by both (region B)!
Region B exists because father's education and number of siblings are related. Parameter estimates in the model thus have to be adjusted to account for this fact and enable correct prediction of Y. The sums of squares must also be adjusted. The $SS_{model}$ for the multivariable model cannot be found by simply adding together the $SS_{model}$ values that result from fitting the two single-predictor models separately (father_model and sibling_model). If you added them, you would be counting region B twice, and thus would overestimate $SS_{model}$ for the multivariable model. For the same reason, we need to look at PRE for the full model to judge how much total variation in highest_year_of_school_completed we can account for with both highest_year_school_completed_father and number_of_brothers_and_sisters together.
Bringing this back to the numbers in our data, look again at the ANOVA table for full_model:
supernova(full_model)
The full variation to be explained in the outcome variable is expressed by $SS_{total}$. How much the full model reduces that error is expressed by $SS_{model}$. But there are SS values for each individual predictor as well. These stand for the striped areas in the Venn diagram for just those predictors. In other words, this value is the error that is explained uniquely by each predictor.
The column for PRE can be broken down similarly. On the Model line, the PRE score of 0.3196 says that the full model explains 31.96% of the variation in highest_year_of_school_completed. Underneath that, there is a PRE score for each predictor. This is the proportion of error uniquely explained by each predictor. In this way, we can see the relative importance of each predictor in the model - father's education uniquely explains 22.6% of the variation, while number of siblings uniquely explains an additional 4.2%. These numbers don't add up to 31.96%. That means whatever is not uniquely explained by an individual predictor is explained jointly by both. In this case, 5.16% of the error is jointly explained by father's education and number of siblings, because 0.3196 - 0.226 - 0.042 = 0.0516.
The multivariable model we built above is constructed with two continuous variables, highest_year_school_completed_father and number_of_brothers_and_sisters. But when using the general linear model, we aren't limited to just this data type. It's flexible that way.
Let's say we instead want to model highest_year_of_school_completed with the variables respondents_sex and born_in_us. Use str() to check out the data types of these two variables in particular:
#resetting GSS_subset, since we deleted some rows earlier
set.seed(10)
GSS_subset <- sample_n(GSS, 100)
#check variable type of respondents_sex
str(GSS_subset$respondents_sex)
#check variable type of born_in_us
str(GSS_subset$born_in_us)
This reveals that they are both character variables instead of numeric. Now use table() to see how many unique levels of each variable there are in GSS_subset:
#table of respondents_sex values
table(GSS_subset$respondents_sex)
#table of born_in_us values
table(GSS_subset$born_in_us)
This tells us that there are two levels to each variable, and that there are no missing values in respondents_sex (51 + 49 = 100) but there is one missing value in born_in_us (14 + 85 = 99). Note that while we manually filtered out missing data from the analysis earlier in this chapter, we don't actually have to do that when using lm() - it'll leave out missing values automatically. It's one of those R functions that can handle NAs. But if you want to save predictions and residuals to the same data frame, it'll only make predictions/residuals for the existing data. This means that the vector produced by predict() or resid() may be shorter than the data frame, and you'll need to filter the data frame to be the same length if you want to save those vectors as variables in the data frame.
Now use the code box below to make a multivariable model with both respondents_sex and born_in_us predicting highest_year_of_school_completed:
#Write some code that will output the results of a model with two categorical predictors
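One possible way to write that model, following the same lm() pattern we've been using:
lm(highest_year_of_school_completed ~ respondents_sex + born_in_us, data = GSS_subset)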
This is still a model with three parameters - $b_0$, $b_1$, and $b_2$ - so the output of the model still gives you three coefficient estimates. You just have to be careful about interpreting the meaning of these coefficients given that you're now using categorical variables as predictors.
As we covered in chapter 11, when you include a categorical variable with two levels in a statistical model, R will automatically recode this variable as 0 and 1, and choose one level to be the reference group. The same happens in the multivariable case, for both predictors. You can predict values of highest_year_of_school_completed by plugging either 0 or 1 into the X values of the equation. Thus the effect of $b_1$ is the change in predicted Y due to a one-unit increase in $X_1$ (going from the reference group to the non-reference group) when holding $X_2$ constant, and the effect of $b_2$ is the change in predicted Y due to a one-unit increase in $X_2$ when holding $X_1$ constant. $b_0$ is the predicted value of Y when both predictor variables are set to 0.
You may have wondered by now why we've only worked with categorical predictor variables that have two levels - Sex in the studentdata dataset, respondents_sex and born_in_us here, etc. What happens when you want to use a categorical predictor variable with more than two levels? R automatically recodes two-level categorical variables into 0 and 1 when using them as predictor variables, so what does it do with more than two levels?
The answer is that R actually makes this case into a multivariable model - even when the formula we use only specifies one input variable. Let's test this out by predicting highest_year_of_school_completed with only the variable race_of_respondent, in order to ask if there are differences in this sample's average education level between different racial groups.
Use table() again with GSS_subset to see how many levels this variable has:
#table of race_of_respondent values
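A sketch of that check, using the same table() pattern as before:
table(GSS_subset$race_of_respondent)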
There are 70 white participants, 20 black participants, and 10 participants of different racial groups that the survey administrators classified as "Other". Thus this variable can take on three values - White, Black, or Other. We can't represent this with just 0s and 1s since there are three unique values... or can we?
Use lm() to predict highest_year_of_school_completed with race_of_respondent and see what happens:
#model predicting education years with race of respondent
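A possible version of that model call:
lm(highest_year_of_school_completed ~ race_of_respondent, data = GSS_subset)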
There's only one input variable, but we still get three parameter estimates!
This is because R still wants to stick with dummy-coding categorical variables - giving levels either the value of 0 or 1. This helps the math work out. But in order to use only 0s and 1s for a variable with three levels, it needs to split this information across two different Boolean variables. Thus this model turns into the form:
$$ Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i $$
Here, $X_1$ is whether or not someone is in the "Other" race category (1 for yes, 0 for no), and $X_2$ is whether or not someone is in the "White" race category (1 for yes, 0 for no). "Black" is being used as the reference group for both, since it is first in the alphabet (and you can verify this by looking at the names of the coefficients output by the model).
In this case, the meaning of $b_1$ is the change in predicted Y due to being in the category "Other" compared to the reference group, and $b_2$ is the change in predicted Y due to being in the category "White" compared to the reference group. $b_0$ is the predicted level of education for someone in the category "Black" (since that means $X_1$ and $X_2$ are both 0, i.e., someone is neither Other nor White).
Again, you don't need to do the dummy-coding yourself - R will automatically set up the model for you this way when you pass it categorical variables as predictors. Just make sure your predictor data types truly are categorical (as the character or factor data type) when you want it to behave this way.
When the model is automatically built out of a multi-level single variable like race_of_respondent, it's not possible for a datapoint to have an $X_1$ of 1 and an $X_2$ of 1. That would mean they are both "Other" and "White", and this variable is set up such that someone can only be one category. But you could imagine allowing participants to select multiple categories for themselves, in which case you could have a value of 1 on multiple categorical variables.
We haven't talked much about ordinal data in a while, but it is a common way of representing variables - Likert scales (e.g., ratings from 1 to 5) are used very frequently in psychology. In this very dataset there are multiple instances of ordinal variables.
For the purposes of regression, they're a bit of a hybrid between categorical predictors and interval predictors. They're ordered, such that a one-unit increase in this variable means you're getting "bigger" or "larger" on the variable, like with an interval variable. But there are only a few response options, so predicting an outcome value for X = 4.3 doesn't make sense when participants only have the option of responding 1, 2, 3, 4, or 5 (i.e., 4.3 is not a real possible input).
So how should you interpret the value of a coefficient when using an ordinal variable as a predictor? The answer is maybe unsatisfactory, but it depends on how you want to interpret it. In other words, do you care about the effect of having a particular value on the scale? Or do you care about the effect of just moving up on the scale, no matter the specific scale value?
In the first case, modeling it as a categorical variable makes the most sense. Let's do this in the case of predicting the number of children someone has based on how happy they say they are, and include a boxplot to help with visualizing:
#number_of_children was saved as a character type because one response option is "8 or more" -
#always check your data types before modeling!
GSS_subset$number_of_children <- as.numeric(GSS_subset$number_of_children)
lm(number_of_children ~ general_happiness, data=GSS_subset)
gf_boxplot(number_of_children ~ general_happiness, data=GSS_subset, color= ~general_happiness)
general_happiness has 3 levels: "Very happy", "Pretty happy", and "Not too happy". Thus when modeling it as a categorical variable, there are 3 parameter estimates - one for the intercept, one for Pretty happy vs. Not too happy, and one for Very happy vs. Not too happy (Not too happy is being used as the reference group).
Each of these parameters lets you see the effect of being in one level of the ordinal variable vs. the reference level. Based on this, it looks like people who are pretty happy and people who are very happy both tend to have fewer children than people who are not too happy. However, doing it this way doesn't preserve any information about the order of the levels. Any of them could be used as a reference group, and it's hard to figure out if there's a general trend upward or downward just based on the coefficients. In other words, there's no statistical comparison between the levels of pretty happy and very happy.
Alternatively, we could recode general_happiness into numeric labels and use that as a single predictor:
#converting general_happiness to new labels, and as numeric
GSS_subset$happiness_num <- as.numeric(recode(GSS_subset$general_happiness, "Not too happy" = "0",
"Pretty happy" = "1",
"Very happy" = "2"))
lm(number_of_children ~ happiness_num, data=GSS_subset)
gf_jitter(number_of_children ~ happiness_num, data=GSS_subset, width=0.2, height=0.2) %>% gf_lm()
Now there are just two coefficients: the intercept $b_0$ and the effect of a one-unit increase on happiness_num, $b_1$. This has the advantage of a more parsimonious model, and allows you to describe how predictions of Y change as you move up the ordinal scale. However, this means the effect is interpreted to be the same between each level of the predictor, and it enables you to make predictions for values between possible responses (e.g., X = 1.5), which doesn't make sense. When modeled this way, it looks like there isn't much of an effect at all of happiness on how many children people have.
There's also a difference in proportional reduction of error between these different model versions:
categ_model <- lm(number_of_children ~ general_happiness, data=GSS_subset)
continuous_model <- lm(number_of_children ~ happiness_num, data=GSS_subset)
supernova(categ_model)
supernova(continuous_model)
This implies that maybe the effect of happiness on number of children isn't consistent between levels, and we should treat this variable as categorical. But maybe you have other reasons for modeling it as continuous anyway (e.g., you have a different dataset where someone could rate their happiness level on a continuous scale between 1 and 100 and you want to know how this model would work in that dataset).
Whichever method you choose for ordinal predictors will depend on how you want to use your statistical model. But it will have consequences for how you interpret the coefficients and how well you make predictions, so think deeply about your reasons before you start modeling. In general, analysts tend to treat ordinal variables with 5+ levels as continuous, and ordinal variables with 4 or fewer levels as categorical. But there are also exceptions, and this is something that most analysts don't actually put a lot of thought into. Try to be better than that!
Due to the flexibility of the general linear model, it is also possible to combine categorical and continuous predictors into one model. Let's predict highest_year_of_school_completed with highest_year_school_completed_father and born_in_us:
lm(highest_year_of_school_completed ~ highest_year_school_completed_father + born_in_us, data = GSS_subset)
We still get fitted parameter estimates, just as we have been doing all throughout this chapter. In this case, remember to interpret them in the context of each other's variable types. $b_1$ is the change in prediction of someone's years of education due to a one-unit increase in their father's education level, over and above the effect of being born in the US. $b_2$ is the change in prediction of someone's years of education due to being born in the US vs. not, over and above the effect of father's education. $b_0$ is someone's expected education when both $X_1=0$ (their father had no years of formal education) and $X_2=0$ (they were not born in the US).
Since the general linear model allows us to account for an additional predictor simply by adding another term, it certainly allows us to add even more than two.
When prediction is our main goal, adding as many predictors as possible can be helpful. In general, the more parameters we add to a model, the less leftover error there is after subtracting out the model. Because we have said, many times, that the goal of the statistician is to reduce error, this seems like a good thing. And it is, but only to a point.
Let's do a little thought experiment. You know already that the three-parameter model full_model explained more variation than either of the two-parameter models father_model or sibling_model. A model with four parameters would explain more than the three-parameter model. And so on. What would happen if we kept adding variables until there were as many parameters as datapoints?
In this case, the model error would be reduced to 0. Why? Because each person would have their own parameter in the model. If each person had their own parameter, then the predicted score for that person would just be the person’s actual score. And there would be no residual between the predicted and actual score. The problem with this is that even though the model fits our current data perfectly, it would not fit if we were to choose another sample. Getting better at making in-sample predictions, but getting worse at making out-of-sample predictions, is called overfitting.
Although we can improve model fit by adding parameters to a model, there is always a trade-off involved between reducing error (by adding more parameters to a model) on one hand, and increasing the intelligibility, simplicity, and elegance of a model on the other.
This is a limitation of PRE as a measure of our success. If we get a PRE of .40, for example, that would be quite an accomplishment if we had only added a single parameter to the model. But if we had achieved that level by adding 30 parameters to the model, it’s just not as impressive.
There is a quote attributed to Einstein that sums up things pretty well: “Everything should be made as simple as possible, but not simpler.” A certain amount of complexity is required in our models just because of complexity in the world. But if we can simplify our model so as to help us make sense of complexity, and make predictions that are “good enough,” that is a good thing.
When one's goal is to make accurate predictions about an outcome variable, adding multiple predictors to a regression equation buys us more variance explained and reduces error. In this case it doesn't really matter how much variance each predictor individually accounts for; you're just interested in the explanatory power of the entire model. This is a common goal in data applications like business analytics or machine learning.
But as psychologists, often we are curious about specific predictors and how they relate to an outcome variable. For example, we may want to know how one's parents' education level relates to a person's own education level. This doesn't imply that we think only parental education influences own education - we can recognize the whole system is complex. But we want to characterize that particular relationship and understand that specific part of the data generation process.
Under this goal, multiple regression is still used, but for different reasons. In the prediction use case, you add as many predictors as you reasonably can to the model to try to maximize both your in-sample and out-of-sample prediction accuracy. In the use case of understanding specific predictors, you want to make sure that there aren't other variables that better explain any association you find between your predictor of interest and your outcome. As we saw previously, predictors that are correlated with each other share some variance in the outcome variable. So if you think some variable X is associated with an outcome Y, you want to make sure this association isn't just because of some other variable C that causes both X and Y. In this case, C is what's known as a confound - a variable not of interest, but which is a part of the data generation process and influences how you can interpret the effect of X on Y.
In ideal situations, we can deal with this via our research methodology. Say we want to know whether father's education is associated with own education, but we're worried that a third variable, mother's education, might better explain both. Maybe father's education is only related to own education because mothers of a certain education level tend to marry partners of the same level, but mothers have a stronger influence on their child's education than fathers do. To best deal with this, we'd want to randomly assign mothers of different education levels to families in order to break the association between mothers' and fathers' education, allowing both to explain unique amounts of variance in the outcome variable of one's own education.
But it's not always possible or ethical to do random assignment. In that case, our backup plan is to use statistical control. This is the process of including possible confound variables in the regression model so that they have a chance to explain some of the variance in the outcome variable. This way, you can interpret the coefficient of the variable you actually care about as its unique contribution to the outcome variable, controlling for the influence of the confound.
Consider this in the graph below. In the left panel, it looks like there is an association between X and Y. Color of the dots represents C. We suspect that C is correlated with both X and Y and might cause both of them. Thus before we can be sure there truly is an association between X and Y, we need to control for C (look at the association between X and Y at each level of C separately; when C is held constant). The colored lines in the middle panel show what the regression line between X and Y looks like at each level of C. In this hypothetical example, when taking C into account, the relationship between X and Y disappears. They only seemed related at first because C was the cause of both of them.
Let's try this out ourselves in GSS_subset. First, we'll take a look again at the results of a model just predicting own education with father's education:
supernova(father_model)
It looks like father's education explains 34.48% of the variance in own education, such that for every 1-year increase in father's education, there is an expected 0.3882-year increase in own education. But could mother's education explain this association? First we can check whether it is related to father's education and to a participant's own education, each separately:
cor(GSS_subset$highest_year_school_completed_father, GSS_subset$highest_year_school_completed_mother,
use="complete.obs") #dealing with NAs
cor(GSS_subset$highest_year_of_school_completed, GSS_subset$highest_year_school_completed_mother,
use="complete.obs")
Mother's education is quite correlated with both, so there's a real possibility that it actually accounts for the relationship between father's education and own education! Perhaps the variance that father's education explains is actually explained by mother's education. So, let's build a model to examine the effect of father's education, controlling for mother's education:
bothparents_model <- lm(highest_year_of_school_completed ~ highest_year_school_completed_father +
highest_year_school_completed_mother, data = GSS_subset)
bothparents_model
supernova(bothparents_model)
The coefficient for the effect of father's education decreased in this multivariable model compared to father_model, because mother's education explained some of the variance father's education was previously explaining. But there's still plenty of unique variance accounted for by father's education, so it has its own unique effect! In fact, it looks like mother's education has very little of its own unique effect - the SS and PRE scores for mother's education are smaller than those for father's education. Most of the reason mother's education is correlated with the respondent's education is because it is also correlated with father's education. Visualized as a Venn diagram, this situation might look something like this:
Statistical control is a powerful tool, but with great power comes great responsibility. There are a couple of ways adding control variables without much thought can actually create problems for your interpretation of a particular variable of interest. At the end of chapter 14 we'll cover how to choose the best model responsibly.
Just as t-tests were traditionally used to do things similar to comparing the null model and a two-group model, another historical class of statistical tests is called the ANOVA. This is actually where the name of the ANOVA table we've been using comes from.
ANOVAs work by comparing the amount of variation within a subgroup of data to the amount of variance between different subgroups of data. In the example we worked with earlier for predicting highest_year_of_school_completed with race_of_respondent, an ANOVA test would compare the differences in education among people within each racial group to the differences in education among people of different racial groups. It then asks how statistically likely it is that data of different groups actually come from the same distribution. These are typically done with categorical predictor variables, but some versions include a mixed-predictor approach. Below is a summary of two common types of ANOVA:
One-way ANOVA: this type of test looks at values of an outcome variable within and between different levels of one explanatory variable. This can be applied to a predictor variable with two groups, three groups, eight groups, etc. An example would be investigating if there are GPA differences among 3 different majors (chemistry, psychology, and music).
ANCOVA: This stands for ANalysis of COVAriance. The foundation of this test is a one-way ANOVA, but a continuous control variable is included (e.g., differences in GPA among different majors, controlling for how often students study).
In Chapter 19 we will learn how to use these tools as well. For now, realize that we can use a version of the general linear model instead. For a one-way ANOVA, we can fit a regression with the formula GPA ~ major and let R create dichotomous variables for each level of the variable. This would make a statistical equation of the form $Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + e_i$, where $X_1$ is whether or not someone is a psychology major, $X_2$ is whether or not someone is a music major, $b_0$ is the mean of chemistry majors, $b_1$ is the difference in means between chemistry majors and psych majors, and $b_2$ is the difference in means between chemistry majors and music majors.
For an ANCOVA, we can fit the same formula above but with studying as an additional predictor variable: GPA ~ major + studytime. This would make a statistical equation of the form $Y_i = b_0 + b_1X_{1i} + b_2X_{2i} + b_3X_{3i} + e_i$, where the additional $X_3$ is how much time someone studies and $b_3$ is the increase in GPA due to a 1-unit increase in study time, over and above the effect of major.
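To see this in action, here is a minimal sketch using simulated data - the GPA, major, and studytime variables here are hypothetical, not part of GSS - showing that lm() handles the dummy coding of a three-level predictor for us:
#simulate a small hypothetical dataset of majors, study time, and GPAs
set.seed(1)
toy <- data.frame(major = rep(c("chemistry", "psychology", "music"), each = 20),
                  studytime = runif(60, 0, 20),
                  GPA = round(runif(60, 2.0, 4.0), 2))
#one-way ANOVA as a regression: an intercept plus two dummy-coded coefficients
lm(GPA ~ major, data = toy)
#ANCOVA as a regression: add the continuous control variable
lm(GPA ~ major + studytime, data = toy)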
ANOVAs were developed first in the history of statistics and they are what most researchers learn (and use in publications). But the general linear model is more flexible, so you should learn to use it first.
After reading this chapter, you should be able to: