# Run this first so it's ready by the time you need it
install.packages("supernova")
install.packages("dplyr")
install.packages("ggformula")
library(supernova)
library(dplyr)
library(ggformula)
GSS <- read.csv("https://raw.githubusercontent.com/smburns47/Psyc158/main/GSS.csv")
#smaller dataset, more similar in size to most psychology studies
set.seed(10)
GSS_subset <- slice_sample(GSS, n=100)
In the previous chapter we learned how to use the sampling distribution of $\beta$ to test the null hypothesis. Using permutation testing, we generated a sampling distribution for a world in which $\beta = 0$ in the data generation process. We used the sampling distribution to calculate the p-value: the probability that the sample b, or a b more extreme than the sample, would have occurred just by chance if the null hypothesis were true. Based on the p-value, and the decision criterion we had set (i.e., $\alpha$ of .05), we decided whether to reject the null hypothesis or not.
Decisions of statistical significance are made about one statistical estimate at a time. This makes them a common tool for psychologists who are interested in a specific part of the data generation process: to answer such a question, they build a statistical model that includes the effect of interest, calculate a p-value for it, and interpret the p-value of that effect in particular.
But that's not the only type of research question you might have. Perhaps instead of one particular variable, you're interested in whether a set of variables together is helpful for making predictions about an outcome. Or maybe you want to find out whether a moderation model is a significantly better model of the data generation process than a main-effects model. In other words, you want to know if your entire model is significant, not just any one predictor.
As it turns out, the sampling distribution of $\beta$ is just one of many sampling distributions we could construct. Using the same approach we developed for one model coefficient, we could make a sampling distribution of any statistic that we can calculate. That includes estimates of whole model error such as PRE. In this chapter, we will extend our learning about hypothesis testing to help us make comparisons between different models.
At this point, we're going to switch over to using a different name for PRE that is used more commonly - $R^2$. "PRE" is the name used in the supernova() package we relied on for learning about error in models, but the concept of what proportion of error a model explains is more commonly discussed and written about as $R^2$.
Using summary() on a model object, we can skip the ANOVA table entirely and find this value:
model_obj <- lm(highest_year_of_school_completed ~ highest_year_school_completed_mother +
highest_year_school_completed_father, data = GSS_subset)
summary(model_obj)
On the second-to-last line of this output is an entry called "Multiple R-squared: 0.2857". This is the $R^2$ value of the regression model (aka, what we previously knew as PRE). It is also possible to get this value directly by saving the summary() output to its own object, and then finding the $r.squared property of the summary object.
model_summary <- summary(model_obj)
model_summary$r.squared
We can easily verify that this is the same value as the PRE reported in the ANOVA table for the full model:
supernova(model_obj)
As we discussed in the introduction of this chapter, sometimes we have research questions about the performance of an entire model instead of just one predictor. This could be because we want to know how good our entire idea about the data generation process is, with many pieces included and combined in specific ways. Such an approach is frequently used in areas like computational biology or cognitive psychology, where researchers have developed nuanced models of how the human mind processes information (e.g., the drift diffusion model of decision-making). Other times we don't really care about the components of the model themselves, but we want to know if our model makes better predictions than something else. This approach is more common in business analytics, where company budgets are on the line and your job is to justify whether certain company decisions will make meaningful improvements to revenue, customer reach, etc. For example, Amazon greatly boosted their revenue and reduced losses when they started to use predictive modeling to anticipate customer demand.
When assessing statistical models as a whole, rather than individual predictors, the statistic we care about is PRE, aka $R^2$. We don't care so much how strong any one effect is, but altogether we want a model that explains at least some proportion of the variation in an outcome variable. In a world where this model is not helpful, it wouldn't explain any variance - the population $R^2$ value would be 0. Thus, for significance testing entire models, we need to examine the null sampling distribution of $R^2 = 0$.
In order to construct a null sampling distribution of $R^2$, we first need to understand what would cause a model to have $R^2 = 0$. The $R^2$ value reflects the proportion of variation in an outcome variable that a full model explains, relative to the null model. This score can range from 0 to 1. With $R^2 = 1$, there would be no error left in the model - the set of explanatory variables the model uses results in perfect predictions about the outcome values.
The opposite, then, is when $R^2 = 0$. This would mean the full model explains no additional variation relative to the null model. It performs exactly how the null model would for making predictions. What would turn a full model into a null model?
Recall the equation for the null model:
$$ \hat{Y}_i = b_0 $$
The only parameter in the model is $b_0$. For a full multivariable model with two predictors, the equation would look like so:
$$ \hat{Y}_i = b_0 + b_1X_{1i} + b_2X_{2i} $$
In order to make this match the equation of the empty model, we need to cancel out the extra terms. The situation that would create this is if both $b_1$ and $b_2$ were 0.
$$ \hat{Y}_i = b_0 + (0)X_{1i} + (0)X_{2i} = b_0 $$
The null hypothesis for a full model is thus about more than any one parameter. It is the hypothesis that all predictor coefficients are equal to 0. You could write this as:
$$ H_0: \beta_1 = \beta_2 = 0 $$
Or, since the consequence of all coefficients being 0 is that no additional variation is explained beyond the empty model, you can write the null hypothesis as:
$$ H_0: R^2 = 0 $$
To create the null sampling distribution of one coefficient, last chapter we used permutation testing to shuffle the order of one predictor variable. That broke the relationship between the predictor and the outcome, making the true value of that predictor's coefficient equal to 0. We can use permutation testing for our full model situation as well. Since our null hypothesis is that all predictor coefficients = 0, we simply shuffle every predictor and fit a model with those shuffled predictors. Then we estimate the $R^2$ value of that permuted model. Saving this value and repeating many times will let us build a null sampling distribution of $R^2$.
#creating an empty vector of 1000 spots
null_R2 <- vector(length=1000)
#generate 1000 unique samples, saving each R2
for (i in 1:1000) {
GSS_subset$shuffled_mother <- sample(x=GSS_subset$highest_year_school_completed_mother,
size=length(GSS_subset$highest_year_school_completed_mother),
replace=FALSE)
GSS_subset$shuffled_father <- sample(x=GSS_subset$highest_year_school_completed_father,
size=length(GSS_subset$highest_year_school_completed_father),
replace=FALSE)
model <- lm(highest_year_of_school_completed ~ shuffled_mother + shuffled_father, data=GSS_subset)
null_R2[i] <- summary(model)$r.squared
}
R2_df <- data.frame(null_R2)
gf_histogram( ~ null_R2, data=R2_df)
Walk yourself through every line of this permutation test code, and make sure you understand what it's doing and why. We're shuffling both of the predictors of mother's and father's education, using those shuffled predictors to predict the participants' education level, saving the model $R^2$, and repeating that process 1,000 times.
Then, investigate the shape of the $R^2$ null sampling distribution. Interestingly, it has a very different shape than the sampling distribution of $\beta$. In the figure below we put two example sampling distributions side by side for purposes of comparison.
The sampling distribution of $\beta$ has two tails because the effect of a predictor could be positive or could be negative. $R^2$ is different: the full model can explain none of the error compared to the empty model (0), or up to all of the error from the empty model (1). But it cannot explain less than 0 error. Because $R^2$ is a proportion, it has a hard lower bound of 0, and a hard upper bound of 1.
Assuming the null hypothesis is true, the only place an extreme $R^2$ could fall is in the upper tail of the distribution, which is why there is only one tail. An extreme positive effect of parental education and an extreme negative effect of parental education look the same to $R^2$: both would make a helpful model, and both would fall in the upper tail of the sampling distribution of $R^2$.
The shape of this sort of one-tailed distribution is called the F distribution. Much in the same way original statisticians had to build a mathematical formula to represent the theoretical shape of a coefficient's sampling distribution (the t distribution), they also built one for model performance.
The shape of an F distribution is controlled by two values: the number of parameters in a model (k) and the degrees of freedom of a model (N-k). k controls where the peak of the distribution is horizontally, while the ratio of k to df controls what proportion of the area is in the major peak versus the tail.
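To get a feel for how these two values change the shape, here is a minimal sketch that draws a few theoretical F curves using base R's df() density function. The df1 and df2 values below are just illustrative choices, not tied to our GSS models.
#sketch: theoretical F densities for a few illustrative df combinations
curve(df(x, df1 = 2, df2 = 97), from = 0, to = 5, xlab = "F", ylab = "density")
curve(df(x, df1 = 2, df2 = 10), from = 0, to = 5, add = TRUE, lty = 2)
curve(df(x, df1 = 5, df2 = 97), from = 0, to = 5, add = TRUE, lty = 3)
Notice how changing the degrees of freedom shifts where the bulk of the curve sits and how heavy the upper tail is.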
The width of the null sampling distribution for a b estimate depends on the sample size N, such that smaller N's lead to wider distributions. This is because the estimates from smaller sample sizes are less stable, so larger estimates are considered more likely. The same principle applies in the null sampling distribution of $R^2$, but it's not just the sample size that determines this. It's the degrees of freedom, N-k. A model could be fit in a large dataset, but if a large number of parameters are also fit, the $R^2$ estimate will be less stable than that of a model fit with just a couple of parameters. This is because, with many parameters, there are a lot more ways the set of bs can vary! With more ways the bs can vary, a wider range of $R^2$ values is likely.
Once we have the null sampling distribution of $R^2$, we can compare our estimate of $R^2$ to it. In this comparison, we ask ourselves - if $R^2 = 0$ for this model in the population, is it likely or unlikely to draw a sample with an $R^2$ estimate such as ours?
In the null distribution of $\beta$, we considered something "unlikely" if it was outside the area where 95% of b estimates would fall. In other words, an unlikely b estimate is one that is at least 1.96 standard errors above or below $\beta = 0$.
#creating an empty vector of 1000 spots
null_b1 <- vector(length=1000)
#generate 1000 unique samples, saving each b1
for (i in 1:1000) {
GSS_subset$shuffled_mother <- sample(x=GSS_subset$highest_year_school_completed_mother,
size=length(GSS_subset$highest_year_school_completed_mother),
replace=FALSE)
model <- lm(highest_year_of_school_completed ~ shuffled_mother, data=GSS_subset)
null_b1[i] <- model$coefficients[[2]]
}
b1_df <- data.frame(null_b1)
#cut-off values for extremeness
b1s_sd <- sd(b1_df$null_b1)
high_cutoff <- b1s_sd * 1.96
low_cutoff <- -b1s_sd * 1.96
#marking something as extreme if it is greater than 1.96*sd or less than -1.96*sd
b1_df$extreme <- b1_df$null_b1 > high_cutoff | b1_df$null_b1 < low_cutoff
gf_histogram(~ null_b1, data = b1_df, fill = ~extreme)
There are two "zones" of unlikeliness in this distribution, because a b estimate can either be greater than 0 or less than 0. 2.5% of the distribution is in each tail, adding up to the 5% of bs that count as unlikely. To check whether a particular estimate of b is unlikely (and thus significantly different from 0), we'd ask if it's larger than the 97.5%ile OR less than the 2.5%ile of the distribution. This is a two-tailed test of significance.
simple_model <- lm(highest_year_of_school_completed ~ highest_year_school_completed_mother, data=GSS_subset)
b1 <- simple_model$coefficients[[2]]
quantile(null_b1, 0.025)
quantile(null_b1, 0.975)
b1
b1 > quantile(null_b1, 0.975) | b1 < quantile(null_b1, 0.025)
In the sampling distribution of $R^2$, an estimate can only be larger than 0, never less. There is one tail to the distribution. This means that if we were to use the same $\alpha$ criterion for significance as before, $\alpha = 0.05$, the 5% unlikely estimates will all be in the high tail. This is a one-tailed test of significance. Based on this fact, a significant $R^2$ is one that is larger than the 95%ile of the null sampling distribution.
full_model <- lm(highest_year_of_school_completed ~ highest_year_school_completed_mother +
highest_year_school_completed_father, data = GSS_subset)
R2 <- summary(full_model)$r.squared
quantile(null_R2, 0.95)
R2
R2 > quantile(null_R2, 0.95)
Remember that a p-value is the probability of getting an estimate as extreme or more extreme than the one we got, if the population parameter were 0. To extract the p-value for a model's $R^2$ score, we look at the very last line of the summary() output:
summary(model_obj)
Don't be distracted by the p-values for each separate predictor. Right now we're not interested in how strong each of those effects is individually, just how well they combine to make predictions.
We now know how to find the model's $R^2$ value in this output: it is the number labeled "Multiple R-squared" on the second-to-last line. For the p-value of that estimate, look below it at the last line of the output. There you'll see a number labeled "F-statistic" that tells you how extreme this $R^2$ would fall on the null F distribution, along with the two values that determine the shape of the F distribution: the number of predictors (2; the intercept isn't counted here) and the degrees of freedom (63). Next to that is a number labeled "p-value." That is the p-value for the entire model.
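If you'd rather work with the model p-value as a number instead of reading it off the printout, one option is to pull the F statistic out of the summary object and hand it to R's pf() function. A minimal sketch:
#sketch: extracting the F statistic and recomputing the whole-model p-value
fstat <- summary(model_obj)$fstatistic    #holds the F value plus its two degrees-of-freedom numbers
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)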
Notice that the p-value in our example is still significant (< 0.05) even though one of the predictors in the model is not. The model as a whole is likely to make better predictions than the null model, even if not all the predictors significantly contribute to those predictions.
We can use the model p-value to decide whether the observed sample $R^2$ would be unlikely or not if the null hypothesis is true. If we judge it to be unlikely (i.e., p < .05), then we would most likely decide to reject the null model in favor of this more complex model. But if the p-value is greater than .05, we would probably decide to stick with the empty model for now, pending further evidence. It's more parsimonious, and an insignificant model would be adding complexity without enough of an improvement in predictions to make that complexity worth it.
Testing the significance of a model's $R^2$ estimate lets you test whether or not the model as a whole is likely to perform similarly to the null model - whether or not this $R^2$ estimate likely comes from a world where all the beta estimates are truly 0. It's a test of whether or not we are confident that this model will perform better than nothing.
But that's not the only kind of question we might have about a model. Perhaps instead of a comparison to the null model, we want to make comparisons between different complex models. Consider a situation where we're using the data in the GSS to help us plan a study of our own that will explore factors related to educational attainment. We have a limited budget, though, so we want to know whether it helps to run a longer survey that measures multiple predictors (e.g., father's education and respondent's sex), or whether just one of those in a shorter survey would work just as well. I.e., does a model with both father's education and respondent's sex as predictors perform better than a model with only one of them?
Or, in another scenario, say we're curious about whether one predictor is significantly better than another at making predictions on its own. For example, maybe father's education is a better predictor than mother's education. If we use these in two separate models, which model will be better?
This process of choosing a candidate model among a set of options is called model selection. What process to use for model selection is actually a huge question in statistics without objective answers. This is because it depends on what you mean by a "better model." Is it one that results in less prediction error, or one that is closer to the true data generation process? These different goals can give you different answers for which model to choose, especially in a complex science where measurement error is present and thus a simpler model may propagate less error even if it's farther from the truth than a more complex model.
When your goal is inference, you should first, before testing any data, try to specify a minimal set of candidate models based on theory and interpretability. Choose predictor sets that make sense given what is known about your research area, and consider which variables are measurable and thus useful for making predictions. This is because the more candidate models there are to evaluate, the lower the probability that you will find the right one in a single research study, thanks to sampling error.
When your goal is prediction, considering a large initial set of candidate models (e.g., all possible combinations of the collected variables) is justifiable, since all you care about is minimizing model error, not finding the truth of the data generation process. But you need a lot of data to be able to make confident evaluations about the differences between all the models in these cases.
Even within these goals, there are different model selection tools, built on different assumptions about the data and the underlying data generation process, for judging which model has better performance. In the simplest case, we might just compare measures of explained error between two models. Try this below by first fitting a model predicting respondent's education from just father's education, and then one from both father's education and respondent's sex.
#Fit a linear model predicting highest_year_of_school_completed with just highest_year_school_completed_father
#in GSS_subset
father_model <- #YOUR CODE HERE
summary(father_model)
#Fit a linear model predicting highest_year_of_school_completed with highest_year_school_completed_father
# and respondents_sex
full_model <- #YOUR CODE HERE
summary(full_model)
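If you want to check your work, here is one way those two chunks could be filled in (a sketch; it assumes the respondent sex column in GSS_subset is named respondents_sex, as in the comment above):
#one possible completion of the exercise above
father_model <- lm(highest_year_of_school_completed ~ highest_year_school_completed_father,
                   data = GSS_subset)
full_model <- lm(highest_year_of_school_completed ~ highest_year_school_completed_father +
                   respondents_sex, data = GSS_subset)
summary(father_model)$r.squared
summary(full_model)$r.squared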
The $R^2$ estimate for father_model is 0.2897, while the $R^2$ estimate for full_model is 0.3352. Both models are themselves significantly better than the null model, so to answer our question about which one explains more variation overall, let's look at the difference between these values. Just from the raw numbers, it looks like the full model explains more variation in respondent education than the sub model. The difference in these $R^2$ estimates, $\Delta R^2$, is 0.0455 in favor of the full model.
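Once father_model and full_model are fit, you can compute this difference directly from the two summaries:
#difference in R-squared between the two models
summary(full_model)$r.squared - summary(father_model)$r.squared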
However, we know that models with more parameters are going to explain more variation in the dataset the model was fit in, but they might not generalize well to other datasets or the population (the overfitting issue). Is this $\Delta R^2$ present because the full model explains more variation in the population, or is it just the product of overfitting?
Instead of comparing model $R^2$ values directly, we should compare a metric that takes potential overfitting into account and penalizes the score of a model with more parameters. That's what the "Adjusted R-squared" number is in the summary() output of a model. $R^2$/PRE, as we have covered previously, is the proportion of total variation that a model explains, as measured with sums of squares:
$$ R^2 = \frac{SS_{total} - SS_{error}}{SS_{total}} $$
Mean squared error is an alternative way of representing error in a model that divides SS by the degrees of freedom in order to get an "average" squared error per observation.
$$ MSE = SS/df $$
Since SS is divided by degrees of freedom here (N-k), this effectively penalizes a model for having more parameters; SS divided by a smaller degrees of freedom results in a bigger error measurement. Thus, if we look at the proportion of MSE that a model explains instead of SS, we get the equation for $adj R^2$ that takes degrees of freedom (and thus the number of parameters) into account:
$$ adj R^2 = \frac{MSE_{total} - MSE_{error}}{MSE_{total}} $$
We can also get this value directly as a property of a model object summary:
#R-squared
model_summary$r.squared
#adjusted R-squared
model_summary$adj.r.squared
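To see where the adjusted number comes from, here is a minimal sketch that recomputes it by hand for model_obj, following the MSE version of the formula above (the intermediate variable names are just for illustration):
#sketch: recomputing adjusted R-squared by hand for model_obj
n <- nrow(model.frame(model_obj))        #rows actually used in the fit
k <- length(coef(model_obj))             #number of parameters, including the intercept
outcome <- model.frame(model_obj)[[1]]   #the outcome variable the model was fit to
SS_total <- sum((outcome - mean(outcome))^2)
SS_error <- sum(resid(model_obj)^2)
MSE_total <- SS_total / (n - 1)
MSE_error <- SS_error / (n - k)
(MSE_total - MSE_error) / MSE_total      #should match model_summary$adj.r.squared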
To compare model performance within a dataset then, we can look at the adj. $R^2$ values.
Yet both of these $adjR^2$ values are point estimates, drawn from a sampling distribution of possible $adjR^2$ values. We know not every sample will have these exact $adjR^2$ values. Is it possible that the $adjR^2$ for a full model is larger than the $adjR^2$ for a sub model just by chance? I.e., is the true $\Delta adjR^2$ in the population actually equal to 0?
We can develop a null hypothesis about any statistical estimate we want. In this case, the estimate we care about is not the $R^2$ of one model, but whether one is significantly larger than the other. Our null hypothesis is:
$$ H_0: adjR^2_1 - adjR^2_2 = \Delta adjR^2 = 0 $$
But in this case we can't necessarily use the F distribution to calculate a p-value for the model difference, because $\Delta adjR^2$ might be positive or negative - model 1 can be better than or worse than model 2. The shape of this distribution also varies depending on many factors, such as the difference in the number of predictors in each model and whether the models are nested or non-nested (meaning, whether the set of predictors in model A is entirely a subset of the predictors in model B, or whether each model has unique predictors).
This is where the exact methods of model selection can get tricky, so we won't go into the math behind them in this intro class. Instead, we can use the very convenient test_performance() function in the performance package to run a model comparison significance test. It will pick the correct method automatically depending on features of the models.
install.packages('performance')
library(performance)
test_performance(father_model, full_model)
This function outputs a table comparing every model to the first model given in the arguments list (you can enter as many models to compare as you want). In this case, that is father_model. For our purposes, look at the 'LR' column and the 'p(LR)' column. LR stands for "likelihood ratio," a statistic computed for the model comparison. It is somewhat like an F statistic comparing a predictor model to the null model (but it doesn't use the F distribution). If this number is negative, the first model in the pairwise comparison has better performance; if it is positive, the second model has better performance. p(LR) gives the p-value of this comparison, telling us whether either model is significantly better than the other.
The model selection exercise we did above highlights the importance of statistically comparing models, and not just looking at the difference in point estimates. Each model's $R^2$ is an estimate that varies sample to sample, so any difference between them also varies sample to sample and may be 0 in the population.
Sometimes you will see researchers do this incorrectly. There is published research out there where the analysts chose a model based on which point estimate was larger, but not by testing the difference in point estimates against a null distribution.
Another common comparison mistake is to say that, because one model was significant and another was not, therefore the models are significantly different from each other. To see why this is wrong, imagine a situation where Model A's p-value is 0.035 and Model B's p-value is 0.074. Model A is significant while Model B is not. It seems that would mean Model A is automatically better, right? But what if we told you the $adjR^2$ value of Model A was 0.22 and the $adjR^2$ value of Model B was 0.17? Then $\Delta adj R^2$ is 0.05. We've seen examples already where $\Delta adj R^2$ values like this may occur even when the population $\Delta adj R^2$ is actually 0. Thus, just because Model A's $adjR^2$ is significantly different from 0 and Model B's is not does not mean that either model is significantly better than the other. It could be that Model B had slightly too few data points to find a confidence interval of its coefficients that does not include 0. This is particularly evident if you visualize both models:
Clearly, Model B does not look all that different from Model A. It just so happens that their estimates fall slightly on either side of the $\alpha = 0.05$ criterion.
So keep this in mind: if you ever want to make a statement that one thing is significantly better than another thing, you need to directly compare them with a statistical test. If comparing groups, make sure "group" is a variable in one model so that the beta coefficient of group differences can be tested against $\beta = 0$. If comparing models, make sure to use something like Vuong's test to robustly investigate the difference in model $adjR^2$ scores against a null sampling distribution of $\Delta adj R^2 = 0$.
Another thing to be aware of in model selection is what's called the multiple comparisons problem.
Imagine you want to select the best model among 4 options, A, B, C, and D. You start out by testing A against B, then against C, and against D. In each of these comparisons, you find that Model A was significantly better because the p-value of $\Delta R^2$ was <0.05 for each comparison (e.g., p = 0.01, p = 0.005, p = 0.04). Sounds great, right?
But there is a danger to doing multiple comparisons like this. When we do a hypothesis test, we set an $\alpha$ level to define what will count as an “unlikely” sample estimate. It is common to set $\alpha = 0.05$, such that any estimate in the null sampling distribution that is more extreme than the 0.05 probability area is considered unlikely to be produced by the null sampling distribution. But that doesn't mean the null sampling distribution can't produce it. It is possible that we reject the null hypothesis when it is in fact true - that we commit a Type I error.
In the event the null hypothesis is true, if we set our Type I error rate at 0.05 (that is, we define 5% of the least likely estimates as unlikely) and we do a lot of hypothesis tests, we will make a Type I error, on average, one out of every 20 times (5% of the time). When the null hypothesis is true, we will erroneously get a statistically significant result 5% of the time. On the flipside, that means we avoid making a Type I error 95% of the time.
This is okay if all we care about is a single test. But if we do three tests in sequence as in the example above, we want to achieve a 95% rate of avoiding Type I error across all three tests, not just for each one separately.
You can think of it like flipping a coin. If you flip a coin once, the probability that it will come up heads is 50%. But if you flip a coin three times, the probability of all three coming up heads is a lot less than 50%. Similarly, if you do a single hypothesis test, the probability of avoiding a Type I error is 95%. But if you do three hypothesis tests, the probability of avoiding a Type I error each time falls below 95%.
How much below 95%? If the probability of a true negative is 95%, the probability that all three tests are true negatives would be the joint probability 95%*95%*95%, or 0.857. Therefore, the probability that at least one of these three tests is a false positive is 1 minus 0.857, or 0.143. This is called the family-wise error rate. What this means is that, when the null hypothesis is true, the probability of making a Type I error on at least one of the three comparisons is 0.143 (which is higher than 0.05). With every additional hypothesis test we run, the chance increases that we've made a Type I error somewhere.
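A quick way to check this arithmetic in R:
#family-wise error rate across m independent tests, each at alpha = 0.05
alpha <- 0.05
m <- 3
1 - (1 - alpha)^m    #about 0.143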
In modern scientific research, we are increasingly asking questions that involve many hypothesis tests. E.g., in model selection, where we compare many models to see which makes the best predictions; or in cases where there are multiple outcome measures to make predictions about. These scenarios can be found all across the social and life sciences. The more tests that are run, the bigger the family-wise Type I error rate and the greater the risk of erroneous conclusions, such as finding significant brain activity in a dead salmon.
To control this family-wise error rate, there is a suite of methods for multiple comparison correction. Some involve more sophisticated math than others, but all involve adjusting either the p-values of the hypothesis tests or the $\alpha$ criterion they are compared against, based on the number of tests run. The simplest multiple comparison correction method is called the Bonferroni adjustment, named after the person who proposed it. In this method, if we want to maintain a 95% chance of not making a Type I error on any of our comparisons, we simply divide our original $\alpha$ criterion (0.05) by the number of tests we ran (3). This gives us a new criterion to judge significance by (0.05/3 = 0.0167).
In our model selection example above, Model A was better than Models B, C, and D at p = 0.01, p = 0.005, and p = 0.04, respectively. If any of those comparisons had been done alone, we would simply compare those p-values to $\alpha = 0.05$ and find that Model A is significantly better. But because we did three tests, there is a risk that any one of those tests is an incorrect rejection of the null hypothesis. In order to control that risk with Bonferroni correction and keep the Type I family-wise error rate to 5%, we adjust down our $\alpha$ criterion based on the number of tests we ran, to 0.0167. p = 0.01 and p = 0.005 are still lower than this new criterion, so we still decide that Model A is significantly better than Model B or C. But p = 0.04 is not below this criterion, so Model A is no longer significantly better than Model D.
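Base R's p.adjust() function can apply this correction for you. Here is a sketch using the three example p-values from above; note that p.adjust() scales the p-values up (so you keep comparing them to 0.05) rather than scaling $\alpha$ down, but the decisions come out the same:
#Bonferroni correction applied to the three example p-values
p_values <- c(0.01, 0.005, 0.04)
p.adjust(p_values, method = "bonferroni")    #each p-value multiplied by the number of tests
#equivalent decision rule: compare the raw p-values to the adjusted criterion
p_values < 0.05 / length(p_values)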
The Bonferroni adjustment is straightforward, and essentially makes our decision criteria for any one test more stringent. But some people think it is overly conservative - in other words, that it’s trying too hard to protect us from making Type I error. The corrected criterion can get very small if the number of simultaneous comparisons gets large. Although this decreases the chance of making a Type I error, it increases the chance of making a Type II error of not detecting a difference when one does exist. We've taught you the Bonferroni correction because it is the easiest to use, but other less conservative multiple comparison corrections are more common in research and include Tukey's Honestly Significant Difference Test and the Benjamini-Hochberg False Discovery Rate.
After reading this chapter, you should be able to: