# Run this first so it's ready by the time you need it
install.packages("ggformula")
install.packages("dplyr")
library(ggformula)
library(dplyr)
studentdata <- read.csv("https://raw.githubusercontent.com/smburns47/Psyc158/main/studentdata.csv")
In this course we have discussed how we use statistics to test pre-existing hypotheses about data. So far, our hypotheses have only been as specific as "a predictor explains some of the variance in an outcome."
But we can define a more formal hypothesis than that. The strictest definition of a hypothesis is a pre-defined statement about what value a model coefficient will have, once we fit the model. Once we have a formal hypothesis about what the population parameter is, we can judge whether or not a sample of data likely came from that population.
For example, rather than hypothesizing generally that someone's height is related to their thumb length, a formal hypothesis would be "the effect of height for predicting thumb length is 0.9." In Frequentist statistics, we assume there is one population parameter from which random samples are derived. So we can collect some data, fit a model, build a confidence interval, and check if the estimated interval includes our hypothesized 0.9 parameter. If it does, then we would say the data is consistent with the proposed population - some data like this is likely to be generated by a population with a coefficient of 0.9. However if 0.9 is not in the confidence interval, we would be very surprised to hear that this data was indeed generated by this parameter. Only a very few rare samples would look like ours, if 0.9 were the true coefficient. We would think it more likely that a different sort of population generated the data.
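As a concrete sketch of that check in R, using the `Thumb` and `Height` variables in the `studentdata` dataset loaded above (the 0.9 value and the object names here are just for illustration, not a serious hypothesis about these data):
#fit a model predicting thumb length from height
height_model <- lm(Thumb ~ Height, data = studentdata)
#95% confidence interval for the height coefficient
height_ci <- confint(height_model, "Height", level = 0.95)
height_ci
#is the hypothesized parameter of 0.9 inside the interval?
hypothesized_beta <- 0.9
height_ci[1] <= hypothesized_beta & hypothesized_beta <= height_ci[2]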
Try thinking of some formal hypotheses for other research questions. E.g., predicting someone's resting heart rate based on how many hours during the week they spend in the gym. What do you think the specific value of $\beta_1$ would be in a model such as `heart rate ~ gym hours`?
If you're having a hard time thinking of where to even start with that question, you wouldn't be alone. A specific hypothesis is really tough to create and requires a level of knowledge about the data generation process that is impractical for most topics we care about. It's much easier to hypothesize that there is just some sort of relationship between gym hours and heart rate, such that the population parameter is something that is NOT 0.
This is the crux of how Frequentists test hypotheses, called Null Hypothesis Significance Testing (NHST). In this process, your hypothetical thinking skills will be exercised. First, you pose a formal hypothesis that $\beta = 0$. This is the null hypothesis, that there is a null effect for $\beta$. The null hypothesis name is often abbreviated as $H_0$. Then, we collect data, make an estimate, and judge how likely a population with $\beta = 0$ is to produce these data. If it's very unlikely, we decide that the data were probably generated by some other population: the alternative hypothesis. We reject the null hypothesis.
It's important to remember that an estimate of a coefficient having a non-zero value is not enough to reject the null hypothesis. As we saw last chapter, a population with a specific parameter can generate samples with a range of estimates. So if we get a sample with $b = 0.5$, is that different enough from 0 to reject the null hypothesis? Or could that estimate have been reasonably generated by a population with $\beta = 0$?
To answer this question, we turn to a concept called statistical significance.
Let's refresh on the concept of a sampling distribution in a concrete example before we learn how to test the statistical significance of a sample estimate. Going back to the `studentdata` dataset, we previously explored how someone's sex would explain some of the variation in thumb lengths. In other words, we thought there was a non-zero difference between the mean of female thumbs and the mean of male thumbs. We didn't have a more specific hypothesis than that, but that "non-zero" part is enough to define a null hypothesis and test whether or not these data likely came from a population that does have a zero difference between female and male thumbs.
In our equation of the statistical model of this question:
$$Y_i = b_0 + b_1X_i + e_i$$
$X_i$ represents whether or not someone is male or female. The parameter estimate $b_1$ is the effect of Sex (i.e., difference in thumb lengths between males and females), and is thus the specific one we are interested in. This is our best estimate of $\beta_1$, the true effect of sex in the population of thumb lengths.
Before we fit this model again and get an estimate, let’s imagine what we would expect to see if a particular version of the data generation process were true. If there is a real difference between male and female thumb lengths such that male thumbs were generally longer (i.e., if $\beta_1$ is a positive value), we might expect that samples from such a population would have positive $b_1$ values on average in the sampling distribution.
Although we couldn’t predict any single $b_1$ that will be drawn from the population, we can make predictions about the average $b_1$ that would be generated from multiple random samples. On average, the $b_1$ estimates would cluster around the parent $\beta_1$ from which they come. So a negative $\beta_1$ would tend to produce negative $b_1$, a positive $\beta_1$ would tend to produce positive $b_1$.
The null hypothesis is a special case in which $\beta_1 = 0$. If the null hypothesis is true it means that someone's sex has no effect on their thumb length. The $b_1$ values generated by multiple random samples from a population in which $\beta_1 = 0$ would tend to cluster around 0, but they wouldn’t necessarily be exactly 0. We can construct a sampling distribution to find out if a single $b_1$ for our sample could have been generated by the null hypothesis. This is called the null sampling distribution.
We can make a null sampling distribution from our own data with a type of simulation called permutation testing. In this, we permute or shuffle around the datapoints in the predictor variable, thus breaking any relationship between a predictor and the outcome. We can use `sample()` to do this, drawing all the datapoints in a variable randomly without replacement and then making a new variable out of that randomized vector.
#first ten values before shuffling
studentdata$Sex[1:10]
studentdata$Shuffled_Sex <- sample(x=studentdata$Sex, size=length(studentdata$Sex), replace=FALSE)
#first ten values after shuffling
studentdata$Shuffled_Sex[1:10]
Breaking any association between `Sex` and `Thumb` by shuffling the values of `Sex` around makes it so that any remaining relationship between those variables is only due to randomness - the true association, or true $\beta_1$, is 0. Doing this many times and collecting the sample estimates of models fit on these shuffled data will give us a sampling distribution of $b_1$ when the null hypothesis is true.
#creating an empty vector of 1000 spots
null_b1s <- vector(length=1000)
#generate 1000 unique shuffled samples, saving each b1
set.seed(47)
for (i in 1:1000) {
studentdata$Shuffled_Sex <- sample(x=studentdata$Sex, size=length(studentdata$Sex), replace=FALSE)
model <- lm(Thumb ~ Shuffled_Sex, data=studentdata)
null_b1s[i] <- model$coefficients[[2]]
}
b1s_df <- data.frame(null_b1s)
gf_histogram( ~ null_b1s, data=b1s_df)
In this sampling distribution, we can see that $b_1$ varies each time we shuffle the data and calculate a new model fit. Also, the center seems to be around 0. We know from last chapter that the mean of a sampling distribution converges on the population parameter. Because the sampling distribution is based on the null hypothesis, where $\beta_1 = 0$, we expect the parameter estimates to be clustered around 0. But we also expect each individual one to vary because of sampling variation. We can see here that it's possible to get a $b_1$ estimate as high as +/-5 just by chance, even when there is no true difference between female and male thumbs!
However, these values are really rare in the sampling distribution. It's much more common to generate small mean differences (e.g., 0.6).
Just eyeballing the histogram can give us a rough idea of the probability of getting a particular sample $b_1$ from this population where we know $\beta_1$ is equal to 0. When we use these frequencies to estimate probability, we are using this distribution of shuffled $b_1$ values as a probability distribution.
We used R to simulate a world where the null hypothesis is true in order to construct a sampling distribution. Now let’s return to our original goal, to decide whether the null hypothesis is a good explanation for our data, or if it should be rejected.
The basic idea is this: using the sampling distribution of possible sample $b_1$ values that could have resulted from a population in which the null hypothesis is true (i.e., in which $\beta_1 = 0$), we can look at a specific sample $b_1$ that we estimated with data and gauge how likely such a $b_1$ would be if the null hypothesis is, in fact, true.
If we judge the $b_1$ we observed to be unlikely to have come from the null hypothesis, we then reject $H_0$ as our specific idea about the data generation process. If, on the other hand, we judge our observed $b_1$ to be likely (or at least not all that unlikely), then the null hypothesis could still be a good explanation for these data. We fail to reject the null hypothesis.
Let’s see how this works in the context of the `studentdata` dataset, where $b_1$ represents the difference in average thumb lengths between males and females.
Samples that are extreme in either a positive (average male thumbs much longer than females) or negative direction (average male thumbs much shorter than females) are unlikely to be generated if the true $\beta_1 = 0$. If we saw a sample $b_1$ like this, we would doubt that the null hypothesis produced it.
Put another way: if we had a sample that fell in either the extreme upper tail or extreme lower tail of the null sampling distribution (see figure below), we might reject the null hypothesis as the true picture of the data generation process.
In statistics, this is commonly referred to as a two-tailed significance test because whether our actual sample falls in the extreme upper tail or extreme lower tail of this sampling distribution, we would have reason to reject the null hypothesis as the true version of the population. By rejecting the null hypothesis, we decide instead that some alternative hypothesis where $\beta_1 \neq 0$ must be true. We wouldn’t know exactly what the true $\beta_1$ is, only that it is probably not 0. In more traditional statistical terms, we would have found a statistically significant difference between male thumb lengths and female thumb lengths. $b_1$ is significantly different from 0.
All of this, however, raises the question of how extreme a sample $b_1$ would need to be in order for us to reject the null hypothesis. What is unlikely to one person might not seem so unlikely to another. It would help to have some agreed-upon standard of what counts as unlikely before we actually bring in our real sample statistic. The definition of “unlikely” depends on what you are trying to do with your statistical model and what your community of practice agrees on.
The common standard used in the social sciences is that a sample counts as unlikely if there is less than a 5% chance of generating a sample estimate at least as extreme as this one (either negative or positive) in the null sampling distribution. We notate this numerical definition of “unlikely” with the Greek letter $\alpha$ (pronounced “alpha”). A scientist might describe this criterion by writing that they “set $\alpha = .05$". If they wanted to use a stricter definition of unlikely, they might say “$\alpha = .001$,” indicating that a sample would have to be really unlikely for us to reject the null hypothesis.
Let’s identify an α-level of .05 in the null sampling distribution of $b_1$ we generated from random shuffles of thumb lengths. If you take the 1000 $b_1$ values and line them up in order, the lowest 2.5% and the highest 2.5% together make up the most extreme 5% of values and therefore the most unlikely values to be randomly generated. So let's sort our null distribution and then find the values corresponding to the 2.5%ile and 97.5%ile. Since there are 1,000 data points in our simulated null sampling distribution, these would be the 25th and 975th values of the sorted vector.
#cut-off values for extremeness
sorted_nullb1 <- sort(null_b1s)
high_cutoff <- sorted_nullb1[975]
low_cutoff <- sorted_nullb1[25]
high_cutoff
low_cutoff
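A quicker way to get (essentially) the same cutoffs is the `quantile()` function; its answers can differ from the hand-sorted values by a tiny amount because of how it interpolates between data points:
#cut-off values using quantile() instead of sorting by hand
quantile(null_b1s, probs = c(0.025, 0.975))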
#marking something as extreme if it is greater than 97.5%ile or less than 2.5%ile
b1s_df$extreme <- b1s_df$null_b1s > high_cutoff | b1s_df$null_b1s < low_cutoff
gf_histogram(~ null_b1s, data = b1s_df, fill = ~extreme)
A neat trick to know is that if we convert a sampling distribution to z-scores, we automatically know which samples are extreme by looking at which z-scores are above 1.96 or below -1.96, since 1 in z-scored units corresponds to 1 standard deviation and the 2.5%ile and 97.5%ile values of a normal distribution are 1.96 SDs away from the mean.
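Here is a minimal sketch of that trick applied to our simulated null distribution; it should flag roughly (though not exactly) 5% of the 1000 shuffled estimates as extreme:
#convert the null b1 estimates to z-scores
z_scores <- (null_b1s - mean(null_b1s)) / sd(null_b1s)
#flag estimates more than 1.96 SDs from the mean as extreme
extreme_z <- abs(z_scores) > 1.96
#proportion flagged as extreme - should be close to .05
mean(extreme_z)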
We have now spent some time looking at the sampling distribution of $b_1$ assuming the null hypothesis is true (i.e., $\beta_1 = 0$). We have developed the idea that simulated samples, generated by random shuffles of the thumb length data, are typically clustered around 0. Samples that end up in the tails of the distribution – the upper and lower 2.5% of values – are considered unlikely.
Let’s place our sample $b_1$ onto our histogram of the sampling distribution and see where it falls. Does it fall in the tails of the distribution, or in the middle 95%?
sex_model <- lm(Thumb ~ Sex, data=studentdata)
b1 <- sex_model$coefficients[[2]]
gf_histogram(~ null_b1s, data = b1s_df, fill = ~extreme) %>%
gf_point(x = b1, y = 0, size = 3) %>%
gf_refine(coord_cartesian(xlim=c(-7,7)))
We can see that our sample is in the unlikely zone. It is beyond the point where the histogram of the null sampling distribution switches to blue extreme values. Thus we would reject that it comes from a population where male and female thumb lengths are the same, and say that our estimate of $\beta_1$ is significantly different than 0.
But we can say it more specifically than this, and give its location in the tail a quantitative score. Instead of only asking whether the sample $b_1$ is in the unlikely area or not (yes or no), we could instead ask, what is the probability of getting a $b_1$ as extreme or more extreme as the one observed in the actual experiment, if the null hypothesis were true? The answer to this question is called the p-value.
We know what our $\alpha$ is before we even do a study – it’s just a statement of our criterion for deciding what we will count as unlikely. Whatever the values in the null sampling distribution end up being, we define the "extreme" ones as the top and bottom 2.5% of the distribution. The p-value is calculated after we do a study, based on actual sample data. It depends on both the value of the sample estimate we want to assign a p-value to and the standard error (which in turn depends on the sample size).
We calculate the p-value of any potential $b_1$ value as the cumulative probability that we would produce a $b_1$ as extreme or more extreme than this one when the null hypothesis is true. Since our $\alpha = 0.05$, any p-value < 0.05 would be considered significant.
We can use the `summary()` function we learned about earlier to find the exact p-value for a sample estimate:
summary(sex_model)
Look at the last column in the Coefficients table, called "Pr(>|t|)". This is the probability of getting a value as extreme or more extreme than our $b_1$ estimate under the null hypothesis - aka the p-value. You can see it is a very small number - clearly smaller than our 0.05 $\alpha$ criterion. We would thus decide that a $b_1$ estimate of 6.056 is so unusual in the null sampling distribution where $\beta_1 = 0$ that 0 probably is not the population parameter that created this estimate. We reject the null hypothesis.
You'll see the intercept has a p-value too, also a tiny number. This is saying that we can reject the null hypothesis that $\beta_0$ is 0. All fitted parameters in a linear model get a p-value, but you might not care to evaluate all of these parameters if your research question is about a particular one.
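If you want to pull a p-value out of this output programmatically rather than reading it off the printed table, the Coefficients table inside the summary object is a matrix you can index by row and column name (the object name `coef_table` is just for illustration):
#the Coefficients table of a summary object is a matrix
coef_table <- summary(sex_model)$coefficients
#p-value for the effect of Sex (row "SexMale", column "Pr(>|t|)")
coef_table["SexMale", "Pr(>|t|)"]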
Something else to notice is the confidence interval for $b_1$:
confint(sex_model, "SexMale", level=0.95)
This suggests that we are 95% confident that the true difference between male and female thumb lengths is somewhere between 3.037mm and 9.075mm. This interval does not include 0, so we are 95% confident that the null hypothesis is not the parameter that made these data.
A confidence interval tells you the same thing as a p-value, just in a different way. If the interval does not include 0, we decide that this estimate is significantly different than 0. If it does include 0, then 0 might reasonably be the parameter that created these data and we fail to reject the null hypothesis.
The early statisticians who developed the ideas behind sampling distributions and p-values didn’t have computers. They couldn't even imagine what it might be like to shuffle their data a thousand times in just a couple seconds. What we have been able to do with R would seem like a miracle to them! So instead of using computational techniques to create sampling distributions, the early statisticians had to develop mathematical formulas of what the sampling distributions should look like, and then calculate probabilities based on these mathematical distributions. These calculations are what the `summary()` function uses to give us exact p-values, so we'll learn a little about that now to understand where these numbers come from.
The mathematical concept that `summary()` uses to understand the sampling distribution of coefficient estimates is known as the t-distribution. The t-distribution looks very similar to the normal distribution.
In the figure below we have overlaid the t-distribution (depicted as a red line) on top of a sampling distribution constructed with permutation testing. You can see that it looks very much like the normal distribution you learned about previously.
While the sampling distribution we created using permutation testing looks jagged (because it was made up of just 1000 separate estimates), the t-distribution is a smooth continuous mathematical function. It is the theoretically idealized shape of the null sampling distribution. If you want to see the equation that describes this shape, you can see it here.
Whereas the shape of the normal distribution is completely determined by its mean and standard deviation, the t-distribution changes shape slightly depending on how many degrees of freedom are in the samples that make up the sampling distribution. You can see how degrees of freedom affect the shape of the t-distribution in the figure below. Once the degrees of freedom reach about 30, however, the t-distribution looks very similar to the normal distribution.
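To see this shape change for yourself, here is a quick sketch that plots the t-distribution density at two different degrees of freedom next to the normal distribution, using R's built-in density functions `dt()` and `dnorm()` (the data frame and column names are just for illustration):
#density of the t-distribution at different degrees of freedom, plus the normal
dist_df <- data.frame(x = seq(-4, 4, by = 0.01))
dist_df$t_df3 <- dt(dist_df$x, df = 3)
dist_df$t_df30 <- dt(dist_df$x, df = 30)
dist_df$normal <- dnorm(dist_df$x)
#the df = 30 curve is nearly indistinguishable from the normal curve
gf_line(t_df3 ~ x, data = dist_df, color = "red") %>%
  gf_line(t_df30 ~ x, data = dist_df, color = "blue") %>%
  gf_line(normal ~ x, data = dist_df, color = "black")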
To calculate a p-value, we find the cumulative probabilities in the upper and lower tails of the t-distribution - i.e., the area under the curve that is more extreme than + or - our sample $b_1$. Fortunately, you don’t have to do this math; R will do it for you in `summary()`. But you could do it yourself with the t-distribution's probability functions, such as `pt()` and `qt()`, which work the same way as the `rnorm()` and `pnorm()` functions we learned about in chapter 8.
These functions give you a t-value, telling you where on the t-distribution your sample $b_1$ falls if the null hypothesis is true. This specific t-value can be found in the third column of the Coefficients table in the `summary()` output.
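To illustrate where those last two columns come from, here is a sketch that recomputes the t-value and p-value for the Sex effect by hand from the `sex_model` fit above; `pt()` gives cumulative probabilities from the t-distribution, and the residual degrees of freedom are stored in the model object:
#pull the estimate and standard error for the Sex effect
coef_table <- summary(sex_model)$coefficients
b1_est <- coef_table["SexMale", "Estimate"]
b1_se <- coef_table["SexMale", "Std. Error"]
#t-value: how many standard errors the estimate is away from 0
t_value <- b1_est / b1_se
t_value
#two-tailed p-value: area in both tails beyond +/- the t-value
p_value <- 2 * pt(-abs(t_value), df = sex_model$df.residual)
p_value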
#Write some code to again output the bigger results table for sex_model, not just the coefficients
Because this is the traditional method of determining p-values and statistical significance, sometimes you'll read research papers where the authors ran a linear model but also report a t-value. This does not mean that they did a t-test, only that they're reporting the corresponding t-value of their beta estimate that is found in the linear model output.
In our one-predictor model, the sample $b_1$ for the difference between male and female thumb lengths was 6.056. Based on the output of `summary()`, the probability of getting a sample with a $b_1$ as extreme or more extreme than 6.056 when the null hypothesis is true is approximately 0.000113. Based on our $\alpha$ criterion of .05, we decided that $b_1 = 6.056$ is significantly different from 0, and we reject the null hypothesis. Some alternative population likely made these data instead.
Compare that to a $b_1 = 3$. This isn't our real sample estimate, but we can use permutation testing to find what the p-value of this estimate would be in our simulated null sampling distribution. To do so, we simply find how many values in the `null_b1s` vector are above 3 or below -3. This number divided by the size of the full simulated sampling distribution gives us the proportion of values more extreme than 3 - aka the p-value.
num_above <- sum(null_b1s > 3)
num_below <- sum(null_b1s < -3)
pvalue <- (num_above + num_below) / length(null_b1s)
pvalue
According to this simulation, a $b_1$ of 3 would have a p-value ~ 0.072. This is larger than the $\alpha$ value of 0.05, so we fail to reject the null hypothesis. This kind of group difference is not so unlikely that we think the null hypothesis didn't generate it.
As evidenced here, the p-value is affected by how far the observed $b_1$ is from 0. Since 6.056 is further away from 0 than 3 is from 0, $b_1 = 6.056$ has a smaller p-value. The further away $b_1$ is from 0, the lower the p-value, meaning the less likely the observed $b_1$ estimate is to have been produced by the null hypothesis.
But the distance between $b_1$ and 0 (or the hypothesized $\beta_1$) is not the only thing that affects a p-value. The other important factor is the width of the sampling distribution, also known as the standard error.
Take a look at the two simulated sampling distributions in the figure below. The one on the left is something like what we created in our permutation test earlier, where $b_1 = 3$ is not significant. The one on the right is similar, but narrower. Both have a roughly normal shape, both consist of 1000 sample estimates, and both distributions are centered at 0. But the standard error is smaller for the distribution on the right. By being narrower, the sampling distribution on the right brings the extreme zone farther in, and makes a $b_1$ value of 3 now significant.
The standard error can make a big difference in our ability to reject the null hypothesis. If it is smaller, we will have an easier time rejecting the null. This is because whatever estimate we get for $b_1$, it will be more likely to be in the upper or lower .025 of the null sampling distribution.
We learned last chapter that the size of the standard error is tied to the size of the samples within it. Samples with fewer datapoints will have more varied parameter estimates, and thus the sampling distribution will be wider, with a larger standard error. In the above figure, the left sampling distribution was simulated with sample sizes of N = 100. The right sampling distribution was simulated with sample sizes of N = 300.
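If you'd like to see that pattern yourself, here is a rough simulation sketch in the spirit of the figure: repeatedly draw samples of N = 100 and N = 300 from `studentdata` (with replacement, treating the dataset as a stand-in for a population), fit the sex model on each, and compare how spread out the resulting $b_1$ estimates are. The function and object names are just for illustration.
#function that returns one b1 estimate from a random sample of a given size
get_b1 <- function(n) {
  resampled <- slice_sample(studentdata, n = n, replace = TRUE)
  resampled_model <- lm(Thumb ~ Sex, data = resampled)
  resampled_model$coefficients[[2]]
}
set.seed(47)
b1s_n100 <- replicate(1000, get_b1(100))
b1s_n300 <- replicate(1000, get_b1(300))
#the SD of each set of estimates approximates that sampling distribution's standard error
sd(b1s_n100)
sd(b1s_n300)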
Our `studentdata` dataset has 157 data points in it, which is a fine sample size. But let's see what would happen if it were a much smaller study, with only 20 people. We'll use `slice_sample()` to draw only 20 random people from this dataset and then fit a model on those values.
set.seed(47)
smaller_studentdata <- slice_sample(studentdata, n = 20, replace = FALSE)
smaller_sex_model <- lm(Thumb ~ Sex, data = smaller_studentdata)
summary(smaller_sex_model)
This standard error is much larger than we saw before. This means that the extreme cutoffs of the null sampling distribution sit further from 0, and a wider range of $\beta_1$ values could plausibly have created this $b_1$ estimate. Our estimate $b_1$ of the effect of Sex is no longer significant.
Sample size is thus directly tied to our ability to reject the null hypothesis, independent of what our $b_1$ estimate actually is. Running a bigger study makes the null sampling distribution narrower, and thus it is easier to reject the null hypothesis. This is relevant for when we think the true value of a coefficient is something small or close to 0. If we want to get a good estimate of it (narrow confidence interval) and be confident it is not 0 (p < 0.05), we need to collect a lot of data.
But this doesn't mean that if we get a small p-value with a small study, that we're in the clear. Coefficient estimates in general are less stable when sample sizes are small, meaning the difference between whether something is significant or not could be the inclusion/exclusion of one extreme data point, or some other small modeling choice. Fishing around for the perfect configuration of data that makes your results significant is called p-hacking, and increases the chance that you incorrectly reject the null hypothesis when it is actually true (more on this in chapter 22).
We can also take the relationship between p-values and sample size to absurd limits, to where it may be possible to collect too much data. Now let's sample giant datasets, of one hundred thousand data points. We'll use `replace=TRUE` in the `slice_sample()` function, or else we would run out of datapoints to use for this simulation.
set.seed(47)
bigger_studentdata <- slice_sample(studentdata, n = 100000, replace = TRUE)
bigger_sex_model <- lm(Thumb ~ Sex, data = bigger_studentdata)
summary(bigger_sex_model)
The standard error in this case is much smaller. Even a mean difference in thumb lengths as small as 0.15mm would be considered significantly different than 0 with this sample size.
In very large datasets, nearly every model coefficient you estimate will be statistically significant. Significance then becomes a less useful concept. Sure 0.15mm is significantly different from 0 when it's estimated from 100k datapoints, but how much does a difference that tiny matter to us for using the model? Do we care that male and female thumb lengths would be different by only 0.15mm? Or is that so small that it's no longer practically different than 0?
Significance testing is a major component of modern psychology research. Hypotheses and theories are made or broken on the back of p < 0.05. But the strong reliance on p-values for determining whether effects exist or not is not always a good thing. There are several limits to what conclusions we can make using significance testing, and sometimes people push past these limits.
First, we typically set our $\alpha$ criterion to be 0.05, meaning that any $b_1$ value that is as extreme or more extreme than the 5% most extreme values in the null sampling distribution will be treated as unlikely. For unlikely estimates, we decide that they are probably not from the null hypothesis at all. This could be the right decision…
But it might be the wrong decision. If the null hypothesis is true, 5% of the $b_1$ values that could appear would be extreme enough to lead us to reject the null hypothesis. We would be incorrectly deciding that these data came from a different distribution than they actually did. If we rejected the null hypothesis when it is in fact true, we would be making what's called a Type I error or false positive.
Our chance of making this type of inference error is directly tied to the $\alpha$ level we chose. By setting $\alpha = .05$, we are saying that $b_1$ values above or below 1.96 SDs of the null sampling distribution are so unlikely that we think they're from a different population. But by definition, 5% of the $b_1$ values in the $\beta_1 = 0$ sampling distribution are this level of unlikely. Thus when the null hypothesis is true, 5% of samples drawn from the population will lead us to the wrong conclusion.
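You can check this claim directly with the same shuffling trick from earlier in the chapter: force the null hypothesis to be true by permuting `Sex`, fit the model, and count how often the p-value still comes out below .05. The proportion should land near .05 (not exactly, because of simulation noise):
#simulating the Type I error rate under a true null hypothesis
set.seed(47)
false_positives <- vector(length = 1000)
for (i in 1:1000) {
  studentdata$Shuffled_Sex <- sample(studentdata$Sex)
  shuffled_model <- lm(Thumb ~ Shuffled_Sex, data = studentdata)
  false_positives[i] <- summary(shuffled_model)$coefficients[2, "Pr(>|t|)"] < 0.05
}
#proportion of true-null samples incorrectly flagged as significant
mean(false_positives)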
We can reduce this error rate by making our $\alpha$ value smaller. Maybe instead of 5% of the null sampling distribution being surprising, we set $\alpha = 0.001$ such that only 0.1% of the null sampling distribution would be considered unlikely. This would make it harder for us to erroneously reject the null hypothesis.
However, that causes us other problems. Now, it is harder to detect when the null hypothesis should be rejected. We need much stronger evidence to do so when $\alpha = 0.001$ than when $\alpha = 0.05$. If we fail to reject the null hypothesis when it should be rejected, this is called a Type II error or a false negative. Type I and Type II errors are always a tradeoff with each other - where we reduce the chance of one, we increase the chance of the other. Also, it is never possible to completely eliminate the risk of either. This is an inherent limitation of Null Hypothesis Significance Testing.
A second limitation of NHST is that, while it is possible to declare a sample estimate as too unlikely for us to think it comes from the null hypothesis, it is not possible to be sure an estimate definitely does come from the null hypothesis. To be concrete, what if a $b_1$ estimate doesn’t fall in the tails of the null sampling distribution but instead falls in the middle part? Should we say that we're confident that the $\beta_1 = 0$ is the truth? With significance testing of one sample estimate, we can never confirm a specific null hypothesis, only reject it or fail to reject it. This is because there are infinite other population parameters slightly above or below 0 that could likely generate this estimate as well. You can't prove that any estimate came from $\beta_1 = 0$ specifically and not $\beta_1 = 0.1$. You can only say there is a very high likelihood that something did not come from a particular population like $\beta_1 = 0$. Null hypothesis testing is about rejecting some reality, not confirming one. If we fail to reject the null hypothesis, the true population parameter could be 0, but it could also be something else sort of close to 0. We don't know. (Note that there is a way to build support for the null hypothesis, called equivalence testing, but that is a more advanced approach outside the scope of this course.)
Thirdly, another limitation of NHST is that there are not different levels of statistical significance. We can compute different p-values for different $b_1$ estimates and find that one has a higher probability to appear under the null hypothesis than another. But the decision of something being significantly different from 0 is a binary decision - it either is significant, or it is not. Something that is in the top 2% of the null sampling distribution is just as significant as something that is in the top 0.02% of the distribution, because both are beyond the alpha cutoff. Null hypothesis testing doesn't give us a way to test whether different p-values are significantly different from each other. Thus a sample estimate is either significant, or it's not. We decide it either didn't come from the null hypothesis, or there's not enough evidence to decide. You can't have one estimate that is more or less significant than another, and you can't have something that is almost significant.
Fourthly, the specific meaning of a p-value can be tricky to get right. It is the probability that, given the null hypothesis being true, we would find a sample value as extreme or more extreme than the one we did. Thinking in terms of conditional probability again, it is $P(b|\beta=0)$. A p-value does NOT mean the probability that the null hypothesis is true. In Frequentist statistics, a population parameter cannot have a probability - it exists as a single entity that can't be repeated multiple times like a sample. The null hypothesis thus does not have a probability. These data have a probability, given the null hypothesis being true.
Lastly, as we explored previously, the p-value we get is tied to our sample size. In small amounts of data, we would need to estimate a really large effect in order to call it significant, because the confidence interval is so wide that it covers many values including 0. Thus when we only have a small amount of data available to us, we might not be able to declare anything as significant. But in large amounts of data, almost everything is significant, even effects so tiny that they're not useful in the real world. Would it matter to you to hear that a predictor in a model was significant, but the model only explained 0.01% of the variation in the outcome data? P-values can thus be "gamed" with large sample sizes. They tell us something about the confidence of this true parameter being 0 or not, but they don't help us with determining whether this effect size is useful for real-world purposes.
In summary, there are five main hang-ups that cause a lot of consternation for people trying to use null hypothesis testing:
- The risk of Type I and Type II errors can never be eliminated, only traded off against each other.
- We can reject or fail to reject the null hypothesis, but we can never confirm it.
- Statistical significance is a binary decision; there are no degrees of significance.
- A p-value is the probability of the data given the null hypothesis, not the probability that the null hypothesis is true.
- P-values depend on sample size, so significance alone doesn't tell us whether an effect is large enough to matter.
This is a long section on the difficulties of null hypothesis testing because a lot of people use it incorrectly, even professional scientists. NHST is a powerful approach but it's tricky to interpret correctly, and it often gets used for the wrong questions that it is just not able to answer.
Thus if you default to only ever looking at p-values in your research, and not the wider context of what your model looks like and what goals you have for it, you run the risk of misinterpreting your statistics. Just because a p-value is significant does not mean your model can be used in the way you want, and just because a p-value is not significant doesn't mean your model is useless. We'll cover the idea of practical vs. statistical significance in a later chapter, but for now, remember that p-values aren't the whole story! We also want to judge what sort of predictions our models are making, whether we like that level of accuracy, and whether there is any way we can think of to make our model better.
Let's practice using and interpreting p-values with some concrete research situations. One hypothesis we have used a lot is whether someone's sex explains some of the variation in their thumb length. A research question like this implies that we are interested in the predictor "sex" specifically, and its unique contribution to explaining variation in thumb length. We recognize that there are likely many factors that lead to how long someone's thumb is, and we may even know some more of them and be able to model them as well. But our interest right now is on the effect of specifically someone's sex in the data generation process. Is it a meaningful contributor to explaining variation in thumb length on its own? Or is there not really an effect?
Using the framework of Null Hypothesis Significance Testing, we can answer this sort of question. We just have to go through five steps in order to frame this question correctly for NHST.
First, we need to pick the variables we're using to test this question. We'll use the outcome variable `Thumb` from the `studentdata` dataset, as we have been, and we'll use `Sex` to predict it.
Second, we need to formulate the null hypothesis. We think that sex is interesting if it predicts some variation in thumb length - that there is a relationship between the two. That implies that there would be a non-zero coefficient for `Sex` in a linear model. It would not be an interesting variable if there was no change in predicted thumb length when sex varied - i.e., if the $\beta_1$ coefficient were 0. Thus, we define the null hypothesis to be:
$$H_0: \beta_1 = 0$$
In a world where the true population parameter $\beta_1$ is 0, where there is no real effect of sex, we wouldn't be so interested in it as a predictor. So we want to test if it's likely that $b_1$ in our sample came from a world where $\beta_1$ is 0, or if we want to reject that explanation.
In step 3, we now need to fit a model to the data. We need to specify the equation of that model so we know what we're testing:
$$\hat{Y} = b_0 + b_1X_i$$
where $X_i$ is `Sex`. Knowing that equation, we can fit it with `lm()` as we have done many times before:
model_obj <- lm(Thumb ~ Sex, data = studentdata)
From this model, we can extract our estimate of $b_1$:
model_obj$coefficients[2]
Fourthly, we determine the probability of getting a $b_1$ like this, were the null hypothesis true - if $\beta_1$ truly equals 0. Use `summary()` to find this p-value for the $b_1$ coefficient:
#Use summary() to see the p-value of the effect of Sex
And as our final step, we look at that p-value and decide whether or not it is significant - whether or not we should reject the hypothetical world of $\beta_1 = 0$. As psychologists we are likely using an $\alpha$ criterion of 0.05, so we will check if the p-value for $b_1$ is < 0.05.
According to this output, it is. Thus, we reject the null hypothesis that there is no effect of sex. Further, if we find the 95% confidence interval for $b_1$:
#Use confint() to find the 95% CI of the SexMale estimate
We see that the interval includes only positive values. Thus we reject the null hypothesis that there is no effect of sex, AND we say we are 95% confident the true $\beta_1$ is a positive number. As someone's sex label switches from 0 to 1 (from female to male), the predicted thumb length increases.
Congrats, you have tested a hypothesis and come up with a conclusion! You can now write up your study and send it for peer review! In APA style, we would write this result as:
"There is a significant effect of sex (b = 6.056, p <0.001, 95% CI [3.037, 9.075])."
By following these five steps, you were able to fit and evaluate a model of the data generation process for thumb length. Just remember that the conclusions we can make about this research question are limited, when we use null hypothesis testing. In the context of this analysis:
As we saw in chapter 13, predictors in a multivariable model can be correlated. Thus if we're interested specifically in the effect of sex, we may want to control for what might be a better explanatory variable, like height. This would enable us to investigate if sex is a significant, unique effect in its own right or if it's only related to thumb length by virtue of being related to height.
To do this, first we pick our variables. We'll use `Thumb` and `Sex` as before, as well as `Height` as another predictor.
Second, specify the null hypothesis. Here we're still interested in the effect of sex in particular. We're just using height as a control. So the null hypothesis is still:
$$H_0: \beta_1 = 0$$
Third, specify and fit the model:
$$\hat{Y} = b_0 + b_1X_{1i} + b_2X_{2i}$$
model_obj <- lm(Thumb ~ Sex + Height, data = studentdata)
Fourth, find the p-value of $b_1$:
summary(model_obj)
And finally, make a decision about whether or not sex is significant.
Because sex and height are related, there is shared variation between them - both of them explain some of the same variation in thumb length. We need to put them both in a model together in order to disentangle how much unique variation each one explains. By doing this, we see that the effect of sex got smaller after taking height into account - the coefficient fell from 6.056 to 2.998. This effect is small enough that it is no longer significant - the p-value is 0.114, which is above our $\alpha$ cutoff of 0.05. We'd interpret this to mean that in a world where there is no unique effect of sex ($\beta_1 = 0$), a $b_1$ estimate of 2.998 would be a likely outcome. We can further see this by checking the 95% confidence interval:
confint(model_obj, "SexMale", level=0.95)
This range includes 0 as one of the $\beta_1$ values that are likely to produce $b_1 = 2.998$. Thus, with the data we have, we fail to reject the null hypothesis that there is no unique effect of sex. We'd report this result as:
"The effect of sex was insignificant when controlling for height (b = 2.998, p = 0.114, 95% CI [-0.728, 6.725])."
Note again that this does not mean $\beta_1$ is definitely 0. There are still many non-zero values in the confidence interval that could be the truth. It could be that we only failed to find a significant effect because `studentdata` only had a sample size of N=157. If we ran this model in a much larger dataset, we could create a narrower CI that might not overlap with 0. It's easier to distinguish small effects from $\beta_1 = 0$ with more data. But in that case, it's always worthwhile to ask yourself if an effect that is significant, but with a very small coefficient, is worth anything to you in practical terms. Statistically significant doesn't mean important, only unlikely to be 0.
When we build interaction models, we are often most interested in the interaction term. We want to know if the effect of one variable depends on values of the other (a non-zero interaction coefficient), or if the effects of each variable operate on the outcome variable independently of each other (a zero interaction coefficient). Thus, when we have hypotheses about interactions, we usually assess the significance of the interaction term.
Let's test the hypothesis that there is an interaction between sex and height on thumb length. Our variables will be the same as in the multivariable case: `Thumb`, `Sex`, and `Height`.
Our null hypothesis is now for the interaction effect, and not the main effect of sex:
$$H_0: \beta_3 = 0$$
Specifying and fitting our model, we get:
$$\hat{Y} = b_0 + b_1X_{1i} + b_2X_{2i} + b_3X_{1i}X_{2i}$$
model_obj <- lm(Thumb ~ Sex + Height + Sex*Height, data = studentdata)
summary(model_obj)
Our hypothesis is about the interaction effect $b_3$, so investigate the estimate and p-value on the line for `SexMale:Height`. We see that the p-value of the interaction is p = 0.6145, which is not below the $\alpha$ criterion of 0.05. Thus we fail to reject the null hypothesis that there is no interaction. We can't be sure that there is no interaction truly in the population, but we are pretty confident that the true interaction effect is something close to zero:
confint(model_obj, "SexMale:Height", level = 0.95)
Without a significant interaction, we would interpret the relationship between sex, height, and thumb length just as main effects.
Further, by including more predictors in the model, we've soaked up more degrees of freedom and added a term (the interaction) that shares a lot of variation with Height. That means the standard error for the effect of Height is now larger, and its main effect is in fact no longer significant. It doesn't matter that we previously fit a model where Height was significant - this was the model we chose to test, and this one says Height does not have a significant main effect. We can't go fishing around for whichever model makes our hypothesis look better.
We'd report this as:
"There was no significant interaction between the effects of sex and height on thumb length (b = 0.258, p = 0.615, 95% CI [-0.753, 1.270]). Additionally, there was no main effect of Height (b = 0.556, p = 0.0598, 95% CI [-0.023, 1.13]) or Sex (b = -14.526, p = 0.677, 95% CI [-83.231, 54.179]."
The ideas about hypothesis testing that we covered in this and the last chapter come from the perspective of what's called Frequentist statistics. Almost every textbook for undergraduate psychology students starts with this framework, as the Frequentist view dominated the academic field of statistics for most of the 20th century and remains the most common practice among applied scientists, including psychologists. Because Frequentist methods are ubiquitous in scientific papers, every student of statistics needs to understand them.
However, Frequentist statistics can be frustrating to interpret. The major tenet of Frequentism is that there is only one state of the world. A hypothesis you are testing is either true, or it is not. We can describe the probability of generating certain datasets given the true state of the world, but if we don't know that state, we can't describe the probability of what that state is given a particular dataset we have collected. Unfortunately, that's usually the case we find ourselves in - we don't know the true state of the world, and our only hope to figure it out is by looking at some data. Frequentism thus isn't an ideal method for answering that question, and we have to do some brain-bending things to make it work for us anyway:
If these topics have felt abstract and difficult to master, you wouldn't be alone. The convoluted nature of statistical inference with Frequentist methods partially explains errors that can be found in published research.
This is why some people prefer a different approach to inferential statistics, called Bayesian statistics. These ideas as a collection are named after Thomas Bayes, the mathematician who contributed much to our understanding of probabilities. While we won't go in depth into this approach, it's good to know that there are other ways of thinking about statistical evidence.
Bayesian statistics have a fundamentally different view of probability than Frequentist statistics do. In Frequentism, probability means the long-run proportion across many samples. Whatever the true population parameter $\beta$ is, we can't measure it directly, but we can estimate it many times across many samples. This builds a sampling distribution, composed of many b estimates.
Because we sample these b estimates over and over and get different values each time, the b estimates can have probability. There is some proportion of these estimates, across many samples, that will equal a particular value.
To a Frequentist, the population parameter $\beta$ cannot have a probability. It is inherently just one value, and it isn't generated by some underlying process. It just is. A population parameter is either equal to a particular value, or it is not. There's no way to get many values of it for measuring a proportion.
In contrast, in the Bayesian perspective, probability means something different. To a Bayesian, probability is your strength of belief about the population parameter. There is still one population parameter that we don't know, but until we know it there are many possible values it could be.
We can believe more strongly in some possible values than others - one value is more likely to be the truth than another value. In this sense, a population parameter can have a probability. Do we think a $\beta$ value of 0 is more likely to be the truth than a $\beta$ value of 10? What value do we believe in most strongly?
While Bayesian statistics is not practiced as widely as Frequentist statistics, it is gaining ground among psychologists and can be used for many of the same use cases as Frequentist statistics. With this alternate approach to defining probability comes a whole suite of methods for analyzing data. If you're interested in learning about such an approach, you can read through a secret optional chapter of this textbook here.
After reading this chapter, you should be able to: