In this tutorial, we are going to see how to determine whether there is a significant association between two categorical variables.
The $\chi^2$ test for independence is a statistical test based on the $\chi^2$ statistic, which compares the observed frequencies in a contingency table with the expected frequencies under the assumption that the two variables are independent. In this tutorial, we will learn how to perform the $\chi^2$ test for independence in R.
To illustrate this, we will again employ the dataset used in the previous tutorial on plotting categorical data. These data contain the results of a clinical trial in which some participants were administered a drug and others a placebo, and their progress was monitored. At the end of the study, it was determined whether each participant's condition had worsened, remained the same, or improved.
dat.tutorial<-read.csv("https://raw.githubusercontent.com/jrasero/cm-85309-2023/main/datasets/tutorial8chisquare.csv")
To perform the $\chi^2$ test for independence, we first need to create a contingency table of the two categorical variables. We can do this using the table function as follows:
table(dat.tutorial$group,dat.tutorial$evolution)
          Better Same Worse
  Drug        30   28    11
  Placebo     25   44    33
Alternatively, you can produce the same table by combining a formula with xtabs:
xtabs(~group + evolution, dat.tutorial)
         evolution
group     Better Same Worse
  Drug        30   28    11
  Placebo     25   44    33
In both cases, the contingency table displays the observed frequencies of the group and evolution variables. For instance, 30 participants received the drug and showed a better evolution, 28 had no change, and 11 experienced a worse evolution. Among those who received the placebo, 25 improved, 44 had no change, and 33 worsened.
Now that we have our contingency table, we have everything we need to conduct the $\chi^2$ test in R. This is done using the chisq.test function, which takes a contingency table as input and returns the $\chi^2$ statistic, the degrees of freedom, and the p-value.
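For reference, the statistic computed from an $r \times c$ table is

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},$$

where $O_{ij}$ and $E_{ij}$ are the observed and expected counts in cell $(i,j)$, and the statistic is compared against a $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom.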
For example, to conduct a $\chi^2$ test for independence on the contingency table that we created above:
chisq.test(table(dat.tutorial$group,dat.tutorial$evolution))
chisq.test(xtabs(~group + evolution, dat.tutorial))
	Pearson's Chi-squared test

data:  table(dat.tutorial$group, dat.tutorial$evolution)
X-squared = 8.976, df = 2, p-value = 0.01124
	Pearson's Chi-squared test

data:  xtabs(~group + evolution, dat.tutorial)
X-squared = 8.976, df = 2, p-value = 0.01124
As we can see, the p-value is ~0.01. If we assume a type I error rate of $\alpha = 0.05$, we should therefore reject the null hypothesis; that is, there appears to be an association between receiving the drug or not and the change in health evolution.
How is this computed? The $\chi^2$ test for independence uses both the observed frequencies and the expected ones to compute the $\chi^2$ statistic. Fortunately, R does all of these computations for us when we call chisq.test:
res<-chisq.test(table(dat.tutorial$group,dat.tutorial$evolution))
names(res)
# The observed frequencies (i.e. the table we input)
res$observed
# The expected frequencies
res$expected
          Better Same Worse
  Drug        30   28    11
  Placebo     25   44    33
| | Better | Same | Worse |
|---|---|---|---|
| Drug | 22.19298 | 29.05263 | 17.75439 |
| Placebo | 32.80702 | 42.94737 | 26.24561 |
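As a sanity check, we can reproduce these expected counts, and the $\chi^2$ statistic itself, from the marginal totals alone. The sketch below types the observed counts in by hand from the table above:

```r
# Observed counts, copied by hand from the contingency table above
obs <- matrix(c(30, 28, 11,
                25, 44, 33),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Drug", "Placebo"),
                              c("Better", "Same", "Worse")))

# Under independence, E[i, j] = (row total i) * (column total j) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# The chi-squared statistic is the sum of (O - E)^2 / E over all cells
sum((obs - expected)^2 / expected)   # ~ 8.976, matching chisq.test()
```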
Fisher's exact test is a statistical test used to determine whether there is a significant association between two categorical variables. It is particularly useful when the sample size is small and some expected cell counts are less than 5, since the $\chi^2$ test is not recommended in such scenarios.
Fisher's exact test is based on the hypergeometric distribution: it computes the probability of observing a particular table given the marginal totals.
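Concretely, for a 2x2 table with cell counts $a$, $b$ (first row), $c$, $d$ (second row) and total $n = a+b+c+d$, the probability of a particular table with the given margins is

$$P = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}},$$

and the two-sided p-value sums these probabilities over all tables at least as extreme as the observed one.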
The null hypothesis of Fisher's exact test is that there is no association between the two variables.
N.B. In its classical form, this test applies to 2x2 contingency tables (R's fisher.test also accepts larger tables, at greater computational cost).
Let's create a 2x2 contingency table for running this demonstration:
fisher.table<-matrix(c(5,1, 2,7), nrow=2, byrow=TRUE)
colnames(fisher.table)<-c("A", "B")
rownames(fisher.table)<-c("a", "b")
fisher.table
| | A | B |
|---|---|---|
| a | 5 | 1 |
| b | 2 | 7 |
Let's see first what happens when we try to run a $\chi^2$ test on this table:
res.chisq<-chisq.test(fisher.table)
res.chisq
Warning message in chisq.test(fisher.table): “Chi-squared approximation may be incorrect”
	Pearson's Chi-squared test with Yates' continuity correction

data:  fisher.table
X-squared = 3.2254, df = 1, p-value = 0.0725
As you can see, there is a warning message indicating that this test might not be correct. This is because the expected counts in some of the cells are below 5. Let's check:
res.chisq$expected
| | A | B |
|---|---|---|
| a | 2.8 | 3.2 |
| b | 4.2 | 4.8 |
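This rule of thumb is easy to check programmatically. The sketch below re-creates the table so the snippet is self-contained, and recomputes the expected counts from the margins:

```r
# Rule-of-thumb check: flag the table if any expected count falls below 5
fisher.table <- matrix(c(5, 1,
                         2, 7), nrow = 2, byrow = TRUE,
                       dimnames = list(c("a", "b"), c("A", "B")))
expected <- outer(rowSums(fisher.table), colSums(fisher.table)) / sum(fisher.table)
any(expected < 5)   # TRUE: the chi-squared approximation is suspect here
```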
In these cases, Fisher's exact test may be a more appropriate choice. To perform it in R, you can use the fisher.test function, which also takes a contingency table as input and returns the p-value of the test, as well as the odds ratio and its confidence interval.
res.fisher<-fisher.test(fisher.table)
res.fisher
	Fisher's Exact Test for Count Data

data:  fisher.table
p-value = 0.04056
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.8646648 934.0087368
sample estimates:
odds ratio 
  13.59412 
Here, the p-value is ~0.04, which is below the usual type I error rate $\alpha=0.05$, and therefore we should reject the null hypothesis that there is no association between the two categorical variables.
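Under the hood, this p-value can be reproduced directly from the hypergeometric distribution. The sketch below counts a table as "at least as extreme" when its probability does not exceed that of the observed one (up to a small relative tolerance, as fisher.test does):

```r
# Margins of fisher.table: row-1 total = 6, column totals = 7 and 8.
# Given the margins, the (1,1) cell follows a hypergeometric distribution.
x.obs <- 5                                  # observed count in cell (1, 1)
probs <- dhyper(0:6, m = 7, n = 8, k = 6)   # P(cell (1,1) = x) for x = 0..6
p.obs <- dhyper(x.obs, m = 7, n = 8, k = 6)

# Two-sided p-value: sum over all tables no more probable than the observed
p.value <- sum(probs[probs <= p.obs * (1 + 1e-7)])
p.value   # ~ 0.04056, matching fisher.test(fisher.table)$p.value
```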
This example is interesting because, had we used the $\chi^2$ test at the same significance level, the conclusion regarding the null hypothesis would have been different:
# p-value running chisquare
res.chisq$p.value
# p-value running fisher
res.fisher$p.value
Nevertheless, the fact that both p-values are so close to the assumed Type I error rate $\alpha=0.05$ should make us cautious when drawing any conclusion. For example, while a p-value of 0.051 is technically above the threshold and a p-value of 0.049 is below it, in practical terms the difference between them is negligible. Therefore, we should not attach too much significance to a p-value that is only slightly below the threshold, nor dismiss one that is only slightly above it.
The McNemar test is a statistical test used to determine whether there is a significant difference between paired proportions. It is commonly used to compare the frequency of a particular outcome before and after an intervention or treatment. The test is based on the binomial distribution and assumes that the data are PAIRED. If your data are not paired, you should use a different test, such as the $\chi^2$ test for independence or Fisher's exact test.
N.B. This test can only be used with 2x2 contingency tables.
# Create a 2x2 contingency table
table.mcnemar <- matrix(c(5, 3, 12, 3), nrow = 2, byrow = TRUE)
rownames(table.mcnemar) <- c("Before: Yes", "Before: No")
colnames(table.mcnemar) <- c("After: Yes", "After: No")
table.mcnemar

| | After: Yes | After: No |
|---|---|---|
| Before: Yes | 5 | 3 |
| Before: No | 12 | 3 |
To run the McNemar test in R, we can use the mcnemar.test function. This function takes a 2x2 contingency table as input and returns the test statistic, its degrees of freedom, and the p-value.
# run the McNemar test
res.mcnemar <- mcnemar.test(table.mcnemar)
# Print the summary of running this test
res.mcnemar
# As usual, the output is just a list with a bunch of results
names(res.mcnemar)
	McNemar's Chi-squared test with continuity correction

data:  table.mcnemar
McNemar's chi-squared = 4.2667, df = 1, p-value = 0.03887
As you can see, assuming the usual Type I error rate $\alpha=0.05$, we should reject the null hypothesis that there is no significant change in the proportions before and after the experiment.
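The statistic above depends only on the two off-diagonal (discordant) cells of the table, i.e. the pairs whose answer changed. With the continuity correction applied by mcnemar.test, it can be reproduced by hand:

```r
# Off-diagonal (discordant) counts, copied from table.mcnemar above
n12 <- 3
n21 <- 12

# McNemar's statistic with continuity correction, compared against a
# chi-squared distribution with 1 degree of freedom
stat <- (abs(n12 - n21) - 1)^2 / (n12 + n21)
p.value <- pchisq(stat, df = 1, lower.tail = FALSE)
round(c(statistic = stat, p.value = p.value), 5)   # ~ 4.26667 and ~ 0.03887
```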