In this tutorial, we are going to see how to determine whether there is a significant association between two categorical variables.
The $\chi^2$ test for independence is a statistical test based on the $\chi^2$ statistic, which compares the observed frequencies in a contingency table with the expected frequencies under the assumption that the two variables are independent. In this tutorial, we will learn how to perform the $\chi^2$ test for independence in R.
To illustrate this, we will again employ the dataset used in the previous tutorial on plotting categorical data. These data contain the results of a clinical trial in which some participants were administered a drug and others a placebo, and their progress was monitored. At the end of the study, it was determined whether each participant's condition had worsened, remained the same, or improved.
dat.tutorial<-read.csv("https://raw.githubusercontent.com/jrasero/cm-85309-2023/main/datasets/tutorial8chisquare.csv")
To perform the $\chi^2$ test for independence, we first need to create a contingency table of the two categorical variables. We can do this using the table function as follows:
table(dat.tutorial$group,dat.tutorial$evolution)
          Better Same Worse
  Drug        30   28    11
  Placebo     25   44    33
Alternatively, you can produce the same table by combining a formula with xtabs:
xtabs(~group + evolution, dat.tutorial)
         evolution
group     Better Same Worse
  Drug        30   28    11
  Placebo     25   44    33
In both cases, the contingency table displays the observed frequencies of the group and evolution variables. For instance, 30 participants received the drug and showed a better evolution, 28 had no change, and 11 experienced a worse evolution. Among those who received the placebo, 25 improved, 44 had no change, and 33 worsened.
Now that we have our contingency table, we have everything we need to conduct the $\chi^2$ test in R. This is done using the chisq.test function, which takes a contingency table as input and returns the $\chi^2$ statistic, the degrees of freedom, and the p-value.
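For reference, the statistic computed from an $r \times c$ table is

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},$$

where $O_{ij}$ and $E_{ij}$ are the observed and expected counts in cell $(i,j)$, and the statistic is compared against a $\chi^2$ distribution with $(r-1)(c-1)$ degrees of freedom.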
For example, to conduct a $\chi^2$ test for independence on the contingency table that we created above:
chisq.test(table(dat.tutorial$group,dat.tutorial$evolution))
chisq.test(xtabs(~group + evolution, dat.tutorial))
	Pearson's Chi-squared test

data:  table(dat.tutorial$group, dat.tutorial$evolution)
X-squared = 8.976, df = 2, p-value = 0.01124
	Pearson's Chi-squared test

data:  xtabs(~group + evolution, dat.tutorial)
X-squared = 8.976, df = 2, p-value = 0.01124
As we can see, the p-value is ~0.01. If we assume a type I error rate of $\alpha = 0.05$, we should therefore reject the null hypothesis; that is, there appears to be an association between receiving the drug or not and the change in health evolution.
How is this computed? The $\chi^2$ test for independence uses both the observed frequencies and the expected ones to compute the $\chi^2$ statistic. Fortunately, R does all of these computations for us when we call chisq.test:
res<-chisq.test(table(dat.tutorial$group,dat.tutorial$evolution))
names(res)
# The observed frequencies (i.e. the table we input)
res$observed
# The expected frequencies
res$expected
          Better Same Worse
  Drug        30   28    11
  Placebo     25   44    33
| | Better | Same | Worse |
|---|---|---|---|
| Drug | 22.19298 | 29.05263 | 17.75439 |
| Placebo | 32.80702 | 42.94737 | 26.24561 |
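As a sanity check, we can reproduce these expected counts, and the $\chi^2$ statistic itself, from the marginal totals alone. The sketch below types the observed counts in by hand from the table above:

```r
# Observed counts, copied by hand from the contingency table above
obs <- matrix(c(30, 28, 11,
                25, 44, 33),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Drug", "Placebo"),
                              c("Better", "Same", "Worse")))

# Under independence, E[i, j] = (row total i) * (column total j) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected

# The chi-squared statistic is the sum of (O - E)^2 / E over all cells
sum((obs - expected)^2 / expected)   # ~ 8.976, matching chisq.test()
```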
Fisher's exact test is a statistical test used to determine whether there is a significant association between two categorical variables. It is particularly useful when the sample size is small and some expected cell counts are less than 5, since the $\chi^2$ test is not recommended in such scenarios.
Fisher's exact test is based on the hypergeometric distribution: it computes the probability of observing a particular table given the marginal totals.
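Concretely, for a 2x2 table with cell counts $a$, $b$ (first row), $c$, $d$ (second row) and total $n = a+b+c+d$, the probability of a particular table with the given margins is

$$P = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}},$$

and the two-sided p-value sums these probabilities over all tables at least as extreme as the observed one.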
The null hypothesis of Fisher's exact test is that there is no association between the two variables.
N.B. In its classical form, this test applies to 2x2 contingency tables (R's fisher.test also accepts larger tables, at greater computational cost).
Let's create a 2x2 contingency table for running this demonstration:
fisher.table<-matrix(c(5,1, 2,7), nrow=2, byrow=TRUE)
colnames(fisher.table)<-c("A", "B")
rownames(fisher.table)<-c("a", "b")
fisher.table
| | A | B |
|---|---|---|
| a | 5 | 1 |
| b | 2 | 7 |
Let's see first what happens when we try to run a $\chi^2$ test on this table:
res.chisq<-chisq.test(fisher.table)
res.chisq
Warning message in chisq.test(fisher.table): “Chi-squared approximation may be incorrect”
	Pearson's Chi-squared test with Yates' continuity correction

data:  fisher.table
X-squared = 3.2254, df = 1, p-value = 0.0725
As you can see, there is a warning message indicating that this test might not be correct. This is because the expected counts in some of the cells are below 5. Let's check:
res.chisq$expected
| | A | B |
|---|---|---|
| a | 2.8 | 3.2 |
| b | 4.2 | 4.8 |
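This rule of thumb is easy to check programmatically. The sketch below re-creates the table so the snippet is self-contained, and recomputes the expected counts from the margins:

```r
# Rule-of-thumb check: flag the table if any expected count falls below 5
fisher.table <- matrix(c(5, 1,
                         2, 7), nrow = 2, byrow = TRUE,
                       dimnames = list(c("a", "b"), c("A", "B")))
expected <- outer(rowSums(fisher.table), colSums(fisher.table)) / sum(fisher.table)
any(expected < 5)   # TRUE: the chi-squared approximation is suspect here
```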
In these cases, Fisher's exact test may be a more appropriate choice. To perform it in R, you can use the fisher.test function, which also takes a contingency table as input and returns the p-value of the test, as well as the odds ratio and its confidence interval.
res.fisher<-fisher.test(fisher.table)
res.fisher
	Fisher's Exact Test for Count Data

data:  fisher.table
p-value = 0.04056
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.8646648 934.0087368
sample estimates:
odds ratio 
  13.59412 
Here, the p-value is ~0.04, which is below the usual type I error rate $\alpha=0.05$, and therefore we should reject the null hypothesis that there is no association between the two categorical variables.
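Under the hood, this p-value can be reproduced directly from the hypergeometric distribution. The sketch below counts a table as "at least as extreme" when its probability does not exceed that of the observed one (up to a small relative tolerance, as fisher.test does):

```r
# Margins of fisher.table: row-1 total = 6, column totals = 7 and 8.
# Given the margins, the (1,1) cell follows a hypergeometric distribution.
x.obs <- 5                                  # observed count in cell (1, 1)
probs <- dhyper(0:6, m = 7, n = 8, k = 6)   # P(cell (1,1) = x) for x = 0..6
p.obs <- dhyper(x.obs, m = 7, n = 8, k = 6)

# Two-sided p-value: sum over all tables no more probable than the observed
p.value <- sum(probs[probs <= p.obs * (1 + 1e-7)])
p.value   # ~ 0.04056, matching fisher.test(fisher.table)$p.value
```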
This example is interesting because, had we used the $\chi^2$ test at the same significance level, the conclusion regarding the null hypothesis would have been different:
# p-value running chisquare
res.chisq$p.value
# p-value running fisher
res.fisher$p.value
Nevertheless, the fact that both p-values are so close to the assumed Type I error rate $\alpha=0.05$ should make us cautious when drawing any conclusion. For example, while a p-value of 0.051 is technically above the threshold and a p-value of 0.049 is below it, in practical terms the difference between them is negligible. Therefore, we should not attach too much significance to a p-value that is only slightly below the threshold, nor dismiss one that is only slightly above it.
The McNemar test is a statistical test used to determine whether there is a significant difference between paired proportions. It is commonly used to compare the frequency of a particular outcome before and after an intervention or treatment. The test is based on the binomial distribution and assumes that the data are PAIRED. If your data are not paired, you should use a different test, such as the $\chi^2$ test for independence or Fisher's exact test.
N.B. This test can only be used with 2x2 contingency tables.
# Create a 2x2 contingency table
table.mcnemar <- matrix(c(5, 3, 12, 3), nrow = 2, byrow = TRUE)
rownames(table.mcnemar) <- c("Before: Yes", "Before: No")
colnames(table.mcnemar) <- c("After: Yes", "After: No")
table.mcnemar

| | After: Yes | After: No |
|---|---|---|
| Before: Yes | 5 | 3 |
| Before: No | 12 | 3 |
To run the McNemar test in R, we can use the mcnemar.test function. This function takes a 2x2 contingency table as input and returns the test statistic, its degrees of freedom, and the p-value.
# run the McNemar test
res.mcnemar <- mcnemar.test(table.mcnemar)
# Print the summary of running this test
res.mcnemar
# As usual, the output is just a list with a bunch of results
names(res.mcnemar)
	McNemar's Chi-squared test with continuity correction

data:  table.mcnemar
McNemar's chi-squared = 4.2667, df = 1, p-value = 0.03887
As you can see, assuming the usual Type I error rate $\alpha=0.05$, we should reject the null hypothesis that there is no significant change in the proportions before and after the experiment.
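The statistic above depends only on the two off-diagonal (discordant) cells of the table, i.e. the pairs whose answer changed. With the continuity correction applied by mcnemar.test, it can be reproduced by hand:

```r
# Off-diagonal (discordant) counts, copied from table.mcnemar above
n12 <- 3
n21 <- 12

# McNemar's statistic with continuity correction, compared against a
# chi-squared distribution with 1 degree of freedom
stat <- (abs(n12 - n21) - 1)^2 / (n12 + n21)
p.value <- pchisq(stat, df = 1, lower.tail = FALSE)
round(c(statistic = stat, p.value = p.value), 5)   # ~ 4.26667 and ~ 0.03887
```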