This tutorial presents a case study on swearing in Irish English based on the Irish component of the International Corpus of English (ICE-Ireland, Version 1.2.2).
This case study represents a corpus-based study of sociolinguistic variation that aims to determine whether swearing differs across social groups in Irish English. In particular, the case study analyzes whether speakers from different age groups and of different genders, i.e. older versus younger speakers and men versus women, differ in their use of swear words, based on a sub-sample of the Irish component of the International Corpus of English (ICE).
Activate required packages.
# load packages
library(dplyr)
library(stringr)
library(ggplot2)
library(openxlsx)
library(quanteda)
library(cfa)
library(here)
library(tidyr)
Loading corpus data consists of two steps:
create a list of paths of the corpus files
loop over these paths and load the data in the files identified by the paths.
To create a list of corpus files, you can use the code chunk below (the code chunk assumes that the corpus data is in a folder called ICEIreland in your R project folder).
corpusfiles <- list.files(here::here("ICEIreland"), # path to the corpus data
                          # file types you want to analyze, e.g. txt-files
                          pattern = ".*.txt",
                          # full paths - not just the names of the files
                          full.names = T)
# inspect
head(corpusfiles)
You can then use the sapply function to loop over the paths and load the data into R using e.g. the scan function as shown below. In addition to loading the file content, we also paste all the content together using the paste0 function and remove superfluous white spaces using the str_squish function from the stringr package.
corpus <- sapply(corpusfiles, function(x){
  x <- scan(x,
            what = "char",
            sep = "",
            quote = "",
            quiet = T,
            skipNul = T)
  x <- paste0(x, sep = " ", collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(corpus)
Once you have loaded your data into R, you can then continue with processing and transforming the data according to your needs.
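For example, a minimal, purely illustrative sketch of such a processing step (the object names corpus_lower and wordcounts are our assumptions, not part of the tutorial's pipeline) could convert the texts to lower case and compute approximate word counts per file:
# illustrative sketch: convert the texts to lower case
corpus_lower <- tolower(corpus)
# approximate word counts per file (splitting on white space)
wordcounts <- sapply(strsplit(corpus_lower, " "), length)
# inspect
head(wordcounts)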
You can also use your own data; below, you can see what you need to do to upload and use it.
To be able to load your own data, you need to click on the folder symbol to the left of the screen.
Then, click on the New Folder symbol, create a new folder, and call it MyData.
Then click on the upload symbol and upload your files into the MyData folder.
Select and upload the files you want to analyze (IMPORTANT: here, we assume that you upload some form of text data - not tabular data!). When you then execute the code chunk below, you will load your own data into R and can then use it in this notebook.
myfiles <- list.files(here::here("MyData"), # path to the corpus data
                      # full paths - not just the names of the files
                      full.names = T)
# load your own files
mycorpus <- sapply(myfiles, function(x){
  x <- scan(x,
            what = "char",
            sep = "",
            quote = "",
            quiet = T,
            skipNul = T)
  x <- paste0(x, sep = " ", collapse = " ")
  x <- stringr::str_squish(x)
})
# inspect
str(mycorpus)
Keep in mind, though, that you need to adapt the names of the texts in the code chunks below so that the code works on your own texts! A minimal, hypothetical sketch of such an adaptation is shown below.
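The following sketch assumes that you want to re-use the code chunks below as they are; replacing the corpus object with your own data and shortening the path-based names to bare file names are assumptions on our part, not required steps:
# hypothetical sketch: use your own data in place of the ICE-Ireland corpus
corpus <- mycorpus
# shorten the path-based element names to bare file names
names(corpus) <- basename(names(corpus))
# inspect
head(names(corpus))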
Now that the corpus data is loaded, we can prepare the searches by defining the search patterns. We will use regular expressions to retrieve all variants of the swear words. The sequence \\b denotes a word boundary, while the sequence [a-z]{0,3} means that a sequence such as ass can be followed by a string of lower-case characters that is at most three characters long (so that the search also retrieves asses). We separate the search patterns by | which means or.
searchpatterns <- c("\\bass[es]{0,2}\\b|\\basshole[s]{0,1}\\b|\\bbitch[a-z]{0,3}\\b|\\b[a-z]{0,}fuck[a-z]{0,3}\\b|\\bshit[a-z]{0,3}\\b|\\bcock[a-z]{0,1}\\b|\\bwanker[a-z]{0,3}\\b|\\bboll[io]{1,1}[a-z]{0,3}\\b|\\bcrap[a-z]{0,3}\\b|\\bbugger[a-z]{0,3}\\b|\\bcunt[a-z]{0,3}\\b")
After defining the search pattern(s), we extract the kwics (keyword(s) in context) of the swear words.
# extract kwic
kwicswears <- quanteda::kwic(tokens(corpus), searchpatterns, window = 10, valuetype = "regex")
# inspect data
head(kwicswears)
The kwic function has the following schema:
kwic(x, pattern, window = 5, valuetype = c("glob", "regex", "fixed"), separator = " ", case_insensitive = TRUE, index = NULL)
The arguments (or parameters) of the kwic function mean:
x: a character, corpus, or tokens object
pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
window: the number of context words to be displayed around the keyword
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
separator: a character to separate words in the output
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values
index: an index object to specify keywords
A small toy example of these arguments is shown below; after that, we clean the kwic so that it is easier to see the relevant information.
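As a minimal sketch of how these arguments interact, the following call applies kwic to a single invented sentence (the sentence is an assumption, not corpus data) with a narrow window and fixed matching:
# toy example: kwic on an invented sentence (not corpus data)
toy <- "The cat sat on the mat and the cat slept"
quanteda::kwic(quanteda::tokens(toy), pattern = "cat", window = 2, valuetype = "fixed")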
kwicswearsclean <- kwicswears %>%
  as.data.frame() %>%
  dplyr::rename("File" = colnames(.)[1],
                "PreviousContext" = colnames(.)[4],
                "Token" = colnames(.)[5],
                "FollowingContext" = colnames(.)[6]) %>%
  dplyr::select(-from, -to, -pattern) %>%
  dplyr::mutate(File = str_remove_all(File, ".*/"),
                File = stringr::str_remove_all(File, " .*"))
# inspect data
head(kwicswearsclean)
We now create another kwic, but with much more context, because we want to extract the speaker who uttered the swear word. To this end, we remove everything that precedes the $ symbol (as the speakers are identified by the characters that follow the $ symbol), remove everything that follows the > symbol (which ends the speaker identification sequence), remove remaining white spaces, and convert the remaining characters to upper case.
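Before applying this to the corpus, a quick sketch on an invented ICE-style context string (the string is an assumption, modeled on the corpus markup) shows how these removal steps work:
# invented example of pre-keyword context containing ICE-style speaker markup
pre_example <- "some preceding talk <S1A-001$A> well you know"
spk <- stringr::str_remove_all(pre_example, ".*\\$")  # remove everything up to and including $
spk <- stringr::str_remove_all(spk, ">.*")            # remove everything from > onwards
toupper(stringr::str_squish(spk))                     # returns "A"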
# extract kwic
Speaker <- kwic(tokens(corpus), searchpatterns, window = 1000, valuetype = "regex") %>%
  as.data.frame() %>%
  dplyr::mutate(Speaker = str_remove_all(pre, ".*\\$"),
                Speaker = str_remove_all(Speaker, ">.*"),
                Speaker = str_squish(Speaker),
                Speaker = toupper(Speaker)) %>%
  dplyr::pull(Speaker)
# inspect results
head(Speaker)
We now add the Speaker to our initial kwic. This way, we combine the swear word kwic with the speaker, and since we already have the file, we can use the file plus the speaker identification to check whether a speaker was a man or a woman.
swire <- kwicswearsclean %>%
  dplyr::mutate(Speaker = Speaker)
# inspect data
head(swire)
Now, we inspect the extracted swear word tokens to check if our search strings have indeed captured swear words.
# convert tokens to lower case
swire <- swire %>%
  dplyr::mutate(Token = tolower(Token))
# inspect tokens
table(swire$Token)
Fuck and its variants are by far the most common swear words in our corpus. However, we do not need the type of swear word to answer our research question, and we thus summarize the table to show how many swear words each speaker in each file has used.
swire <- swire %>%
  dplyr::mutate(File = stringr::str_remove_all(File, " .*")) %>%
  dplyr::group_by(File, Speaker) %>%
  dplyr::summarise(Swearwords = n())
# inspect data
head(swire)
Now that we have extracted how many swear words the speakers in the corpus have used, we can load the biodata of the speakers.
# load bio data
bio <- openxlsx::read.xlsx(here::here("speakerinfo/SpeakerInfoICEIreland.xlsx"))
# inspect data
head(bio)
To save data from a notebook, you need to save the data and then you can simply download it. The code below saves the data to the folder you are currently in (NOTE: the data will be deleted after the session closes though!).
openxlsx::write.xlsx(bio, here::here("SpeakerInfoICEIreland.xlsx"))
In the next step, we rename the column names so that they match the column names in the concordance data (swire).
bio <- bio %>%
  dplyr::rename(File = text.id,
                Speaker = spk.ref,
                Gender = sex,
                Age = age,
                Words = word.count) %>%
  dplyr::select(File, Speaker, Gender, Age, Words)
# inspect data
head(bio)
In the next step, we combine the table containing the speaker information with the table showing the swear word use.
# combine frequencies and biodata
swire <- dplyr::left_join(bio, swire, by = c("File", "Speaker")) %>%
  # replace NA with 0
  dplyr::mutate(Swearwords = ifelse(is.na(Swearwords), 0, Swearwords),
                File = factor(File),
                Speaker = factor(Speaker),
                Gender = factor(Gender),
                Age = factor(Age))
# inspect data
head(swire)
We now clean the table by removing speakers for whom we do not have any information on age and gender. Also, we summarize the table to extract the relative frequencies of swear words (per 1,000 words) by age and gender.
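The per-1,000-words normalization simply divides the number of swear words by the total number of words and multiplies the result by 1,000. As a quick sanity check with invented numbers:
# illustrative calculation with invented numbers: 12 swear words in 48,000 words
round(12 / 48000 * 1000, 3)  # 0.25 swear words per 1,000 words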
# clean data
swire_vis <- swire %>%
  dplyr::filter(is.na(Gender) == F,
                is.na(Age) == F,
                Age != "0-18") %>%
  dplyr::group_by(Age, Gender) %>%
  dplyr::summarise(SumWords = sum(Words),
                   SumSwearwords = sum(Swearwords),
                   FrequencySwearwords = round(SumSwearwords/SumWords*1000, 3))
# inspect data
head(swire_vis)
Now that we have prepared our data, we can plot swear word use by gender.
swire_vis %>%
  ggplot(aes(x = Age, y = FrequencySwearwords, group = Gender, fill = Gender)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme_bw() +
  scale_fill_manual(values = c("orange", "darkgrey")) +
  labs(y = "Relative frequency \n swear words per 1,000 words")
ggsave(here::here("swearing_ire.png"))
The graph suggests that the genders do not differ in their use of swear words except for the age bracket from 26 to 41: men swear more among speakers aged between 26 and 33 while women swear more between 34 and 41 years of age.
We now perform a statistical test, namely a Configural Frequency Analysis (CFA), to check which groups in the data significantly over- and under-use swear words.
cfa_swear <- swire %>%
  dplyr::filter(is.na(Gender) == F,
                is.na(Age) == F,
                Age != "0-18") %>%
  dplyr::group_by(Age, Gender) %>%
  dplyr::summarise(SumWords = sum(Words),
                   SumSwearwords = sum(Swearwords)) %>%
  tidyr::gather(Type, Frequency, SumWords:SumSwearwords)
# inspect data
head(cfa_swear, 20)
We now select only Age, Gender, and Type and store the counts in a separate vector (or object).
# define configurations
configs <- cfa_swear %>%
  dplyr::select(Age, Gender, Type)
# define counts
counts <- cfa_swear$Frequency
Now that configurations and counts are separated, we can perform the configural frequency analysis.
# perform cfa
cfa(configs, counts)$table %>%
  as.data.frame() %>%
  dplyr::filter(p.chisq < .1,
                stringr::str_detect(label, "Swear")) %>%
  dplyr::select(-z, -p.z, -sig.z, -sig.chisq, -Q)
After filtering out the significant over-use of non-swear words from the results of the CFA, we find that men and women in the age bracket of 26 to 33 use significantly more swear words than the other groups in the data.
It has to be borne in mind, though, that this is merely a case study and that a more fine-grained analysis of a substantially larger data set would be necessary to arrive at a more reliable impression.
We end the session by calling the session info, which tells us which packages and which versions of the software and packages we have used.
sessionInfo()