This tutorial shows how you can perform topic modeling in R.
However, please keep in mind that the case studies merely aim to exemplify ways in which R can be used in language-based research rather than to provide detailed procedures for corpus-based research.
Topic modeling is a technique used in data analysis to identify and extract topics or themes from a large collection of texts. It is a way to automatically identify patterns and insights in large amounts of unstructured data.
Topic modeling is useful in a variety of fields, such as social sciences, humanities, and marketing research. For example, it can be used to analyze customer reviews to identify common themes, to study political speeches to identify key issues and topics, or to analyze social media data to understand public opinion on a particular topic.
The technique works by analyzing word frequencies in a text corpus and grouping words that frequently co-occur into topics. These topics can then be interpreted and labeled based on the words most strongly associated with them. The result is a set of topics that represent the most important themes in the corpus.
Activate required packages.
# load packages
library(here)                 # file paths
library(dplyr)                # data wrangling (also provides the %>% pipe)
library(tidyr)                # data reshaping
library(quanteda)             # tokenization and document-feature matrices
library(quanteda.textstats)
library(quanteda.textplots)
library(seededlda)            # (seeded) topic models
library(ggplot2)              # visualization
library(writexl)              # exporting MS Excel spreadsheets
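Before we turn to real data, the following minimal toy sketch illustrates the principle: six invented mini-documents about two themes (pets and finance) are converted into a document-feature matrix, and an LDA with two topics is fitted. The documents and the number of topics are made up purely for illustration.
# toy corpus: six invented one-sentence documents about two themes
toy <- c("cats purr and cats sleep all day",
         "dogs bark and dogs play outside",
         "my cat chases the dog around the house",
         "stocks rise when markets are calm",
         "investors buy stocks and bonds",
         "markets fall and investors panic")
# tokenize, remove stop words, and convert to a document-feature matrix
toy_dfm <- dfm(tokens_remove(tokens(toy, remove_punct = TRUE), stopwords("en")))
# fit an unsupervised LDA with two topics
toy_lda <- seededlda::textmodel_lda(toy_dfm, k = 2)
# show the five words most strongly associated with each topic
terms(toy_lda, 5)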
To perform topic modelling, we first need to load some data. In this tutorial, we will use the essays written by German and Spanish learners of English provided in the International Corpus of Learner English (ICLE).
Loading corpus data into R consists of two steps:
create a list of the paths of the corpus files
loop over these paths and load the data from the files they identify.
To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the ICLE data is in a folder called ICLE).
# list the ICLE corpus files
corpusfiles <- list.files(here::here("ICLE"), # path to the corpus data
                          # essays by German (GE) and Spanish (SP) learners
                          pattern = "GE|SP",
                          # full paths, not just the names of the files
                          full.names = TRUE)
# load the files by scanning their content
corpus <- sapply(corpusfiles, function(x){
  x <- scan(x, what = "char", sep = "", quote = "", quiet = TRUE, skipNul = TRUE)
  # collapse the word vector into a single string
  x <- paste(x, collapse = " ")
  # remove superfluous white spaces
  x <- stringr::str_squish(x)
})
# inspect
str(corpus)
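As a quick sanity check, you can verify how many files were loaded and inspect the first few file names (the exact numbers depend on your copy of the data):
# number of loaded files
length(corpus)
# first few file names
head(basename(names(corpus)))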
To use your own data, click on the folder called `MyTexts` (it is in the menu to the left of the screen) and then simply drag and drop your txt-files into the folder.
When you then execute the code chunk below, your own data will be loaded so that you can use it in this notebook.
You can upload only txt-files (simple unformatted files created in or saved by a text editor)!
The notebook assumes that you upload some form of text data - not tabular data!
# load function that helps loading texts
source("https://slcladal.github.io/rscripts/loadtxts.R")
# load texts
text <- loadtxts("notebooks/MyTexts")
# inspect the structure of the text object
str(text)
We start by cleaning the corpus data by removing tags and superfluous white spaces.
corpus_clean <-
# IF YOU ARE USING YOUR OWN DATA:
# REPLACE "corpus" WITH "text" IN THE LINE BELOW
# remove tags (e.g. <...>)
stringr::str_remove_all(corpus, "<.*?>") %>%
# remove superfluous white spaces
stringr::str_squish()
# inspect
substr(corpus_clean[1], start=1, stop=200)
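If your data contains further artefacts, you could extend the cleaning pipeline. The sketch below, for instance, additionally removes all non-alphanumeric characters (except white spaces and apostrophes); this step is optional and not part of the pipeline above, and the object name `corpus_strict` is just an example.
# optional: additionally remove all characters that are neither
# alphanumeric, white space, nor apostrophes
corpus_strict <- corpus_clean %>%
  stringr::str_remove_all("[^[:alnum:][:space:]']") %>%
  stringr::str_squish()
# inspect
substr(corpus_strict[1], start=1, stop=200)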
We now split the clean corpus into individual words (tokens), remove stop words and other unwanted items, and convert the tokens into a document-feature matrix.
# tokenize the clean corpus
toks_corpus <- tokens(corpus_clean, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
# remove stop words and other unwanted tokens
toks_corpus <- tokens_remove(toks_corpus, pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst"))
# convert to a document-feature matrix and trim rare and overly frequent terms
dfmat_corpus <- dfm(toks_corpus) %>%
  dfm_trim(min_termfreq = 0.8, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
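To see what the trimming has done, you can check the dimensions of the resulting document-feature matrix and its most frequent features:
# number of documents and features after trimming
dim(dfmat_corpus)
# the 20 most frequent features
topfeatures(dfmat_corpus, 20)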
Now that we have cleaned the data, we can perform the topic modelling. This consists of two steps:
First, we perform an unsupervised LDA. We do this to check what topics are in our corpus.
Then, we perform a supervised LDA (based on the results of the unsupervised LDA) to identify meaningful topics in our data. For the supervised LDA, we define so-called seed terms that help in generating coherent topics.
Here we look for 15 topics, but you should vary the number of topics (k) to check which topics are present in your data.
# set seed
set.seed(1234)
# generate model: change k to different numbers, e.g. 10 or 20 and look for consistencies in the keywords for the topics below.
tmod_lda <- seededlda::textmodel_lda(dfmat_corpus, k = 15)
# inspect
terms(tmod_lda, 10)
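To settle on a sensible number of topics, you could refit the model with different values of k and compare the keywords across runs; a minimal sketch (the candidate values 10 and 20 are arbitrary):
# refit the model with different numbers of topics and compare the keywords
for (k in c(10, 20)) {
  set.seed(1234)
  tmod_k <- seededlda::textmodel_lda(dfmat_corpus, k = k)
  print(terms(tmod_k, 10))
}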
Now, we perform a supervised (seeded) LDA. Here, we use the keywords extracted by the unsupervised LDA as seed terms for the topics in order to create coherent topics.
IMPORTANT: If you are using your own data, you need to adapt the topics and keywords defined below. Simply replace the topics and seed terms with your own (based on the results of the unsupervised LDA!).
# semisupervised LDA
dict <- dictionary(list(Computer = c("computers", "information", "machine", "computer"),
Education = c("students", "courses", "education", "university"),
Movies = c("movie", "film", "commercial", "watch"),
Family = c("parents", "home", "mother", "father"),
War = c("war", "peace", "somalia", "consequences"),
Foreigners = c("foreigners", "germans", "turkish", "turks"),
Phone = c("phone", "telephone", "call"),
Food = c("mcdonald's", "restaurant", "chips", "fastfood", "taste"),
Pets = c("dog*", "walk", "cat", "pet", "happy"),
Eco = c("green", "car*", "drive", "speed", "accident", "exhaust"),
Dating = c("girl", "boy", "date", "merries")))
tmod_slda <- textmodel_seededlda(dfmat_corpus, dict, residual = TRUE, min_termfreq = 10)
terms(tmod_slda)
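In addition to the top terms, the fitted model stores the estimated topic proportions per document in its `theta` matrix; inspecting it shows how strongly each topic is represented in each text:
# topic proportions (documents x topics) for the first five texts
head(tmod_slda$theta, 5)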
We can now inspect the topics by file. This shows which topic is most prominent in each text.
topics(tmod_slda)[1:20]
Now, we extract the file names and create a data frame of topics and documents. This shows, in tabular form, which topic is dominant in which file.
# extract the file names
files <- stringr::str_replace_all(names(topics(tmod_slda)), ".*/(.*?).txt", "\\1")
# extract the dominant topic of each file
topics <- topics(tmod_slda)
# infer the learners' native language from the file names
language <- ifelse(stringr::str_detect(files, "GE"), "German", "Spanish")
# combine into a data frame
df <- data.frame(language, topics) %>%
  dplyr::mutate_if(is.character, factor)
# inspect
head(df)
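A simple cross-tabulation already shows how often each topic is dominant in the essays of each learner group:
# cross-tabulate dominant topics by learner language
table(df$language, df$topics)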
To export a data frame as an MS Excel spreadsheet, we use `write_xlsx`. Be aware that we use the `here` function to save the file in the `MyOutput` folder.
# save data to the MyOutput folder
write_xlsx(df, here::here("notebooks/MyOutput/df.xlsx"))
You will find the generated MS Excel spreadsheet named "df.xlsx" in the `MyOutput` folder (located on the left side of the screen).
Simply double-click the `MyOutput` folder icon, then right-click on the "df.xlsx" file, and choose Download from the dropdown menu to download the file.
To visualize the results, we first summarise the table to show the percentage of topics by language.
dfp <- df %>%
  # count the occurrences of each topic by language
  dplyr::group_by(language, topics) %>%
  dplyr::summarise(freq = dplyr::n()) %>%
  # calculate percentages within each language
  dplyr::group_by(language) %>%
  dplyr::mutate(all = sum(freq),
                percent = round(freq/all*100, 2))
# inspect
head(dfp)
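If you prefer a wide table with one column per language, you could reshape the summary with `tidyr` (loaded above); a minimal sketch:
# optional: reshape the percentages into a wide topic-by-language table
dfp %>%
  dplyr::ungroup() %>%
  dplyr::select(topics, language, percent) %>%
  tidyr::pivot_wider(names_from = language, values_from = percent)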
In a final step, we visualize the results.
dfp %>%
ggplot(aes(x = topics, y = percent, label = percent, fill = language)) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_text(vjust=-0.3, position = position_dodge(0.9)) +
theme_bw() +
coord_cartesian(ylim = c(0, 30)) +
labs(x = "Topic", y = "Percent") +
theme(legend.position = "top",
axis.text.x = element_text(angle = 90))
To export the plot as a png file, we use `ggsave`. Be aware that we use the `here` function to save the file in the `MyOutput` folder.
The `ggsave` function has the following main arguments:
- `filename`: file name to create on disk.
- `device`: device to use. Can either be a device function (e.g. png) or one of "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (Windows only). If NULL (default), the device is guessed based on the filename extension.
- `path`: path of the directory to save the plot to; path and filename are combined to create the fully qualified file name. Defaults to the working directory.
- `width`, `height`: plot size in units expressed by the `units` argument. If not supplied, uses the size of the current graphics device.
- `units`: one of the units in which the width and height arguments are expressed: "in", "cm", "mm" or "px".
- `dpi`: plot resolution. Also accepts a string input: "retina" (320), "print" (300), or "screen" (72). Applies only to raster output types.
- `bg`: background colour. If NULL, uses the plot.background fill value from the plot theme.
# save the plot in the MyOutput folder
ggsave(here::here("notebooks/MyOutput/image_01.png"), bg = "white")
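If you need more control over the output, you can set the size and resolution explicitly; the values below are merely examples:
# example: save with an explicit size and resolution
ggsave(here::here("notebooks/MyOutput/image_01.png"),
       width = 20, height = 12, units = "cm", dpi = 300, bg = "white")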
You will find the image-file named *image_01.png* in the `MyOutput` folder (located on the left side of the screen).
Simply double-click the `MyOutput` folder icon, then right-click on the *image_01.png* file, and choose Download from the dropdown menu to download the file.
We end the session by calling the session info, which tells us which packages and which versions of the software and packages we have used.
sessionInfo()