So far we have focused only on data preparation, without running any analysis or touching the stm package, which is what will happen here. In these paragraphs, we will initially focus on some diagnostic techniques for the identification of an adequate number of topics, which is a key step, and a question to which there is no "right" answer - or rather the answer is very context and needs-specific (see the stm package vignette or the article by Roberts et al. 2016). Secondly, we will perform some basic analysis on the resulting model, in order to understand more about it and to explore the potentialities of the stm package.
Note: the notebook is better rendered here on nbviewer .
In the manual of the stm R package the authors give some very generic rules of thumb in with regard to the number of topics:
For short corpora focused on very specific subject matter (such as survey experiments) 3-10 topics is a useful starting range. For small corpora (a few hundred to a few thousand) 5-50 topics is a good place to start. Beyond these rough guidelines it is application specific. Previous applications in political science with medium sized corpora (10k to 100k documents) have found 60-100 topics to work well. For larger corpora 100 topics is a useful default size. Of course, your mileage may vary.
Ours is a small and short corpora on a relatively specific subject matter, so we shall focus at least initially on the 5-50 range. We will now explore some basic diagnostics to identify a properly fit model.
For the sake of the present exercise, we will focus on a model with only the factor by which the salary is computed as topic prevalence covariate. This was chosen by reasons of simplicity, as the other covariates need some further massaging, so we will work with them at a later stage.
The "normal" dataframe needs to be converted into a matrix of word indeces and counts before being fed to the function. stm accepts different formats, among which quanteda dfm objects or sparse matrix, but here we will just use the native stm format prepared with the package functions textProcessor and prepDocuments:
suppressWarnings(library(stm)) suppressWarnings(library(stringr)) DF<-read.table("totaljobsCoordRates.txt") DF<-DF[!is.na(DF$Description),]# stm doesn't work with missing data DF<-DF[!is.na(DF$rateby),] DF$Description = str_replace_all(DF$Description, "/"," ")# substitute "/" with a space DF$Description = str_replace_all(DF$Description, "'|"|/", "'") ## links and other eventual html encoding (adapted from https://juliasilge.com/blog/evaluating-stm/) DF$Description = str_replace_all(DF$Description, "<a(.*?)>", " ") ## DF$Description = str_replace_all(DF$Description, ">|<|&", " ") ## DF$Description = str_replace_all(DF$Description, "&#[:digit:]+;", " ") ## DF$Description = str_remove_all(DF$Description, "<[^>]*>") ## textProcessor prepares the corpus, representing the documents as lists containting #word indicese and associated word counts, the vocab character vector associated with #the word indices and a metadata matrix containing the covariates. As we can remove some words, #here we opt to remove "work" and "will" from the corpus processed<-textProcessor(DF$Description, metadata = DF, customstopwords=c("work", "will"), verbose=FALSE) ##PrepDocuments is a helpful function to perform some manipulations like removing words based on thresholds #without compromising the indexes out<-prepDocuments(processed$documents, processed$vocab, processed$meta, verbose=FALSE) docs<-out$documents vocab<-out$vocab meta<-out$meta
stm v1.3.3 (2018-1-26) successfully loaded. See ?stm for help. Papers, resources, and other materials at structuraltopicmodel.com
As prepDocuments removes words based on thresholds, it might be helfpul to have a cursory look at them. I found also helpful the below procedure to identify a specific document(s) where a specific word has been used.
#identify documents where the word "heartland" has been used wntword<-which(processed$vocab=="heartland") filterdocs<-lapply(processed$documents, function(ch) grep(wntword, ch)) indexList<-filterdocs[sapply(filterdocs, function(x) length(x) > 0)] DF[as.numeric(names(indexList)),2:3]
|199||Branch Customer Adviser - Hexham (Part Time)||The Society’s branch network represents the face of Newcastle Building Society on the high street in our Heartland the North East of England. The main purpose of the branches is the provision of savings and mortgage accounts underpinned by solid Financial Advice for those who live in and around our branch locations, that are spread right across the North East of England, Cumbria and the Scottish borders. We deliver these services whilst providing excellent customer service. A priority for the branch network is maintaining great relationships with both existing and potential customers through maintaining regular customer engagement and being there to help them understand and address their ever changing financial needs. Role Summary This is a part-time position, 21 hours per week, Monday to Wednesday, 9am - 5pm (plus Saturday 8:45am-12:00pm, to be worked as overtime on a rota basis). You will consistently adopt a proactive and positive approach with our customers, in order to achieve the required business result. You will also understand how best to help customers by applying the branch understanding needs process, suggesting options customers may consider and identifying leads for the relevant advice specialists. You will be responsible for till operation, cash management, branch administration, servicing, balancing and cash replenishment. You will efficiently complete all administration in relation to customer enquiries, leads and sales, utilising the Societies customer management systems. Skills and Experience As a Customer Adviser, you will be working as part of an already successful team that provides 1st class, market leading customer service. You will possess great communication skills that will result in high levels of customer satisfaction and customer advocacy for use of our products and services. A proven ability to work as part of a team and deliver the high standards we require is essential.|
As mentioned above, with regards to the number of topics we should focus at least initially on the 5-50 range. stm allows to run some preliminary diagnostics in order to assess the models, in particular with the function searchK which performs a number of tests, like held-out likelihood, residual analysis, average exclusivity and semantic coherence. The results then can be easily plotted just calling plot on the resulting searchK object. Those preferring to work with quanteda or sparse matrices can have a look at this post by Julia Silge.
Here we will run a test on 5,10,15, 20 and 50 topics (note that the held-out is re-computed randomly with each run, so even with the same seed the results might be different).
storage1<-searchK(docs, vocab, K = c(5,10,15,20, 50), prevalence=~ rateby, data=meta,set.seed(9999), verbose=FALSE)
print(storage1$results) options(repr.plot.width=6, repr.plot.height=6) plot(storage1)
K exclus semcoh heldout residual bound lbound em.its 1 5 8.420103 -34.18562 -6.610060 1.709334 -445128.2 -445123.4 80 2 10 8.984677 -40.22166 -6.524543 1.498724 -432460.5 -432445.4 95 3 15 9.191316 -45.02971 -6.498164 1.451809 -425034.6 -425006.7 131 4 20 9.320732 -50.82168 -6.523601 1.427977 -419447.1 -419404.8 78 5 50 9.558343 -61.14846 -6.649663 2.512571 -397968.2 -397819.7 84
Within certain boundaries, it seems that the choice of model is a matter of trade-offs. In our case, the best results seem to be in the range 10-20. It can be helpful to compare then semantic coherence to exclusivity, as models with fewer topics have higher semantic coherence for more topics, but lower exclusivity. To check for it however we have to fit the models first, which is what we do next. We will set the initiatlization method to the default "Spectral", as advised by the author of the package, although alternatives are available (the vignette offers further information about the different methods of initialization). Also in this case the post of Julia Silge mentioned above presents an alternative procedure.
model10Prrateby<-stm(documents=out$documents, vocab=out$vocab, prevalence =~ rateby, K=10, data=out$meta, init.type = "Spectral", verbose=FALSE) model15Prrateby<-stm(documents=out$documents, vocab=out$vocab, prevalence =~ rateby, K=15, data=out$meta, init.type = "Spectral", verbose=FALSE) model20Prrateby<-stm(documents=out$documents, vocab=out$vocab, prevalence =~ rateby, K=20, data=out$meta, init.type = "Spectral", verbose=FALSE)
As mentioned above, we check first exclusivity against semantic coherence per topic per model. stm includes the function topicQuality to plot them per each model, but the visual result tends to be confusing and cannot plot multiple models together, so in this occasion we will proceed with ggplot:
suppressWarnings(library(ggplot2)) suppressWarnings(library(htmlwidgets)) M10ExSem<-as.data.frame(cbind(c(1:10),exclusivity(model10Prrateby), semanticCoherence(model=model10Prrateby, docs), "Mod10")) M15ExSem<-as.data.frame(cbind(c(1:15),exclusivity(model15Prrateby), semanticCoherence(model=model15Prrateby, docs), "Mod15")) M20ExSem<-as.data.frame(cbind(c(1:20),exclusivity(model20Prrateby), semanticCoherence(model=model20Prrateby, docs), "Mod20")) ModsExSem<-rbind(M10ExSem, M15ExSem, M20ExSem) colnames(ModsExSem)<-c("K","Exclusivity", "SemanticCoherence", "Model") ModsExSem$Exclusivity<-as.numeric(as.character(ModsExSem$Exclusivity)) ModsExSem$SemanticCoherence<-as.numeric(as.character(ModsExSem$SemanticCoherence)) options(repr.plot.width=7, repr.plot.height=6, repr.plot.res=100) plotexcoer<-ggplot(ModsExSem, aes(SemanticCoherence, Exclusivity, color = Model))+geom_point(size = 2, alpha = 0.7) + geom_text(aes(label=K), nudge_y=.04)+ labs(x = "Semantic coherence", y = "Exclusivity", title = "Comparing exclusivity and semantic coherence") plotexcoer
The three models appear fairly similar in this regard, with the model with 20 topics having two outliers with relatively lower semantic coherence and the model with 10 topics with the tendence to have lower exclusivity. For the sake of the present exercise we will then proceed with the model with 15 topics, which appears to be a good compromise. On the other hand, the other two models can be considered good enough and in another situation they might be preferrable. As mentioned above, the choice of the model can be very context-specific.
Next, we want to know a bit better our model. stm stores the document-topic proportions and the topic-word distributions in two matrices, $\theta$ (which is also refferred to, somewhat confusingly, as $\gamma$) and $\beta$.
We shall start having a look at $\theta$, which can be called directly from the model, or more conveniently with the built-in function make.dt, which allows to load also the metadata (which in our case is helpful considering we used rateby as prevalence co-variate).
topicprop15<-make.dt(model15Prrateby, meta) topicprop15[1:3,c(1:18, 27)]#visualize proportions, job title and rateby
|1||0.011672020||0.023733125||0.0003485387||8.557529e-02||0.0105680973||0.0072089073||0.0020970495||0.1369500597||0.713769152||2.393135e-05||0.002362233||0.0000192910||0.0032763198||0.0006509655||0.001745020||1||Dementia Friendly Communities Officer||year|
Consulting the table might be quite cumbersome unless you want to have a look at the topic proportions of a specific document. stm offers the plot.STM function with "hist" as argument in order to visualize the estimates of document-topic proportions (Julia Silge in another post offers a different procedure using ggplot to otbain a similar plot, I have included this in the appendix):
options(repr.plot.width=7, repr.plot.height=7) plot.STM(model15Prrateby, "hist")
This plot basically tells us which topics are coming from which documents. As expected, each topic has no relation or very little relation with several documents. I personally found also helpful to plot the topic distribution per document, using tidytext and ggplot:
suppressWarnings(library(tidytext))# the package sometimes is not loaded correctly. If this happens, you might have to re-start the kernel td_theta <- tidytext::tidy(model15Prrateby, matrix = "theta") selectiontdthteta<-td_theta[td_theta$document%in%c(1:15),]#select the first 30 documents. be careful to select a sensible interval, as attempting to load a very huge corpus might crash the kernel thetaplot1<-ggplot(selectiontdthteta, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) + geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) + facet_wrap(~ document, ncol = 3) + labs(title = "Theta values per document", y = expression(theta), x = "Topic") options(repr.plot.width=8, repr.plot.height=7, repr.plot.res=100) thetaplot1
As an example, we see here that document 9 is strongly associated with topic 15 and document 15 with topic 8, whereas documents 10 or 13 are more mixed.
Next, we want to understand more about each topic - what are they really about. If we go back to the $\beta$ matrix, we can have a more analytical look at the word frequencies per topic. The matrix stores the log of the word probabilities for each topic, and plotting it can give us a good overall understanding of the distribution of words per topic (the procedure below is mutuated from the second post of Julia Silge ):
suppressWarnings(library(dplyr)) suppressWarnings(library(drlib))#drlib is available on github and needs devtools to be installed td_beta <- tidytext::tidy(model15Prrateby) options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100) td_beta %>% group_by(topic) %>% top_n(10, beta) %>% ungroup() %>% mutate(topic = paste0("Topic ", topic), term = reorder_within(term, beta, topic)) %>% ggplot(aes(term, beta, fill = as.factor(topic))) + geom_col(alpha = 0.8, show.legend = FALSE) + facet_wrap(~ topic, scales = "free_y") + coord_flip() + scale_x_reordered() + labs(x = NULL, y = expression(beta), title = "Highest word probabilities for each topic", subtitle = "Different words are associated with different topics")
Attaching package: 'dplyr' The following objects are masked from 'package:stats': filter, lag The following objects are masked from 'package:base': intersect, setdiff, setequal, union
In this case I found pretty helpful to go and have a more detailed look at the word distribution within each topic (the plot above focus only on the top 10 words for each topic):
betaT1<-td_beta %>% mutate(topic = paste0("Topic ", topic), term = reorder_within(term, beta, topic)) %>%filter(topic=="Topic 1")#beta values for topic 1 betaplotT1<-ggplot(betaT1[betaT1$beta>0.003,], aes(term, beta, fill = as.factor(topic))) + geom_bar(alpha = 0.8, show.legend = FALSE, stat = "Identity")+coord_flip()+labs(x ="Terms", y = expression(beta), title = "Word probabilities for Topic 1")#plot word probabilities higher than 0.003 for topic 1 options(repr.plot.width=9, repr.plot.height=10, repr.plot.res=100) betaplotT1
Also for topic exploration the stm package offers some nice built-in functions that make analysis easier. In order to identify the content of each topic, we can use again plot.STM, this time with argument "summary" to visualize the topic distribution (which topics are overall more common), with most common words for each topic (as we will see later there are several options for the words to be visualized, but here we will leave the default).
options(repr.plot.width=7, repr.plot.height=7, repr.plot.res=100) plot.STM(model15Prrateby, "summary", n=5)# distribution and top 5 words per topic
labelTopics (or sageLabels) gives us more detailed insights on the popular words in each topic. As mentioned above, other than the highest probability, we can visualize the FREX words (FREX weights words by frequency and exclusivity to the topic), lift words (frequency divided by frequency in other topics), and score (similar to lift, but with log frequencies). The vignette and the manual of stm offer more details on this aspect. plot.STM can again be used with "labels" as argument to plot the words in a more visually appealing way (although I found the final result not always satisfactory).
This type of analyis can be helpful in particular to make comparisons between topics and understand more which differences there are between them. As an example, topics 5,6,9 seem to have a degree of overlap, as they are all focusing on customer care, so we might want to check them in more detail:
labelTopics(model15Prrateby, topics=c(5,6,9), n=10)# complete list of top 10 words per topics 5-6-9 par(mfrow=c(1,1), mar=c(0,2,0,0)) options(repr.plot.width=5, repr.plot.height=3, repr.plot.res=100) plot.STM(model15Prrateby, "labels", topics=c(5,6,9), label="frex", n=10, width=55)#top 10 FREX words per topics 5-6-9
Topic 5 Top Words: Highest Prob: custom, team, store, servic, sale, work, product, hour, role, retail FREX: store, retail, sale, love, kitchen, brand, product, great, discount, price Lift: ‘can, afford, ambassador, here, high-street, lowest, merchandis, pub, sister, tv’s Score: store, sale, saver, custom, superdrug, retail, bike, fashion, kitchen, love Topic 6 Top Words: Highest Prob: custom, role, skill, excel, client, administr, offic, manag, team, experi FREX: order, invoic, purchas, supplier, administr, detail, client, manufactur, tax, verbal Lift: bill, bullet, coleman, courier, expedit, fork, forklift, inventori, law, louis Score: sale, purchas, invoic, custom, manufactur, adecco, stock, ledger, angel, global Topic 9 Top Words: Highest Prob: custom, ’ll, servic, make, need, role, help, can, ’re, work FREX: ’ll, ’re, just, everyon, thing, don’t, realli, want, legal, get Lift: bag, colour, discrimin, forget, moment, offici, optic, perhap, perk, prejudic Score: ’ll, ’re, custom, realli, don’t, rta, everyon, thing, showroom, addleman
We can also have a glimpse at highly representative documents per each topic with findThoughts, and plot them with plotQuote (although this might give best results with shorter documents):
thoughts5 <- findThoughts(model15Prrateby,texts=DF$Description, topics=5, n=3)$docs[]# take 3 representative documents per topic 5 thoughts6 <- findThoughts(model15Prrateby,texts=DF$Description, topics=6, n=3)$docs[]# take 3 representative documents per topic 6 thoughts9 <- findThoughts(model15Prrateby,texts=DF$Description, topics=9, n=3)$docs[]# take 3 representative documents per topic 9 options(repr.plot.width=10, repr.plot.height=12, repr.plot.res=100) par(mfrow=c(1,4), mar=c(0,0,2,2)) plotQuote(thoughts5, width=30, maxwidth=500, text.cex=1.25, main="Topic 5") plotQuote(thoughts6, width=30, maxwidth=500, text.cex=1.25, main="Topic 6") plotQuote(thoughts9, width=30, maxwidth=500, text.cex=1.25, main="Topic 9")
We might assume that topic 5 is more related to retail, with positions like store assistant, topic 6 closer to back office and sales positions and topic 9 somewhere in between. It is worth noting that several words in topic 9 seem to have weak semantic content (like "'ll", "'re", "'just"), and we might consider to re-run the model taking them out as stopwords. If we go back to the distribution of estimates, very few documents seem to have a strong association with topic 9, and if we recover the original exclusivity-semantic coherence plot, we note that topic 9 has the lowest exclusivity in the model we are using:
options(repr.plot.width=7, repr.plot.height=6, repr.plot.res=100) ggplot(ModsExSem[ModsExSem$Model=="Mod15",], aes(SemanticCoherence, Exclusivity, color = Model))+geom_point(size = 2, alpha = 0.7) + geom_text(aes(label=K), nudge_x=.05, nudge_y=.05)+ labs(x = "Semantic coherence", y = "Exclusivity", title = "Comparing exclusivity and semantic coherence for model with 15 topics")
Another very helpful function to use for topic analysis is again plot.STM with "perspectives" as argument, wich allows us to have a graphical display of topical contrasts, where words are plotted with size proportional to their use within the topic and oriented along the X-axis based on how much they favor one topic against the other (the vertical configuration is random):
options(repr.plot.width=7, repr.plot.height=7, repr.plot.res=100) par(mfrow=c(2,2), mar=c(0,0,2,2)) plot.STM(model15Prrateby, "perspectives", topics=c(5,9)) plot.STM(model15Prrateby, "perspectives", topics=c(5,6)) plot.STM(model15Prrateby, "perspectives", topics=c(6,9))
Our supposition about the differences between topic 6 and 5 seems confirmed, while the topic 9 is confirmed to have a weak semantic content. We will probably have to re-run the model taking out some additional stopwords. However, we can first conclude this overview of the main functions of stm having a look at the difference made by the inclusion of "rateby" as a topic prevalence co-variate. As mentioned in a previous post, the Structural Topic Model allows the analysis of relationships with document metadata, in form of co-variates, either in terms of the degree of association of a document to a topic, either of the association of a word to a topic. The function to extract the relationship and associated uncertaintuy on all the topics of the model is estimateEffect, which basically estimates a regression with documents as units, the outcome is the proportion of each document about a topic in an STM model and the covariates are document-meta data:
out$meta$rateby<-as.factor(out$meta$rateby) prep<-estimateEffect(1:15~ rateby, model15Prrateby, metadata=out$meta, uncertainty="Global")#nsim is defaulted to 25, but on a small model a higher number lead to more consistent results summary(prep, topics=c(1:3), nsim=1000)# summary of regression on topic 1-3
Call: estimateEffect(formula = 1:15 ~ rateby, stmobj = model15Prrateby, metadata = out$meta, uncertainty = "Global") Topic 1: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.03983 0.01380 2.885 0.00411 ** ratebyday 0.03308 0.04554 0.726 0.46801 ratebyhour 0.06448 0.02136 3.019 0.00269 ** ratebymonth -0.02540 0.18667 -0.136 0.89184 ratebyweek -0.04007 0.05885 -0.681 0.49628 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Topic 2: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.038904 0.014162 2.747 0.00627 ** ratebyday -0.026867 0.040058 -0.671 0.50278 ratebyhour 0.034647 0.019766 1.753 0.08036 . ratebymonth 0.008482 0.210999 0.040 0.96795 ratebyweek -0.038638 0.053918 -0.717 0.47401 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Topic 3: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.021089 0.011812 1.785 0.074928 . ratebyday 0.453879 0.049058 9.252 < 2e-16 *** ratebyhour 0.007749 0.015948 0.486 0.627310 ratebymonth -0.020844 0.147313 -0.141 0.887546 ratebyweek 0.220659 0.065674 3.360 0.000851 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For reasons of space here we printed only the first three topics, but overall it seems that rateby doesn't have significant effectes, with the only exception of topic 3 (educational jobs) with the covariates week and day. The results of estimateEffect can also be plotted with a variety of methods, pointestimate being probably the best for factor variables as in our case:
options(repr.plot.width=7, repr.plot.height=7, repr.plot.res=100) plot.estimateEffect(prep, model=model15Prrateby, covariate="rateby", topics=3, method="pointestimate", xlim=c(-.5,1))
We can basically infer that for offers related to topic 3 the rate tend to be computed by day rather than by other factors. As mentioned at the beginning, other covariates might offer more insightful results, and rateby was selected just for simplicity reasons.
In this document we have seen some diagnostic techniques to select a number of topics for a stm model, and gave overview of the main functions offered by the stm R package. There are a number of additional add-on packages and functions that allow further analysis, but we will explore them in a latter stage. Some of them are mentioned in the paragraph below.
Based on the results of this work, we identified a model with 15 topics as a good fit, but we also concluded that we should re-run the model on a corpus from where more stopwords are excluded. We will proceed in this way in the next entry of this series, where we will give some more depth to our analysis. In particular, we will include further variables to the analysis, explore the use of topical content co-variates, and focus more on topic correlations.
We recomend to read in detail the package vignette, where some of the functions used here are presented in more detail. The theoretical aspects of the model are explained in more detail in Roberts et al., "A model of text for experimentation in the social sciences", 2016. The website of the authors of the model and the package offers an extensive repository of resources, including supporting packages for analysis and visualization, as well as a list of scientific articles adopting stm. Julia Silge has authored two interesting posts (one and two) where stm is used adopting tidy data principles.
The procedure below has been used by Julia Silge to obtain the distribution of document probabilities per each topic, in fashion similar to plot.STM with "hist" as argument. It might be helpful if using a tidy data approach, or wants to have a higher degree of control on the visualization.
ggplot(td_theta, aes(gamma, fill = as.factor(topic))) + geom_histogram(alpha = 0.8, show.legend = FALSE) + facet_wrap(~ topic, ncol = 3) + labs(title = "Distribution of document probabilities for each topic", y = "Number of documents", x = expression(theta))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.