#!/usr/bin/env python
# coding: utf-8

# # COVID-19 Discussion in Albertan Subreddits
#
# Reddit is a popular forum for the discussion of local issues. Every day, Albertans submit hundreds of submissions and thousands of comments across the [`/r/alberta`](https://www.reddit.com/r/alberta/), [`/r/edmonton`](https://www.reddit.com/r/Edmonton/), and [`/r/calgary`](https://www.reddit.com/r/Calgary/) subreddits. Comments from these subreddits offer a unique look into the thoughts of Albertans as we navigate the COVID-19 pandemic.
#
# In this notebook we will use data from comments made in the three major Albertan subreddits to track the emergence of the COVID-19 pandemic as a major factor in the lives of Albertans. To determine whether comments are relevant to a set of topics related to the pandemic, we will use an unsupervised text classification model. For more in-depth implementation details, see the [accompanying code in this repo](https://github.com/epsalt/reddit-c19-analysis).
#
# The final result is the figure below, which shows the frequency of comments related to topics relevant to the pandemic:

# In[1]:

import json

import altair as alt

## alt.renderers.enable('mimetype')

with open("assets/chart.json") as f:
    spec = json.load(f)

alt.Chart.from_dict(spec)


# | Label | Date       | Event                                    |
# |:-----:|:----------:|:-----------------------------------------|
# | A     | 2020-01-15 | Canada's first case                      |
# | B     | 2020-03-05 | Alberta's first case                     |
# | C     | 2020-03-17 | Canada Declares Public Health Emergency  |

# ## The Pushshift Reddit Dataset
#
# Data for this project was compiled using the [Pushshift](https://pushshift.io/) API. Pushshift is a social media data collection platform that has archived Reddit data since 2015. For more information, see [*The Pushshift Reddit Dataset*](https://arxiv.org/abs/2001.08435). For the Python code used to request data from the Pushshift API, see [pushshift.py](https://github.com/epsalt/reddit-c19-analysis/blob/master/pushshift.py); a minimal sketch of such a request is shown below.
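#
# The sketch below is illustrative only, not the repo's `pushshift.py`: the endpoint and parameter names are assumptions based on the public Pushshift documentation, and the real script also handles paging and rate limiting.
#
# ```python
# import requests
#
# def fetch_comments(subreddit, after, before, size=100):
#     """Fetch one page of comments for `subreddit` between two UTC epoch timestamps."""
#     resp = requests.get(
#         "https://api.pushshift.io/reddit/search/comment/",
#         params={"subreddit": subreddit, "after": after, "before": before, "size": size},
#     )
#     resp.raise_for_status()
#     return resp.json()["data"]
#
# # e.g. /r/alberta comments from the first day of 2020
# comments = fetch_comments("alberta", after=1577836800, before=1577923200)
# ```
#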
# Data compiled for this project included 22682 submissions and 487072 comments from the `/r/alberta`, `/r/calgary`, and `/r/edmonton` subreddits between January 1, 2020 and May 1, 2020. Only comment data was used for this project, but future work could incorporate submissions into the analysis.
#
# | Subreddit | Submissions | Comments   |
# |-----------|-------------|------------|
# | Alberta   | 4934        | 123827     |
# | Calgary   | 10279       | 229931     |
# | Edmonton  | 7469        | 133314     |
# | **Total** | **22682**   | **487072** |
#
# The compiled data in compressed `jsonl` format can be downloaded here:
#
# - Comments (68MB): https://alberta-reddit-data.s3-us-west-2.amazonaws.com/coms.jsonl.gz
# - Submissions (9MB): https://alberta-reddit-data.s3-us-west-2.amazonaws.com/subs.jsonl.gz

# ## Running this notebook
#
# This Jupyter notebook can be run on your own machine by following these steps:
#
# ```bash
# # Clone the repo
# $ git clone https://github.com/epsalt/reddit-c19-analysis
# $ cd reddit-c19-analysis
#
# # Install dependencies
# $ pip install -r requirements.txt
# $ python -m spacy download en_core_web_sm
#
# # Download comment data
# $ curl https://alberta-reddit-data.s3-us-west-2.amazonaws.com/coms.jsonl.gz -o coms.jsonl.gz
# $ gunzip -d coms.jsonl.gz -c > data/coms.jsonl
#
# # Or use `pushshift.py` to request data from the Pushshift API
# # - Requests are rate limited, so this can take a while
# # - Date ranges or subreddits can be changed in the source
# $ python pushshift.py
#
# # Run the notebook
# $ jupyter notebook c19-reddit-alberta.ipynb
# ```
#
# This notebook was run on an Arch Linux machine with an AMD Ryzen 5 2600X CPU @ 3.60 GHz (6 cores) and 16 GB of memory. Running the entire notebook takes around 5 minutes.

# In[2]:

from IPython.display import Markdown, HTML


# ## Preprocessing comment text
#
# Before training the model, comment text first needs to be preprocessed. Reddit comments are messy: they can include emojis, misspellings, URLs, and other comments embedded as quotes.

# In[3]:

# An example messy comment
# www.reddit.com/r/Edmonton/comments/fo4ne7/when_do_i_start_to_stay_home/fliscp9/
comment = ">They could mean that if you get good rest you won't show symptoms in many cases.\n\nI'm assuming they aren't stupid and therefore aren't actually proposing that getting a good enough sleep will actually cause you to be asymptomatic for any disease, let alone COVID-19.\n\n> Confirmed cases are when people have been tested. \n\nGiven that so many countries are currently only testing people with symptoms and are not even routinely testing asymptomatic front-line workers, I am okay with an assumption that \"confirmed cases\" is fairly equivalent \"confirmed symptomatic cases\".\n\nAlso, this is my source for \"80% of cases are mild\":\n\n[https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200301-sitrep-41-covid-19.pdf?sfvrsn=6768306d\\_2](https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200301-sitrep-41-covid-19.pdf?sfvrsn=6768306d_2)\n\n> Among 44672 patients in China with confirmed infection, 2.1% were below the age of 201. The most commonly reported symptoms included fever, dry cough, and shortness of breath,and most patients (80%) experienced mild illness. Approximately14% experienced severe disease and 5% were critically ill. \n\nNote that \"confirmed infection\" terminology here and that any number of asymptomatic people in this particular sample was < 1%.\n\nSo, claiming that I am making a big assumption here seems unwarranted. I certainly didn't make this up; I am quoting the data from a WHO report on a large sample. Certainly as we learn more have more data and have some replicated studies that perform blanket testing in large populations, we might find that asymptomatic cases are indeed high. I am open to that possibility."
Markdown(comment)


# In[4]:

from analysis import regex_replace, tokenize

docs = regex_replace(comment)  # Regex substitution to remove comment replies, links, non-ASCII chars
tokens = tokenize([docs])[0]   # Tokenization and lemmatization with spacy

tokens


# In[5]:

get_ipython().run_cell_magic('time', '', '\nimport dask\nfrom analysis import preprocess\n\n# Preprocess comment text\nwith dask.config.set(scheduler="processes"):\n    df = preprocess("data/coms.jsonl")\n\nsentences = [str(doc).split() for doc in df["tokens"].to_list()]\ndf.head()\n')


# ## Word2vec model
#
# After the comment text has been preprocessed, we can use it to train a model. For this project we are using [`word2vec`](https://en.wikipedia.org/wiki/Word2vec), a model which produces word embeddings: vector representations of textual data that let us measure how similar comments are to a set of topic keywords. With a corpus of ~400k preprocessed comments and a vector size of 300, the model took about 2 minutes to train using the [gensim](https://radimrehurek.com/gensim/models/word2vec.html) `word2vec` implementation.

# In[6]:

get_ipython().run_cell_magic('time', '', '\nfrom model import W2vModel # gensim wrapper\n\n# Train the model\nmodel = W2vModel()\nmodel.train(sentences)\nmodel.save("models")\n')


# Some checks to make sure word similarities make sense:

# In[7]:

# Find words similar to 'covid'
model.ft.wv.similar_by_word("covid")


# In[8]:

# Find words similar to 'mask'
model.ft.wv.similar_by_word("mask")


# In[9]:

# Check similarity of some word pairings
# More similar = higher score
pairs = [("covid-19", "coronavirus"),  # related
         ("dog", "pandemic"),          # not related
         ("cat", "dog"),               # related
         ("house", "turkey"),          # not related
         ("trudeau", "notley")]        # related

for pair in pairs:
    similarity = model.ft.wv.similarity(*pair)
    print(f"{', '.join(pair).ljust(21)} {similarity:10.5f}")


# ## Comment similarity
#
# Now that we have a trained model, we can use it to classify comments. To track discussion of topics related to the COVID-19 pandemic we are going to use six groups of topic keywords. These were selected manually with the aid of `word2vec` word similarity scores. Future enhancements could utilize [Latent Dirichlet allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) to discover topics automatically. A sketch of the soft cosine similarity technique used for this comparison is shown below.
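#
# The following is illustrative only and assumes gensim >= 4.0; the notebook's `W2vModel` wrapper may be implemented differently, and `sentences` and `query_tokens` below are tiny placeholders. It shows how a soft cosine similarity between a topic query and each comment can be computed with gensim's built-in utilities:
#
# ```python
# from gensim.corpora import Dictionary
# from gensim.models import Word2Vec
# from gensim.similarities import (
#     SoftCosineSimilarity,
#     SparseTermSimilarityMatrix,
#     WordEmbeddingSimilarityIndex,
# )
#
# # Placeholder tokenized comments and topic keywords
# sentences = [["covid", "case", "alberta"], ["dog", "park", "walk"]]
# query_tokens = ["covid", "case"]
#
# # Word embeddings trained on the comment corpus
# w2v = Word2Vec(sentences=sentences, vector_size=300, min_count=1)
#
# # Bag-of-words corpus and a term similarity matrix derived from the embeddings
# dictionary = Dictionary(sentences)
# bow_corpus = [dictionary.doc2bow(doc) for doc in sentences]
# termsim_index = WordEmbeddingSimilarityIndex(w2v.wv)
# similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
#
# # One soft cosine similarity score per comment for this topic query
# index = SoftCosineSimilarity(bow_corpus, similarity_matrix)
# scores = index[dictionary.doc2bow(query_tokens)]
# ```
#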
# In[10]:

import pandas as pd

with open("data/terms.json") as f:
    terms = json.load(f)

table = pd.DataFrame.from_dict(terms, orient="index", columns=["keywords"]).to_html()
HTML(table)


# In[11]:

get_ipython().run_cell_magic('time', '', "\nimport warnings\nwarnings.filterwarnings('ignore') # TODO: investigate div/0 errors\n\nqueries = terms.keys()\nquery_tokens = [doc.split() for doc in tokenize(list(terms.values()))]\n\n# Calculate soft cosine similarity between topic query and each comment\nfor query, token in zip(queries, query_tokens):\n    df[query] = model.similarity(token, sentences)\n\ndf.head()\n")


# ## Visualizing topic discussion frequency
#
# Once a similarity score for each topic has been calculated, we can aggregate and visualize the results:

# In[12]:

from analysis import aggregate

# Aggregate scores by submission day and subreddit
# Score = Count(similarity > threshold) / Count(Total)
agg = (
    aggregate(df, threshold=0.30)
    .reset_index()
    .assign(cat=lambda x: x["cat"].str.capitalize())
    .assign(date=lambda x: x["created_utc"])
)


# In[13]:

# Load timeline data for context
timeline = pd.read_json("data/timeline.json", convert_dates=True)
tl = timeline.assign(cat=lambda x: [list(terms.keys())] * len(x)).explode("cat")


# In[14]:

# Build Altair plot
plt = (
    alt.Chart(title="Chart title")
    .mark_line()
    .encode(
        x=alt.X("date", axis=alt.Axis(title=None)),
        y=alt.Y("score:Q", axis=alt.Axis(title="Freq")),
        color=alt.Color(
            "subreddit:O",
            legend=alt.Legend(title="Subreddit"),
            scale=alt.Scale(scheme="tableau10"),
        ),
    )
)

tlc = alt.Chart(timeline).mark_rule().encode(x="date")

labels = tlc.mark_text(align="left", baseline="top", dx=7).encode(
    text="label", y=alt.value(5)
)

plot = (
    (plt + tlc + labels)
    .properties(width=330, height=75)
    .facet(
        alt.Facet("cat:N", title=None),
        data=agg,
        title="Topic Discussion Frequency Across Albertan Subreddits",
        columns=2,
    )
    .resolve_scale(x="independent")
)

plot


# | Label | Date       | Event                                    |
# |:-----:|:----------:|:-----------------------------------------|
# | A     | 2020-01-15 | Canada's first case                      |
# | B     | 2020-03-05 | Alberta's first case                     |
# | C     | 2020-03-17 | Canada Declares Public Health Emergency  |

# ## Conclusions
#
# By visualizing how topic discussion has changed over time we can start to understand how Albertans have reacted to the COVID-19 pandemic. Here are some observations:
#
# - Discussion of the pandemic was rare until after Alberta's first case on March 5th.
# - Shortages and hoarding were a major concern for about 10 days. After suppliers dealt with shortages, discussion decreased significantly.
# - Ideas about social distancing did not enter the public discourse until two weeks after discussion of COVID-19 peaked.
# - Discussion frequency in the `/r/calgary` and `/r/edmonton` subreddits was similar, despite differing COVID-19 case rates between the cities.
# - The economy has been a constant topic of discussion in Albertan subreddits throughout the pandemic. Economic discussion frequency is higher in `/r/alberta` than in `/r/calgary` and `/r/edmonton`.
#
# Reddit comments proved to be a useful source of data for measuring the magnitude of local conversation on topics related to the COVID-19 pandemic. A text classification model trained with social media data to understand local issues could be a helpful tool for government bodies and non-profits. For example, municipal governments could use a trained model to automatically classify 311 complaints with locally relevant keywords.
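#
# As a rough sketch of that idea, the model and topic keywords from this notebook could be reused to score a new, unseen piece of text. The complaint text below is made up, and this assumes the `regex_replace`, `tokenize`, and `model.similarity` signatures used in the cells above, including that `model.similarity` returns one score per document:
#
# ```python
# # Hypothetical 311-style complaint text
# new_text = "Grocery stores in my neighbourhood are still out of hand sanitizer"
#
# # Same preprocessing pipeline as the comments
# new_tokens = tokenize([regex_replace(new_text)])[0].split()
#
# # Score the new document against each topic's keywords
# # (`queries` and `query_tokens` come from the similarity cell above);
# # the highest-scoring topic is the most relevant label
# scores = {
#     query: model.similarity(token, [new_tokens])[0]
#     for query, token in zip(queries, query_tokens)
# }
# print(scores)
# ```
#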
# ## Future Work
#
# - Integrate more Canadian cities into the analysis
# - Statistical discovery of topics via [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
# - Interactive visualization of word vectors with the [tensorboard embedding projector](http://projector.tensorflow.org/)
# - Explore using a pre-trained model as a foundation before training with local subreddit comment data