#!/usr/bin/env python
# coding: utf-8

# # Recipes for processing Twitter data with jq

# This notebook is a companion to [Getting Started Working with Twitter Data Using jq](http://nbviewer.jupyter.org/github/gwu-libraries/notebooks/blob/master/20160407-twitter-analysis-with-jq/Working-with-twitter-using-jq.ipynb). It focuses on recipes that the [Social Feed Manager](http://gwu-libraries.github.io/sfm-ui/) team has used when preparing datasets of tweets for researchers.
#
# We will continue to add recipes to this notebook. If you have any suggestions, please [contact us](http://gwu-libraries.github.io/sfm-ui/contact).
#
# This notebook requires [jq](https://stedolan.github.io/jq/) 1.5 or later. Note that your package manager may only offer earlier versions; manual installation may be necessary.
#
# These recipes can be used with any data source that outputs tweets as line-oriented JSON. Within the context of SFM, this is usually the output of [twitter_rest_warc_iter.py or twitter_stream_warc_iter.py](http://sfm.readthedocs.io/en/latest/processing.html#warc-iterators) within a [processing container](http://sfm.readthedocs.io/en/latest/processing.html#processing-in-container). Alternatively, [Twarc](https://github.com/DocNow/twarc) is a command-line tool for retrieving data from the Twitter API that outputs tweets as line-oriented JSON.
#
# For the purposes of this notebook, we will use a line-oriented JSON file that was created using Twarc. It contains the user timeline of @SocialFeedMgr. The command used to produce this file was `twarc.py --timeline socialfeedmgr > tweets.json`.
#
# For an explanation of the fields in a tweet, see the [Tweet Field Guide](https://dev.twitter.com/overview/api/tweets). For other helpful tweet-processing utilities, see [twarc utils](https://github.com/DocNow/twarc/tree/master/utils).
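# To make the line-oriented JSON format concrete, here is a short Python sketch (using fabricated inline data rather than `tweets.json`) showing the key property these recipes rely on: each line is an independent JSON document.

```python
import json

# Two fabricated tweets standing in for tweets.json (one JSON object per line).
sample = "\n".join([
    json.dumps({"id_str": "1", "text": "First tweet"}),
    json.dumps({"id_str": "2", "text": "Second tweet"}),
])

# Each line parses on its own -- the property jq's per-line processing depends on.
for line in sample.splitlines():
    tweet = json.loads(line)
    print([tweet["id_str"], tweet["text"]])
```

# Because no line depends on any other, tools like jq, `head`, and `split` can stream and slice the file freely.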
# For the sake of brevity, some of the examples output only a subset of the tweet fields and/or a subset of the tweets contained in `tweets.json`. The following example outputs the tweet id and text of the first 5 tweets.

# In[1]:

get_ipython().system("head -n5 tweets.json | jq -c '[.id_str, .text]'")


# ## Dates

# For both filtering and output, it is often necessary to parse and/or normalize the `created_at` date. The following shows the original `created_at` date alongside the same date as an ISO 8601 date. (The parentheses keep the `strptime` pipeline from applying to both elements of the array, since `|` binds more loosely than `,` in jq.)

# In[2]:

get_ipython().system('head -n5 tweets.json | jq -c \'[.created_at, (.created_at | strptime("%A %B %d %T %z %Y") | todate)]\'')


# ## Filtering

# ### Filtering text

# #### Case sensitive

# In[3]:

get_ipython().system('cat tweets.json | jq -c \'select(.text | contains("blog")) | [.id_str, .text]\'')


# In[4]:

get_ipython().system('cat tweets.json | jq -c \'select(.text | contains("BLOG")) | [.id_str, .text]\'')


# #### Case insensitive

# To ignore case, use a [regular expression filter](https://stedolan.github.io/jq/manual/#RegularexpressionsPCRE) with the case-insensitive flag.

# In[5]:

get_ipython().system('cat tweets.json | jq -c \'select(.text | test("BLog"; "i")) | [.id_str, .text]\'')


# #### Filtering on multiple terms (OR)

# In[6]:

get_ipython().system('cat tweets.json | jq -c \'select(.text | test("BLog|twarc"; "i")) | [.id_str, .text]\'')


# #### Filtering on multiple terms (AND)

# In[7]:

get_ipython().system('cat tweets.json | jq -c \'select((.text | test("BLog"; "i")) and (.text | test("twitter"; "i"))) | [.id_str, .text]\'')


# ### Filter dates

# The following shows tweets created after November 5, 2016.
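# As an aside for readers working outside jq, the same date-cutoff comparison can be sketched in plain Python. This is illustrative only; it uses Python's `%a %b` format tokens for Twitter's abbreviated day and month names, and the dates are made up.

```python
from datetime import datetime

# Twitter's created_at format, e.g. "Sat Nov 05 13:27:01 +0000 2016".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def created_after(created_at, cutoff_iso):
    """True if the tweet was created after the ISO 8601 cutoff."""
    created = datetime.strptime(created_at, TWITTER_FORMAT)
    cutoff = datetime.fromisoformat(cutoff_iso)
    return created > cutoff

print(created_after("Sun Nov 06 10:00:00 +0000 2016", "2016-11-05T00:00:00+00:00"))  # → True
print(created_after("Tue Nov 01 10:00:00 +0000 2016", "2016-11-05T00:00:00+00:00"))  # → False
```

# The jq cell that follows does the same comparison numerically, via `mktime` and `fromdateiso8601`.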
# In[8]:

get_ipython().system('cat tweets.json | jq -c \'select((.created_at | strptime("%A %B %d %T %z %Y") | mktime) > ("2016-11-05T00:00:00Z" | fromdateiso8601)) | [.id_str, .created_at, (.created_at | strptime("%A %B %d %T %z %Y") | todate)]\'')


# ### Is retweet

# In[9]:

get_ipython().system('cat tweets.json | jq -c \'select(has("retweeted_status")) | [.id_str, .retweeted_status.id]\'')


# ### Is quote

# In[10]:

get_ipython().system('cat tweets.json | jq -c \'select(has("quoted_status")) | [.id_str, .quoted_status.id]\'')


# ## Output

# To write output to a file, use `>`. For example: `cat tweets.json | jq -r '.id_str' > tweet_ids.txt`

# ### CSV

# Following is a CSV output that has fields similar to the CSV output produced by [SFM's export functionality](http://sfm.readthedocs.io/en/latest/quickstart.html#exports).
#
# Note that it uses the `-r` flag for jq instead of the `-c` flag.
#
# Also note that it is necessary to remove line breaks from the tweet text to prevent them from breaking the CSV. This is done with `(.text | gsub("\n";" "))`.

# In[11]:

get_ipython().system('head -n5 tweets.json | jq -r \'[(.created_at | strptime("%A %B %d %T %z %Y") | todate), .id_str, .user.screen_name, .user.followers_count, .user.friends_count, .retweet_count, .favorite_count, .in_reply_to_screen_name, "http://twitter.com/" + .user.screen_name + "/status/" + .id_str, (.text | gsub("\\n";" ")), has("retweeted_status"), has("quoted_status")] | @csv\'')


# #### Header row

# The header row should be written to the output file with `>` before appending the CSV rows with `>>`.

# In[12]:

get_ipython().system('echo "[]" | jq -r \'["created_at","twitter_id","screen_name","followers_count","friends_count","retweet_count","favorite_count","in_reply_to_screen_name","twitter_url","text","is_retweet","is_quote"] | @csv\'')


# #### Splitting files

# Excel can load CSV files with up to about a million rows. However, for practical purposes a much smaller number is recommended.
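# The shell `split` recipe below can also be sketched in pure Python, which may be handy on systems where `split` lacks the GNU flags. The function and file names here are illustrative.

```python
from itertools import islice

def chunked(lines, size):
    """Yield successive batches of at most `size` lines (like `split --lines`)."""
    it = iter(lines)
    while batch := list(islice(it, size)):
        yield batch

# Twelve placeholder CSV rows split into batches of 5.
rows = [f"row {i}" for i in range(12)]
for n, batch in enumerate(chunked(rows, 5)):
    print(f"tweets{n:02d}.csv would hold {len(batch)} rows")
```

# Each batch would be written to its own numbered file, mirroring the `tweets00.csv`, `tweets01.csv`, ... naming that `split -d` produces.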
# The following uses the `split` command to split the CSV output into multiple files. Note that the flags accepted by `split` may differ in your environment.
#
# ```
# cat tweets.json | jq -r '[.id_str, (.text | gsub("\n";" "))] | @csv' | split --lines=5 -d --additional-suffix=.csv - tweets
# ls *.csv
# tweets00.csv tweets01.csv tweets02.csv tweets03.csv tweets04.csv
# tweets05.csv tweets06.csv tweets07.csv tweets08.csv tweets09.csv
# ```
#
# `--lines=5` sets the number of lines to include in each file.
#
# `--additional-suffix=.csv` sets the file extension.
#
# `tweets` is the base name for each file.

# ### Tweet ids

# When outputting tweet ids, `.id_str` should be used instead of `.id`. See [Ed Summers's blog post](http://inkdroid.org/2016/11/30/overflow/) for an explanation.

# In[13]:

get_ipython().system("head -n5 tweets.json | jq -r '.id_str'")


# In[ ]:
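# A short Python illustration of why `.id_str` matters: tweet ids are 64-bit integers larger than 2**53, so any tool that round-trips them through a double-precision float (as JavaScript and some spreadsheet importers do) silently corrupts them. The id below is fabricated.

```python
# A fabricated 64-bit tweet id, larger than 2**53 (the largest integer a
# double can represent exactly).
tweet_id = 796589165838172161

# Round-tripping through float loses the low-order digits.
as_float = float(tweet_id)
print(int(as_float) == tweet_id)  # → False: precision was lost
```

# Keeping the id as a string (`.id_str`) sidesteps the problem entirely.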