#!/usr/bin/env python
# coding: utf-8

# # Recipes for processing Twitter data with jq

# This notebook is a companion to [Getting Started Working with Twitter Data Using jq](http://nbviewer.jupyter.org/github/gwu-libraries/notebooks/blob/master/20160407-twitter-analysis-with-jq/Working-with-twitter-using-jq.ipynb). It focuses on recipes that the [Social Feed Manager](http://gwu-libraries.github.io/sfm-ui/) team has used when preparing datasets of tweets for researchers.
#
# We will continue to add recipes to this notebook. If you have any suggestions, please [contact us](http://gwu-libraries.github.io/sfm-ui/contact).
#
# This notebook requires [jq](https://stedolan.github.io/jq/) 1.5 or later. Note that your package manager may only offer earlier versions; manual installation may be necessary.
#
# These recipes can be used with any data source that outputs tweets as line-oriented JSON. Within the context of SFM, this is usually the output of [twitter_rest_warc_iter.py or twitter_stream_warc_iter.py](http://sfm.readthedocs.io/en/latest/processing.html#warc-iterators) within a [processing container](http://sfm.readthedocs.io/en/latest/processing.html#processing-in-container). Alternatively, [Twarc](https://github.com/DocNow/twarc) is a command-line tool for retrieving data from the Twitter API that outputs tweets as line-oriented JSON.
#
# For the purposes of this notebook, we will use a line-oriented JSON file that was created using Twarc. It contains the user timeline of @SocialFeedMgr. The command used to produce this file was `twarc.py --timeline socialfeedmgr > tweets.json`.
#
# For an explanation of the fields in a tweet, see the [Tweet Field Guide](https://dev.twitter.com/overview/api/tweets). For other helpful tweet-processing utilities, see [twarc utils](https://github.com/DocNow/twarc/tree/master/utils).
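# To make the line-oriented JSON format concrete, here is a short Python sketch (using fabricated inline data rather than `tweets.json`) showing the key property these recipes rely on: each line is an independent JSON document.

```python
import json

# Two fabricated tweets standing in for tweets.json (one JSON object per line).
sample = "\n".join([
    json.dumps({"id_str": "1", "text": "First tweet"}),
    json.dumps({"id_str": "2", "text": "Second tweet"}),
])

# Each line parses on its own -- the property jq's per-line processing depends on.
for line in sample.splitlines():
    tweet = json.loads(line)
    print([tweet["id_str"], tweet["text"]])
```

# Because no line depends on any other, tools like jq, `head`, and `split` can stream and slice the file freely.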
# For the sake of brevity, some of the examples output only a subset of the tweet fields and/or a subset of the tweets contained in `tweets.json`. The following example outputs the tweet id and text of the first 5 tweets.

# In[1]:

get_ipython().system("head -n5 tweets.json | jq -c '[.id_str, .text]'")


# ## Dates

# For both filtering and output, it is often necessary to parse and/or normalize the `created_at` date. The following shows the original `created_at` date alongside the same date as an ISO 8601 date. (The parentheses keep the `strptime` pipeline from applying to both elements of the array, since `|` binds more loosely than `,` in jq.)

# In[2]:

get_ipython().system('head -n5 tweets.json | jq -c \'[.created_at, (.created_at | strptime("%A %B %d %T %z %Y") | todate)]\'')


# ## Filtering

# ### Filtering text

# #### Case sensitive

# In[3]:

get_ipython().system('cat tweets.json | jq -c \'select(.text | contains("blog")) | [.id_str, .text]\'')


# In[4]:

get_ipython().system('cat tweets.json | jq -c \'select(.text | contains("BLOG")) | [.id_str, .text]\'')


# #### Case insensitive

# To ignore case, use a [regular expression filter](https://stedolan.github.io/jq/manual/#RegularexpressionsPCRE) with the case-insensitive flag.

# In[5]:

get_ipython().system('cat tweets.json | jq -c \'select(.text | test("BLog"; "i")) | [.id_str, .text]\'')


# #### Filtering on multiple terms (OR)

# In[6]:

get_ipython().system('cat tweets.json | jq -c \'select(.text | test("BLog|twarc"; "i")) | [.id_str, .text]\'')


# #### Filtering on multiple terms (AND)

# In[7]:

get_ipython().system('cat tweets.json | jq -c \'select((.text | test("BLog"; "i")) and (.text | test("twitter"; "i"))) | [.id_str, .text]\'')


# ### Filter dates

# The following shows tweets created after November 5, 2016.
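# As an aside for readers working outside jq, the same date-cutoff comparison can be sketched in plain Python. This is illustrative only; it uses Python's `%a %b` format tokens for Twitter's abbreviated day and month names, and the dates are made up.

```python
from datetime import datetime

# Twitter's created_at format, e.g. "Sat Nov 05 13:27:01 +0000 2016".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def created_after(created_at, cutoff_iso):
    """True if the tweet was created after the ISO 8601 cutoff."""
    created = datetime.strptime(created_at, TWITTER_FORMAT)
    cutoff = datetime.fromisoformat(cutoff_iso)
    return created > cutoff

print(created_after("Sun Nov 06 10:00:00 +0000 2016", "2016-11-05T00:00:00+00:00"))  # → True
print(created_after("Tue Nov 01 10:00:00 +0000 2016", "2016-11-05T00:00:00+00:00"))  # → False
```

# The jq cell that follows does the same comparison numerically, via `mktime` and `fromdateiso8601`.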
# In[8]:

get_ipython().system('cat tweets.json | jq -c \'select((.created_at | strptime("%A %B %d %T %z %Y") | mktime) > ("2016-11-05T00:00:00Z" | fromdateiso8601)) | [.id_str, .created_at, (.created_at | strptime("%A %B %d %T %z %Y") | todate)]\'')


# ### Is retweet

# In[9]:

get_ipython().system('cat tweets.json | jq -c \'select(has("retweeted_status")) | [.id_str, .retweeted_status.id]\'')


# ### Is quote

# In[10]:

get_ipython().system('cat tweets.json | jq -c \'select(has("quoted_status")) | [.id_str, .quoted_status.id]\'')


# ## Output

# To write output to a file, use `>`. For example: `cat tweets.json | jq -r '.id_str' > tweet_ids.txt`

# ### CSV

# Following is a CSV output that has fields similar to the CSV output produced by [SFM's export functionality](http://sfm.readthedocs.io/en/latest/quickstart.html#exports).
#
# Note that it uses the `-r` flag for jq instead of the `-c` flag.
#
# Also note that it is necessary to remove line breaks from the tweet text to prevent them from breaking the CSV. This is done with `(.text | gsub("\n";" "))`.

# In[11]:

get_ipython().system('head -n5 tweets.json | jq -r \'[(.created_at | strptime("%A %B %d %T %z %Y") | todate), .id_str, .user.screen_name, .user.followers_count, .user.friends_count, .retweet_count, .favorite_count, .in_reply_to_screen_name, "http://twitter.com/" + .user.screen_name + "/status/" + .id_str, (.text | gsub("\\n";" ")), has("retweeted_status"), has("quoted_status")] | @csv\'')


# #### Header row

# The header row should be written to the output file with `>` before appending the CSV rows with `>>`.

# In[12]:

get_ipython().system('echo "[]" | jq -r \'["created_at","twitter_id","screen_name","followers_count","friends_count","retweet_count","favorite_count","in_reply_to_screen_name","twitter_url","text","is_retweet","is_quote"] | @csv\'')


# #### Splitting files

# Excel can load CSV files with up to about a million rows. However, for practical purposes a much smaller number is recommended.
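# The shell `split` recipe below can also be sketched in pure Python, which may be handy on systems where `split` lacks the GNU flags. The function and file names here are illustrative.

```python
from itertools import islice

def chunked(lines, size):
    """Yield successive batches of at most `size` lines (like `split --lines`)."""
    it = iter(lines)
    while batch := list(islice(it, size)):
        yield batch

# Twelve placeholder CSV rows split into batches of 5.
rows = [f"row {i}" for i in range(12)]
for n, batch in enumerate(chunked(rows, 5)):
    print(f"tweets{n:02d}.csv would hold {len(batch)} rows")
```

# Each batch would be written to its own numbered file, mirroring the `tweets00.csv`, `tweets01.csv`, ... naming that `split -d` produces.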
# The following uses the `split` command to split the CSV output into multiple files. Note that the flags accepted by `split` may differ in your environment.
#
# ```
# cat tweets.json | jq -r '[.id_str, (.text | gsub("\n";" "))] | @csv' | split --lines=5 -d --additional-suffix=.csv - tweets
# ls *.csv
# tweets00.csv tweets01.csv tweets02.csv tweets03.csv tweets04.csv
# tweets05.csv tweets06.csv tweets07.csv tweets08.csv tweets09.csv
# ```
#
# `--lines=5` sets the number of lines to include in each file.
#
# `--additional-suffix=.csv` sets the file extension.
#
# `tweets` is the base name for each file.

# ### Tweet ids

# When outputting tweet ids, `.id_str` should be used instead of `.id`. See [Ed Summers's blog post](http://inkdroid.org/2016/11/30/overflow/) for an explanation.

# In[13]:

get_ipython().system("head -n5 tweets.json | jq -r '.id_str'")


# In[ ]:
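# A short Python illustration of why `.id_str` matters: tweet ids are 64-bit integers larger than 2**53, so any tool that round-trips them through a double-precision float (as JavaScript and some spreadsheet importers do) silently corrupts them. The id below is fabricated.

```python
# A fabricated 64-bit tweet id, larger than 2**53 (the largest integer a
# double can represent exactly).
tweet_id = 796589165838172161

# Round-tripping through float loses the low-order digits.
as_float = float(tweet_id)
print(int(as_float) == tweet_id)  # → False: precision was lost
```

# Keeping the id as a string (`.id_str`) sidesteps the problem entirely.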