#!/usr/bin/env python
# coding: utf-8

# # Getting Started Working with Twitter Data Using jq
#
# [jq](https://stedolan.github.io/jq/) is a command line JSON processor that's helpful for working with JSON data from Twitter. You'll want to [download and install jq](https://stedolan.github.io/jq/download/) on your system to use this notebook with your data. You could also use [jqplay](https://jqplay.org/) to try out these jq statements.
#
# This notebook works with tweets collected from the Twitter filter stream API using an earlier version of [Social Feed Manager](http://go.gwu.edu/sfm), but there are lots of tools for getting data from the Twitter APIs. To use this notebook with your own data, set the path to your data file as DATA.
#
# As background, Twitter streaming API data is line-oriented JSON, meaning one tweet in JSON format per line. Output from tools such as [twarc](https://github.com/edsu/twarc) is also often line-oriented JSON.
#
# This notebook is intended to help people get started working with Twitter data using jq. There are many additional software libraries available for further analysis, including within a notebook. As an example, see [Cody Buntain's notebook](http://nbviewer.jupyter.org/github/cbuntain/TwitterFergusonTeachIn/blob/master/session_05.ipynb) analyzing #Ferguson tweets as part of the Researching Ferguson Teach-In at MITH in 2015.
#
# We use jq a lot in working with students and faculty at [GW Libraries](https://library.gwu.edu). Do you have useful jq statements we could share here? We welcome suggestions and improvements to this notebook via [GitHub](https://github.com/gwu-libraries/notebooks/tree/master/20160407-twitter-analysis-with-jq), Twitter ([@liblaura](https://twitter.com/liblaura), [@dankerchner](https://twitter.com/dankerchner), [@justin_littman](https://twitter.com/justin_littman)), or email (lwrubel at gwu dot edu).

# In[1]:

DATA = "data/tweets"


# ## Basic filtering
#
# View the JSON data, both keys and values, in a prettified format. I'm using the `head` command to show just the first tweet in the file. Alternatively, you can use `cat` to look at the whole file.

# In[2]:

get_ipython().system("head -1 $DATA | jq '.'")


# View just the __values__ of each field, without the labels:

# In[3]:

get_ipython().system("head -1 $DATA | jq '.[]'")


# Filter your data down to __specific fields__:

# In[4]:

get_ipython().system("head -3 $DATA | jq '[.created_at, .text]'")


# The Twitter API documentation describes the responses from the [streaming](https://dev.twitter.com/streaming/overview) (e.g. filter, sample) and [REST](https://dev.twitter.com/rest/public) (user timeline, search) APIs.
#
# JSON is __hierarchical__, and the `created_at` and `text` fields are at the top level of the tweet. Some fields in a tweet have additional fields nested within them. For example, the `user` field contains fields with information about the user who tweeted, including a count of their followers, location, and a unique id (`id_str`):

# In[5]:

get_ipython().system("head -1 $DATA | jq '[.user]'")


# To filter for a __subset of the `user` fields__, use dot notation:

# In[6]:

get_ipython().system("head -2 $DATA | jq '[.user.screen_name, .user.name, .user.followers_count, .user.id_str]'")


# Some fields occur multiple times in a tweet, such as hashtags and mentions.
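# For example, `entities.hashtags` is an array with one object per hashtag, each with `text` and `indices` keys (per the standard tweet JSON). You can peek at the raw array for the first tweet:

# In[ ]:

get_ipython().system("head -1 $DATA | jq '.entities.hashtags'")
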

# Pull out the hashtag text fields and put them __together into one field__, separated by commas:

# In[7]:

get_ipython().system('cat $DATA | jq \'[([.entities.hashtags[].text] | join(","))]\'')


# ## Output to CSV
#
# A common use of jq is to turn your __JSON data into a csv file__ to load into other analysis software. The -r option (--raw-output) outputs raw strings suitable for csv, as opposed to JSON-formatted strings wrapped in quotes.

# In[8]:

get_ipython().system("head -8 $DATA | jq -r '[.id_str, .created_at, .text] | @csv'")


# You probably want to write that data to a file, however:

# In[9]:

get_ipython().system("cat $DATA | jq -r '[.id_str, .created_at, .text] | @csv' > tweets.csv")


# In[10]:

get_ipython().system('head tweets.csv')


# Some fields, particularly the text of a tweet, can contain __newline characters__. This can be a problem in your csv, breaking a single tweet across multiple lines. Substitute all occurrences of the newline character (\n) with a space:

# In[11]:

get_ipython().system('cat $DATA | jq -r \'[.id_str, .created_at, (.text | gsub("\\n";" "))] | @csv\' > tweets-oneline.csv')


# In[12]:

get_ipython().system('head tweets-oneline.csv')


# ## Output to JSON
#
# If you'd like JSON as your output format, you can specify the keys of the JSON objects created in the output:

# In[13]:

get_ipython().system("cat $DATA | jq -c '{{id: .id_str, user_id: .user.id_str, screen_name: .user.screen_name, created_at: .created_at, text: .text, user_mentions: [.entities.user_mentions[]?.screen_name], hashtags: [.entities.hashtags[]?.text], urls: [.entities.urls[]?.expanded_url]}}' > newtweets.json")


# In[14]:

get_ipython().system('head newtweets.json')
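
# The csv files created above can be loaded into other analysis tools. As a minimal sketch (assuming you have the pandas library installed), here is one way to read `tweets.csv` into a DataFrame; the column names below simply mirror the fields selected when the file was written, since jq's @csv output has no header row.

# In[ ]:

# A minimal sketch, assuming pandas is installed: load tweets.csv (written above) into a DataFrame.
# The column names match the fields selected in the jq command that created the file;
# dtype=str keeps the long numeric tweet ids as strings rather than numbers.
import pandas as pd

tweets = pd.read_csv("tweets.csv", header=None, names=["id_str", "created_at", "text"], dtype=str)
tweets.head()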