#!/usr/bin/env python
# coding: utf-8

# # Getting Started Working with Twitter Data Using jq
#
# [jq](https://stedolan.github.io/jq/) is a command line JSON processor that's helpful for working with JSON data from Twitter. You'll want to [download and install jq](https://stedolan.github.io/jq/download/) on your system to use this notebook with your data. You could also use [jqplay](https://jqplay.org/) to try out these jq statements.
#
# This notebook works with tweets collected from the Twitter filter stream API using an earlier version of [Social Feed Manager](http://go.gwu.edu/sfm), but there are lots of tools for getting data from the Twitter APIs. To use this notebook with your own data, set the path to your data file as DATA.
#
# As background, Twitter streaming API data is line-oriented JSON, meaning one tweet in JSON format per line. Output from tools such as [twarc](https://github.com/edsu/twarc) is also often line-oriented JSON.
#
# This notebook is intended to help people get started working with Twitter data using jq. There are many additional software libraries available for further analysis, including within a notebook. As an example, see [Cody Buntain's notebook](http://nbviewer.jupyter.org/github/cbuntain/TwitterFergusonTeachIn/blob/master/session_05.ipynb) analyzing #Ferguson tweets as part of the Researching Ferguson Teach-In at MITH in 2015.
#
# We use jq a lot in working with students and faculty at [GW Libraries](https://library.gwu.edu). Do you have useful jq statements we could share here? We welcome suggestions and improvements to this notebook via [GitHub](https://github.com/gwu-libraries/notebooks/tree/master/20160407-twitter-analysis-with-jq), Twitter ([@liblaura](https://twitter.com/liblaura), [@dankerchner](https://twitter.com/dankerchner), [@justin_littman](https://twitter.com/justin_littman)), or email (lwrubel at gwu dot edu).

# In[1]:

DATA = "data/tweets"


# ## Basic filtering
#
# View the JSON data, both keys and values, in a prettified format. I'm using the `head` command to show just the first tweet in the file. Alternatively, you can use `cat` to look at the whole file.

# In[2]:

get_ipython().system("head -1 $DATA | jq '.'")


# View just the __values__ of each field, without the labels:

# In[3]:

get_ipython().system("head -1 $DATA | jq '.[]'")


# Filter your data down to __specific fields__:

# In[4]:

get_ipython().system("head -3 $DATA | jq '[.created_at, .text]'")


# The Twitter API documentation describes the responses from the [streaming](https://dev.twitter.com/streaming/overview) (e.g. filter, sample) and [REST](https://dev.twitter.com/rest/public) (user timeline, search) APIs.
#
# JSON is __hierarchical__, and the `created_at` and `text` fields are at the top level of the tweet. Some fields in a tweet have additional fields nested within them. For example, the `user` field contains fields with information about the user who tweeted, including a count of their followers, location, and a unique id (`id_str`):

# In[5]:

get_ipython().system("head -1 $DATA | jq '[.user]'")


# To filter for a __subset of the `user` fields__, use dot notation:

# In[6]:

get_ipython().system("head -2 $DATA | jq '[.user.screen_name, .user.name, .user.followers_count, .user.id_str]'")


# Some fields occur multiple times in a tweet, such as hashtags and mentions.
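# For example, `entities.hashtags` is an array with one object per hashtag, each with `text` and `indices` keys (per the standard tweet JSON). You can peek at the raw array for the first tweet:

# In[ ]:

get_ipython().system("head -1 $DATA | jq '.entities.hashtags'")
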

# Pull out the hashtag text fields and put them __together into one field__, separated by commas:

# In[7]:

get_ipython().system('cat $DATA | jq \'[([.entities.hashtags[].text] | join(","))]\'')


# ## Output to CSV
#
# A common use of jq is to turn your __JSON data into a csv file__ to load into other analysis software. The -r option (--raw-output) outputs raw strings suitable for csv, as opposed to JSON-formatted strings wrapped in quotes.

# In[8]:

get_ipython().system("head -8 $DATA | jq -r '[.id_str, .created_at, .text] | @csv'")


# You probably want to write that data to a file, however:

# In[9]:

get_ipython().system("cat $DATA | jq -r '[.id_str, .created_at, .text] | @csv' > tweets.csv")


# In[10]:

get_ipython().system('head tweets.csv')


# Some fields, particularly the text of a tweet, can contain __newline characters__. This can be a problem in your csv, breaking a single tweet across multiple lines. Substitute all occurrences of the newline character (\n) with a space:

# In[11]:

get_ipython().system('cat $DATA | jq -r \'[.id_str, .created_at, (.text | gsub("\\n";" "))] | @csv\' > tweets-oneline.csv')


# In[12]:

get_ipython().system('head tweets-oneline.csv')


# ## Output to JSON
#
# If you'd like JSON as your output format, you can specify the keys of the JSON objects created in the output:

# In[13]:

get_ipython().system("cat $DATA | jq -c '{{id: .id_str, user_id: .user.id_str, screen_name: .user.screen_name, created_at: .created_at, text: .text, user_mentions: [.entities.user_mentions[]?.screen_name], hashtags: [.entities.hashtags[]?.text], urls: [.entities.urls[]?.expanded_url]}}' > newtweets.json")


# In[14]:

get_ipython().system('head newtweets.json')
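
# The csv files created above can be loaded into other analysis tools. As a minimal sketch (assuming you have the pandas library installed), here is one way to read `tweets.csv` into a DataFrame; the column names below simply mirror the fields selected when the file was written, since jq's @csv output has no header row.

# In[ ]:

# A minimal sketch, assuming pandas is installed: load tweets.csv (written above) into a DataFrame.
# The column names match the fields selected in the jq command that created the file;
# dtype=str keeps the long numeric tweet ids as strings rather than numbers.
import pandas as pd

tweets = pd.read_csv("tweets.csv", header=None, names=["id_str", "created_at", "text"], dtype=str)
tweets.head()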