Homework

The constraints of a tutorial environment make it hard to work with real-world, moderately large datasets, which keeps the experience from being fully satisfying. To remedy this we recommend playing with the following datasets on your own. Please wait until you're off the conference WiFi before downloading them.

NYCTaxi

Download link

Taxi trips taken in 2013, released in response to a FOIA request. Around 20GB of uncompressed CSV.

Try the following:

  • Use dask.dataframe with pandas-style queries
  • Store in HDF5 both with and without categoricals, measure the size of the file and query times
  • Set the index by one of the date-time columns and store in castra (also using categoricals). Perform range queries and measure speed. What size and complexity of query can you perform while still having an "interactive" experience?

Github Archive

Download link

Every public GitHub event for the last few years, stored as gzip-compressed, line-delimited JSON. Watch out: the schema switches at the 2014-2015 transition.

Try the following:

  • Use dask.bag to inspect the data
  • Drill down using functions like pluck and filter
  • Find who the most popular committers were in 2015
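The `pluck`/`filter` drill-down might look like the sketch below. The event records here are hypothetical miniatures; the real bag would come from something like `db.read_text('2015-*.json.gz').map(json.loads)`, and "most popular committers" would need the actual event schema.

```python
import dask.bag as db

# Hypothetical miniature events; the real data would be loaded with
# db.read_text('2015-*.json.gz').map(json.loads)
events = db.from_sequence([
    {'type': 'PushEvent', 'actor': {'login': 'alice'}},
    {'type': 'PushEvent', 'actor': {'login': 'bob'}},
    {'type': 'IssuesEvent', 'actor': {'login': 'alice'}},
    {'type': 'PushEvent', 'actor': {'login': 'alice'}},
], npartitions=2)

# Filter down to pushes, pluck out the login, and count
pushes = events.filter(lambda d: d['type'] == 'PushEvent')
top = (pushes.pluck('actor')
             .pluck('login')
             .frequencies()
             .topk(1, key=lambda kv: kv[1])
             .compute())
```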

Reddit Comments

Download link

Every publicly available Reddit comment, available as a large torrent.

Try the following:

  • Use dask.bag to inspect the data
  • Combine dask.bag with nltk or gensim to perform textual analysis on the data
  • Reproduce the work of Daniel Rodriguez and see if you can improve upon his speeds when analyzing this data.
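A word-frequency pass over comment bodies can be sketched like this. Everything here is a toy assumption: three hard-coded strings stand in for the comment `body` fields, and a naive whitespace split stands in for `nltk.word_tokenize`; the real bag would come from something like `db.read_text('RC_2015-*.bz2').map(json.loads).pluck('body')`.

```python
import dask.bag as db

# Hypothetical miniature comment bodies; the real data would be loaded
# with db.read_text(...).map(json.loads).pluck('body')
comments = db.from_sequence([
    'dask makes pandas scale',
    'pandas is great',
    'dask is great',
], npartitions=2)

# Naive whitespace tokenizer standing in for nltk.word_tokenize
words = comments.map(str.split).flatten()
common = words.frequencies().topk(3, key=lambda kv: kv[1]).compute()
```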

NYC 311

Download link

All 311 service requests since 2010 in New York City.

European Centre for Medium Range Weather Forecasts

Download script

Download historical global weather data from the ECMWF.

Try the following:

  • What is the variance in temperature over time?
  • What areas experienced the largest temperature swings in the last month relative to their previous history?
  • Plot the temperature of the earth as a function of latitude and then as a function of longitude
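The variance and per-latitude questions above reduce to axis reductions on a (time, latitude, longitude) array. The sketch below uses a small random array as a stand-in; the real ECMWF grids would be loaded from the downloaded files (e.g. via netCDF/HDF5 and `da.from_array`), and the axis layout is an assumption.

```python
import numpy as np
import dask.array as da

# Toy (time, lat, lon) array standing in for the real ECMWF grids,
# which would typically be wrapped with da.from_array from netCDF/HDF5
np.random.seed(0)
temps = da.from_array(np.random.rand(10, 4, 8), chunks=(5, 4, 8))

var_over_time = temps.var(axis=0).compute()      # variance per grid cell
mean_by_lat = temps.mean(axis=(0, 2)).compute()  # temperature vs. latitude
```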