This data comes from mybinder.org, a web service that runs Jupyter notebooks live on the web (you may be running this notebook there now). My Binder publishes a record every time someone launches a live notebook like this one and stores those records in publicly accessible JSON files, one file per day.
This data is stored as JSON-encoded text files on the public web. Here are some example lines.
import dask.bag as db
db.read_text('https://archive.analytics.mybinder.org/events-2018-11-03.jsonl').take(3)
We see that the file includes one line for every time someone started a live notebook on the site. Each line records the time that the notebook was started, as well as the repository from which it was served.
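To make the record format concrete, here is a minimal sketch of parsing one such line with the standard library. The sample line is made up for illustration, and the field names `timestamp`, `provider`, and `spec` are assumptions based on the description above and on the columns used later in this notebook; the real files may carry additional fields.

```python
import json

# A made-up line in the JSON Lines style (one JSON object per line).
# Field names and values are illustrative assumptions, not real data.
sample_line = (
    '{"timestamp": "2018-11-03T00:00:00+00:00", '
    '"provider": "GitHub", '
    '"spec": "some-user/some-repo/master"}'
)

# json.loads turns the text of one line into a Python dictionary.
record = json.loads(sample_line)
print(record["provider"], record["spec"])
```

This is the same per-line step that `.map(json.loads)` applies to every line of a Dask bag.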
In this notebook we'll look at many such files, parse them from JSON to Python dictionaries, and then from there to Pandas dataframes. We'll then do some simple analyses on this data.
Starting the Dask Client is optional. It also starts the dashboard, which is useful for gaining insight into the computation.
from dask.distributed import Client, progress
client = Client(threads_per_worker=1, n_workers=4, memory_limit='2GB')
client
The mybinder.org team maintains an index file that points to all other available JSON data files. Let's convert this to a list of URLs that we'll read in the next section.
import dask.bag as db
import json

filenames = (db.read_text('https://archive.analytics.mybinder.org/index.jsonl')
               .map(json.loads)
               .pluck('name')
               .compute())
filenames = ['https://archive.analytics.mybinder.org/' + fn for fn in filenames]
filenames[:5]
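As a rough illustration of what the pipeline above does, the following plain-Python sketch performs the same two steps on a tiny in-memory index. The index lines are made up (the second file name is a hypothetical one following the pattern of the first), and only the `name` field is assumed to exist.

```python
import json

BASE_URL = "https://archive.analytics.mybinder.org/"

# Made-up index lines for illustration; only the "name" field is relied on.
index_lines = [
    '{"name": "events-2018-11-03.jsonl"}',
    '{"name": "events-2018-11-04.jsonl"}',
]

# Equivalent in spirit to .map(json.loads).pluck('name'):
# parse each line, then select one field from each dictionary.
names = [json.loads(line)["name"] for line in index_lines]

# Prefix each relative file name with the archive's base URL.
urls = [BASE_URL + name for name in names]
print(urls)
```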
events = db.read_text(filenames).map(json.loads)
events.take(2)
Let's do a simple frequency count to find the binders that are run most often.
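The idea of a frequency count can be sketched in plain Python with `collections.Counter`. The records below are made up, and the assumption is that the `spec` field identifies the repository a binder was launched from. On a Dask bag, the analogous step would presumably combine `pluck` with `frequencies`, but this sketch sticks to the standard library.

```python
from collections import Counter

# Made-up launch records; values are illustrative only.
records = [
    {"provider": "GitHub", "spec": "user-a/repo-a/master"},
    {"provider": "GitHub", "spec": "user-b/repo-b/master"},
    {"provider": "GitHub", "spec": "user-a/repo-a/master"},
]

# Count how many times each spec appears.
counts = Counter(r["spec"] for r in records)

# most_common(n) returns the n most frequent (spec, count) pairs,
# ordered from most to least frequent.
top = counts.most_common(2)
print(top)
```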
df = events.to_dataframe()
df.head()
This dataset fits nicely into memory. Let's avoid downloading the data every time we do an operation and instead keep it local in memory.
df = df.persist()
Honestly, at this point it makes more sense to just switch to Pandas, but this is a Dask example, so we'll continue with Dask dataframe.
Most binders are specified as Git repositories on GitHub, but not all. Let's investigate other providers.
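import urllib.parse  # needed for urllib.parse.unquote below

# Binders served from GitLab: decode the URL-quoted spec, then count launches per spec
(df[df.provider == 'GitLab']
 .spec
 .map(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())

# Binders served from plain Git repositories: same decoding and counting
(df[df.provider == 'Git']
 .spec
 .apply(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())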
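The `urllib.parse.unquote` call above reverses percent-encoding, which is why the specs become readable after it is applied. Here is a minimal sketch on a single string; the encoded spec is made up, and the assumption that non-GitHub specs embed a percent-encoded repository path or URL is inferred from the use of `unquote` in this notebook.

```python
from urllib.parse import unquote

# A made-up, percent-encoded spec for illustration only.
encoded_spec = "https%3A%2F%2Fgitlab.com%2Fsome-user%2Fsome-repo/master"

# unquote replaces %XX escapes with the characters they stand for
# (%3A -> ':', %2F -> '/').
decoded_spec = unquote(encoded_spec)
print(decoded_spec)
```

Passing `meta=('spec', object)` in the Dask calls tells Dask the name and dtype of the resulting series, so it does not have to infer them by running the function on sample data.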