This notebook shows the historical count and a future estimate of the number of *.ipynb files on GitHub. The daily count comes from executing the query `extension:ipynb nbformat_minor` once a day, on most days. We re-render the notebook and publish it daily after the update. We treat the reported search hit total as the count of *.ipynb files. The notebook runs in the jupyter/datascience-notebook:9c0c4a1fc008 Docker image and additionally needs beautifulsoup4==4.4.1, which is installed in the next cell.

!pip install 'beautifulsoup4==4.4.*' > /dev/null
from __future__ import division

import warnings
warnings.simplefilter('ignore')
%matplotlib inline
import time
import requests
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.api as sm
from bs4 import BeautifulSoup
mpl.style.use('ggplot')
figsize = (14,7)
today = time.strftime("%Y-%m-%d")
print('This notebook was last rendered on {}.'.format(today))
This notebook was last rendered on 2016-08-14.
First, let's load the historical data into a DataFrame indexed by date.
hits_df = pd.read_csv('ipynb_counts.csv', index_col=0, header=0, parse_dates=True)
hits_df.reset_index(inplace=True)
hits_df.drop_duplicates(subset='date', inplace=True)
hits_df.set_index('date', inplace=True)
hits_df.tail(3)
date | hits
---|---
2016-08-11 | 465395
2016-08-12 | 466405
2016-08-13 | 467213
Now let's fetch (and save) today's count if we haven't already.
if today not in hits_df.index:
resp = requests.get('https://github.com/search?p=2&q=extension%3Aipynb+nbformat_minor&ref=searchresults&type=Code&utf8=%E2%9C%93')
resp.raise_for_status()
soup = BeautifulSoup(resp.content, 'html.parser')
elem = soup.find('span', {'class': 'counter'})
count_now = int(elem.text.replace(',',''))
# make sure the index remains a timestamp
hits_df.loc[pd.Timestamp(today)] = count_now
# save to the same format we read
hits_df.to_csv('ipynb_counts.csv', date_format='%Y-%m-%d', index_label='date')
hits_df.tail(3)
date | hits
---|---
2016-08-12 | 466405
2016-08-13 | 467213
2016-08-14 | 467836
There might be missing counts for days that we failed to sample. We build up the expected date range and insert NaNs for dates we missed.
til_today = pd.date_range(hits_df.index[0], hits_df.index[-1])
hits_df = hits_df.reindex(til_today)
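As a minimal sketch of this gap-filling step (toy dates and values, not the real data): reindexing against the full expected date range inserts NaN for any day that was never sampled.

```python
import pandas as pd
import numpy as np

# toy series where 2016-01-03 was never sampled
s = pd.Series([100, 110, 130],
              index=pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-04']))

# build the full expected range and reindex; missing days become NaN
full_range = pd.date_range(s.index[0], s.index[-1])
s = s.reindex(full_range)
```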
Now we plot the known notebook counts for each day we've been tracking the query results.
ax = hits_df.plot(title="GitHub search hits for {} days".format(len(hits_df)), figsize=figsize)
ax.set_xlabel('Date')
ax.set_ylabel('# of ipynb files')
The outliers in the data are from GitHub reporting drastically different counts when we sample. We suspect this happens when they rebuild their search index. We'll filter them out now by removing any daily change greater than 2.5 standard deviations from the mean daily change.
daily_deltas = hits_df.hits.diff().fillna(0)
outliers = abs(daily_deltas - daily_deltas.mean()) > 2.5 * daily_deltas.std()
hits_df.loc[outliers] = np.nan
Now we'll do simple linear interpolation for any missing values over days that we failed to sample and days that had outlier counts.
hits_df = hits_df.interpolate(method='time')
ax = hits_df.plot(title="GitHub search hits for {} days sans outliers".format(len(hits_df)),
figsize=figsize)
ax.set_xlabel('Date')
_ = ax.set_ylabel('# of ipynb files')
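A toy illustration of the outlier-masking and time-interpolation steps above (the values are made up, and the threshold is loosened to 1 standard deviation because the toy series is far too short for the 2.5σ rule to trigger):

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2016-01-01', periods=5)
s = pd.Series([100.0, 110.0, 900.0, 130.0, 140.0], index=idx)  # 900 is a bad sample

deltas = s.diff().fillna(0)
# 1σ threshold for this tiny sample; the notebook uses 2.5σ on ~2 years of data
outliers = abs(deltas - deltas.mean()) > 1.0 * deltas.std()
# both the jump up to 900 and the drop back down get masked
s[outliers] = np.nan
s = s.interpolate(method='time')
```

Note that masking by daily delta flags two points per spike (the jump up and the drop back), so a good sample adjacent to an outlier can also be replaced by its interpolated value.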
The total change in the number of *.ipynb file hits between the tracking start date and today is:
total_delta_nbs = hits_df.iloc[-1] - hits_df.iloc[0]
total_delta_nbs
hits    401988
dtype: float64
The daily average change is:
avg_delta_nbs = total_delta_nbs / len(hits_df)
avg_delta_nbs
hits    592.902655
dtype: float64
We can look at the daily change over the entire period. We can also plot the rolling 30-day mean of the daily deltas.
daily_deltas = hits_df.hits.diff().fillna(0)
fig, ax = plt.subplots(figsize=figsize)
ax.plot(daily_deltas.rolling(window=30, min_periods=1).mean(),
        label='30-day rolling mean of daily-change')
ax.plot(daily_deltas, label='24-hour change')
ax.set_xlabel('Date')
ax.set_ylabel('Delta notebook count')
ax.set_title('Change in notebook count')
_ = ax.legend(loc='upper left')
Let's look at the rolling mean in isolation.
fig, ax = plt.subplots(figsize=figsize)
ax.plot(daily_deltas.rolling(window=30, min_periods=1).mean())
ax.set_xlabel('Date')
ax.set_ylabel('Delta notebook count')
_ = ax.set_title('30-day rolling mean of daily-change')
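The rolling-mean call can be illustrated on a toy series; with `min_periods=1`, each early point is simply the mean of however many observations exist so far, so the curve starts at the first data point rather than after a 30-day warm-up:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])
r = s.rolling(window=3, min_periods=1).mean()
# r: [2.0, 3.0, 4.0, 6.0]
```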
We next train an autoregressive model on the time series data. We then use the model to predict the number of notebooks on GitHub a few months out.
def train(df):
    # fit an autoregressive model, letting BIC choose the lag order
    ar_model = sm.tsa.AR(df, freq='D')
    ar_model_res = ar_model.fit(ic='bic')
    return ar_model_res
We look at the model using all data up to and including today's count, plus two historical models.
start_date = '2014-10-20'
end_date = '2017-01-01'
model_dates = [today, '2015-06-01', '2014-11-15']
models = [train(hits_df.loc[:date]) for date in model_dates]
We see that the most recent model retains more lag terms than the two earlier ones under the BIC selection criterion.
pd.DataFrame([m.params for m in models], index=model_dates).T
&nbsp; | 2016-08-14 | 2015-06-01 | 2014-11-15
---|---|---|---
L1.hits | 1.076589 | 1.003372 | 1.001698
L2.hits | -0.285839 | NaN | NaN
L3.hits | 0.212073 | NaN | NaN
const | 78.391724 | -22.485983 | 116.524522
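As a hand sanity check on the coefficients in the table (not how statsmodels is invoked), a one-step AR(3) forecast is just ŷ_t = const + φ1·y_{t-1} + φ2·y_{t-2} + φ3·y_{t-3}. Plugging in the 2016-08-14 model and the last three observed counts:

```python
import numpy as np

const = 78.391724
phi = np.array([1.076589, -0.285839, 0.212073])    # L1, L2, L3 coefficients
recent = np.array([467836.0, 467213.0, 466405.0])  # y_{t-1}, y_{t-2}, y_{t-3}

# one-step-ahead forecast for the next day's count
next_count = const + phi @ recent
```

The result lands a bit above today's count of 467836, consistent with the upward trend in the data.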
We predict everything from the start date to the end date, letting the models run dynamically even within the range of known truth.
predictions = [model.predict(start=start_date, end=end_date, dynamic=True) for model in models]
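`dynamic=True` means that, inside the prediction window, the model's own outputs are fed back in as the lagged inputs instead of the observed values. A toy AR(1) version of that feedback loop (made-up coefficients, not the fitted ones):

```python
# toy AR(1): y_t = const + phi * y_{t-1}
const, phi = 10.0, 0.9
y = 50.0                 # last observed value before the prediction window
preds = []
for _ in range(3):
    y = const + phi * y  # each prediction becomes the next step's input
    preds.append(y)
# preds ≈ [55.0, 59.5, 63.55]
```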
We put all of the predictions in a DataFrame alongside the ground truth for plotting.
eval_df = pd.DataFrame(predictions, index=model_dates).T
eval_df['truth'] = hits_df.hits
title = 'GitHub search hits predicted from {} until {}'.format(start_date, end_date)
ax = eval_df.plot(title=title, figsize=figsize)
_ = ax.set_ylabel('# of ipynb files')
We plot the residuals for each model to get a sense of how accurate it is as time marches on.
residual_df = -eval_df.subtract(eval_df.truth, axis=0).dropna().drop('truth', axis=1)
fig, ax = plt.subplots(figsize=figsize)
colors = [c['color'] for c in mpl.rcParams['axes.prop_cycle']]
for i, (name, column) in enumerate(residual_df.items()):
    ax.scatter(residual_df.index, column, c=colors[i], label=name)
ax.legend(loc='upper left')
ax.set_ylabel('# of ipynb files')
ax.set_title('Residuals between predicted and truth')
fig.autofmt_xdate()