1 Data inspection

1.1 Libraries and definitions

In [1]:
# Python libs
import warnings
import joblib
from pathlib import Path
from itertools import product
from math import ceil
import numpy as np
import pandas as pd

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.nonparametric.smoothers_lowess import lowess

# Module settings
mpl.rc("figure", dpi=96)
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']  # Show Chinese labels
sns.set()
#pd.set_option('display.max_columns', None)  # show all columns
In [6]:
DATA_INPUT_DIR = "data_input/"
DATA_PROCESSED_DIR = "data_processed/"

TRAIN_CSV = DATA_INPUT_DIR + "train_1.csv"
TRAIN2_CSV = DATA_INPUT_DIR + "train_2.csv"
KEY_1_CSV = DATA_INPUT_DIR + "key_1.csv"
KEY_2_CSV = DATA_INPUT_DIR + "key_2.csv"
SAMPLE_SUBMISSION_1_CSV = DATA_INPUT_DIR + "sample_submission_1.csv"
SAMPLE_SUBMISSION_2_CSV = DATA_INPUT_DIR + "sample_submission_2.csv"

TRAIN_FLAT = DATA_PROCESSED_DIR + "train_flat.pkl"
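Path is imported at the top, so the same constants could also be built with pathlib, which composes paths with / instead of string concatenation (a style sketch only, not a change to the notebook's constants):

```python
from pathlib import Path

DATA_INPUT = Path("data_input")
TRAIN_CSV_P = DATA_INPUT / "train_1.csv"  # equivalent to "data_input/" + "train_1.csv"
```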
In [3]:
def set_xlabel_rotation(ax, deg=90):
    for label in ax.get_xticklabels():
        label.set_rotation(deg)
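The same effect is available as a matplotlib one-liner via tick_params; a minimal demonstration on a throwaway figure (the Agg backend line is only needed outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for non-notebook runs
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
ax.set_xticks([0, 1])

# Equivalent to set_xlabel_rotation(ax, 90):
ax.tick_params(axis='x', labelrotation=90)
```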

1.2 Load data

In [4]:
train = pd.read_csv(TRAIN_CSV)
train.shape
train.info()
train
Out[4]:
(145063, 551)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145063 entries, 0 to 145062
Columns: 551 entries, Page to 2016-12-31
dtypes: float64(550), object(1)
memory usage: 609.8+ MB
Out[4]:
Page 2015-07-01 2015-07-02 2015-07-03 2015-07-04 2015-07-05 2015-07-06 2015-07-07 2015-07-08 2015-07-09 ... 2016-12-22 2016-12-23 2016-12-24 2016-12-25 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
0 2NE1_zh.wikipedia.org_all-access_spider 18.0 11.0 5.0 13.0 14.0 9.0 9.0 22.0 26.0 ... 32.0 63.0 15.0 26.0 14.0 20.0 22.0 19.0 18.0 20.0
1 2PM_zh.wikipedia.org_all-access_spider 11.0 14.0 15.0 18.0 11.0 13.0 22.0 11.0 10.0 ... 17.0 42.0 28.0 15.0 9.0 30.0 52.0 45.0 26.0 20.0
2 3C_zh.wikipedia.org_all-access_spider 1.0 0.0 1.0 1.0 0.0 4.0 0.0 3.0 4.0 ... 3.0 1.0 1.0 7.0 4.0 4.0 6.0 3.0 4.0 17.0
3 4minute_zh.wikipedia.org_all-access_spider 35.0 13.0 10.0 94.0 4.0 26.0 14.0 9.0 11.0 ... 32.0 10.0 26.0 27.0 16.0 11.0 17.0 19.0 10.0 11.0
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 48.0 9.0 25.0 13.0 3.0 11.0 27.0 13.0 36.0 10.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145058 Underworld_(serie_de_películas)_es.wikipedia.o... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 13.0 12.0 13.0 3.0 5.0 10.0
145059 Resident_Evil:_Capítulo_Final_es.wikipedia.org... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145060 Enamorándome_de_Ramón_es.wikipedia.org_all-acc... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145061 Hasta_el_último_hombre_es.wikipedia.org_all-ac... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145062 Francisco_el_matemático_(serie_de_televisión_d... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

145063 rows × 551 columns

  • The train_1 data contains 145063 page/access/agent combinations and spans 550 days, from 2015-07-01 to 2016-12-31.
  • The dataset occupies about 610 MB in memory.

1.3 Missing values

In [5]:
missing_ratio = train[train.columns[1:]].isnull().sum().sum() / train[train.columns[1:]].size
print(f"{missing_ratio*100:.2f}% missing values.")
7.76% missing values.
In [6]:
# For exploratory data analysis, NaN values are left unfilled: 0-filling
# would distort summary statistics and distributions. Imputation is
# deferred to the training stage.
#train.fillna(value=0, inplace=True)

There are about 7.8% missing values in the train_1 dataset; these will be imputed at the modeling stage.
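A minimal sketch of imputation options for that later stage, on a toy frame mimicking the train layout (the choice of strategy is an assumption, not something this notebook has fixed yet):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the train layout: a Page column plus daily counts.
toy = pd.DataFrame({
    "Page": ["a", "b"],
    "2015-07-01": [1.0, np.nan],
    "2015-07-02": [np.nan, 3.0],
    "2015-07-03": [4.0, np.nan],
})

# Simplest option: treat a missing record as zero views.
zero_filled = toy.fillna(0)

# Alternative: forward-fill then back-fill along the time axis,
# which preserves the level across short gaps.
level_filled = toy.set_index("Page").ffill(axis=1).bfill(axis=1)
```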

2 Data Preparation

  • We are going to split the Page column into multiple features and store them in a metadata dataframe.
  • Page-view time series will be stored separately in a time-series dataframe, with dates represented as integer indices.
In [6]:
meta = train['Page'].str.rsplit('_', n=3, expand=True)
meta.columns = ['name', 'project', 'access', 'agent']
meta.head()
Out[6]:
name project access agent
0 2NE1 zh.wikipedia.org all-access spider
1 2PM zh.wikipedia.org all-access spider
2 3C zh.wikipedia.org all-access spider
3 4minute zh.wikipedia.org all-access spider
4 52_Hz_I_Love_You zh.wikipedia.org all-access spider
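rsplit with n=3 splits from the right, so underscores inside page names stay intact; a quick check on the 52_Hz_I_Love_You row above:

```python
page = "52_Hz_I_Love_You_zh.wikipedia.org_all-access_spider"
parts = page.rsplit("_", 3)
# -> ['52_Hz_I_Love_You', 'zh.wikipedia.org', 'all-access', 'spider']
```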
In [7]:
for col in ['project', 'access', 'agent']:
    meta[col].unique()
Out[7]:
array(['zh.wikipedia.org', 'fr.wikipedia.org', 'en.wikipedia.org',
       'commons.wikimedia.org', 'ru.wikipedia.org', 'www.mediawiki.org',
       'de.wikipedia.org', 'ja.wikipedia.org', 'es.wikipedia.org'],
      dtype=object)
Out[7]:
array(['all-access', 'desktop', 'mobile-web'], dtype=object)
Out[7]:
array(['spider', 'all-agents'], dtype=object)
  • 3 project sites: wikipedia.org, wikimedia.org, and mediawiki.org. Most pages are on wikipedia.org, in 7 languages.
  • 3 access types: all-access, desktop, and mobile-web.
  • 2 agent types: spider and all-agents.
In [8]:
lang_project = meta['project'].str.split('.', n=1, expand=True)
lang_project.columns = ['lang', 'site']
meta = pd.concat([meta[['name', 'access', 'agent']], lang_project], axis=1)

meta['lang'] = meta['lang'].str.replace('www', 'mw')
meta['lang'] = meta['lang'].str.replace('commons', 'wm')

meta['lang'].unique()
meta['site'].unique()
Out[8]:
array(['zh', 'fr', 'en', 'wm', 'ru', 'mw', 'de', 'ja', 'es'], dtype=object)
Out[8]:
array(['wikipedia.org', 'wikimedia.org', 'mediawiki.org'], dtype=object)
In [9]:
meta.query('name.str.contains("The_Rolling_Stones")')
Out[9]:
name access agent lang site
6029 The_Rolling_Stones desktop all-agents fr wikipedia.org
8696 Blue_&_Lonesome_(The_Rolling_Stones_album) desktop all-agents en wikipedia.org
12725 The_Rolling_Stones desktop all-agents en wikipedia.org
25420 The_Rolling_Stones all-access all-agents fr wikipedia.org
36536 The_Rolling_Stones all-access spider en wikipedia.org
41456 The_Rolling_Stones all-access all-agents en wikipedia.org
49653 The_Rolling_Stones all-access spider de wikipedia.org
54230 The_Rolling_Stones mobile-web all-agents fr wikipedia.org
68152 The_Rolling_Stones desktop all-agents de wikipedia.org
71361 The_Rolling_Stones desktop all-agents es wikipedia.org
92383 The_Rolling_Stones all-access all-agents es wikipedia.org
96039 The_Rolling_Stones mobile-web all-agents es wikipedia.org
101476 The_Rolling_Stones desktop all-agents ru wikipedia.org
117297 The_Rolling_Stones mobile-web all-agents de wikipedia.org
130495 The_Rolling_Stones all-access spider fr wikipedia.org
140205 The_Rolling_Stones all-access all-agents de wikipedia.org
143608 The_Rolling_Stones all-access spider es wikipedia.org
In [10]:
ts = train[train.columns[5:]]
ts.columns = range(len(ts.columns))
ts
Out[10]:
0 1 2 3 4 5 6 7 8 9 ... 536 537 538 539 540 541 542 543 544 545
0 14.0 9.0 9.0 22.0 26.0 24.0 19.0 10.0 14.0 15.0 ... 32.0 63.0 15.0 26.0 14.0 20.0 22.0 19.0 18.0 20.0
1 11.0 13.0 22.0 11.0 10.0 4.0 41.0 65.0 57.0 38.0 ... 17.0 42.0 28.0 15.0 9.0 30.0 52.0 45.0 26.0 20.0
2 0.0 4.0 0.0 3.0 4.0 4.0 1.0 1.0 1.0 6.0 ... 3.0 1.0 1.0 7.0 4.0 4.0 6.0 3.0 4.0 17.0
3 4.0 26.0 14.0 9.0 11.0 16.0 16.0 11.0 23.0 145.0 ... 32.0 10.0 26.0 27.0 16.0 11.0 17.0 19.0 10.0 11.0
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 48.0 9.0 25.0 13.0 3.0 11.0 27.0 13.0 36.0 10.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145058 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 13.0 12.0 13.0 3.0 5.0 10.0
145059 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145060 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145061 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145062 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

145063 rows × 546 columns
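joblib is imported at the top and TRAIN_FLAT points at a processed-data pickle, so the flattened frames can be persisted here for reuse. A hedged sketch on toy stand-ins (the dict payload and the demo filename are assumptions, not the notebook's actual format):

```python
import joblib
import pandas as pd

# Toy stand-ins for the meta and ts frames built above.
meta_demo = pd.DataFrame({"name": ["2NE1"], "lang": ["zh"]})
ts_demo = pd.DataFrame([[18.0, 11.0, 5.0]])

path = "train_flat_demo.pkl"  # stand-in for TRAIN_FLAT
joblib.dump({"meta": meta_demo, "ts": ts_demo}, path)

restored = joblib.load(path)
```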

3 Dataset overview

This section looks into the overall characteristics of the dataset: first the distribution of metadata classes, then the statistics of the time series over the time horizon.

3.1 Overview of metadata

In [11]:
fig, ax = plt.subplots(1, 2, figsize=(12, 3.5))
_ = sns.countplot(data=meta, x='agent', ax=ax.flatten()[0])
_ = sns.countplot(data=meta, x='access', ax=ax.flatten()[1])
fig, ax = plt.subplots(figsize=(12, 3.5))
_ = sns.countplot(data=meta, x='lang')

Counts of pages:

  • all-agents and all-access are the dominant agent and access types.
  • There are slightly more mobile-web entries than desktop entries.
  • On wikipedia.org, English, Japanese, and German are the top 3 languages.
  • wikimedia.org and mediawiki.org have far fewer pages than any single wikipedia.org language.
In [103]:
meta_ts = pd.concat([meta, ts], axis=1)

fig, axes = plt.subplots(1, 2, figsize=(12, 3.5))
ax = meta_ts.groupby('agent').mean().T.plot(ax=axes.flatten()[0])
_ = ax.set(xlabel='day', ylabel='Mean of views')
ax = meta_ts.groupby('access').mean().T.plot(ax=axes.flatten()[1])
_ = ax.set(xlabel='day', ylabel='Mean of views')
ax = meta_ts.groupby('lang').mean().T.plot(figsize=(12, 3.5))
_ = ax.set(xlabel='day', ylabel='Mean of views')
  • In terms of average views per day, all-agents views are much higher than spider views; the 3 access types are close, with desktop slightly higher.
  • Over the whole horizon, English page views are much higher than those of other languages, with Spanish second.
  • For some reason, English and Russian wikipedia views jumped up in summer 2016.

3.2 Statistics of time series

In [46]:
ts_stats = pd.DataFrame({
    'mean': ts.mean(axis=1),
    'std': ts.std(axis=1),
    'max': ts.max(axis=1),
    'min': ts.min(axis=1),
})
In [40]:
fig, ax = plt.subplots(1, 3, figsize=(16, 4));
p = sns.histplot(data=ts_stats[ts_stats['mean']>0]['mean'], kde=True, ax=ax.flatten()[0], log_scale=True)
p = sns.histplot(data=ts_stats[ts_stats['max']>0]['max'], kde=True, ax=ax.flatten()[1], log_scale=True)
p = sns.histplot(data=ts_stats[ts_stats['min']>0]['min'], kde=True, ax=ax.flatten()[2], log_scale=True)
In [51]:
fig, ax = plt.subplots(1, 2, figsize=(16, 4))
p = sns.histplot(data=ts_stats[ts_stats['std']>0]['std'], kde=True, ax=ax.flatten()[0], log_scale=True)
ts_std_norm = ts_stats[ts_stats['std']>0]['std']/ts_stats[ts_stats['std']>0]['mean']
ts_std_norm.name = 'std (mean normalized)'
p = sns.histplot(data=ts_std_norm, kde=True, ax=ax.flatten()[1], log_scale=True)

Distributions of view statistics:

  • Mean and max views are bimodally distributed across a wide range, from 1 to about 1e+5 views.
  • The normalized standard deviation (std/mean) has a right-skewed distribution.
In [88]:
ts_stats_lang_all = {'mean': {}, 'max': {}, 'std': {}}

for lang in meta['lang'].unique():
        ts_stats_lang = ts_stats.loc[meta[meta['lang'] == lang].index]
        ts_stats_lang['std'] = ts_stats_lang['std'] / ts_stats_lang['mean']
        for stats_name in ts_stats_lang_all:
            ts_stats_lang_all[stats_name][lang] = ts_stats_lang[ts_stats_lang[stats_name] > 0][stats_name]
In [106]:
fig, axes = plt.subplots(3, 2, figsize=(16, 12))
for i, stats_name in enumerate(ts_stats_lang_all):
    for j, mult_type in enumerate(('layer', 'stack')):
        ax = sns.kdeplot(data=ts_stats_lang_all[stats_name], log_scale=True, multiple=mult_type, ax=axes[i][j])
        _ = ax.set(xlabel=f"{stats_name} of views")
  • English pages have the highest mean views, as expected.
  • The Spanish and Japanese view distributions are more widely spread.
  • Chinese pages are viewed less than those in other languages, with lower variability.
In [161]:
ax = plt.subplot()
_ = ax.set(xscale='log', yscale='log', xlabel='max_of_views', ylabel='mean_of_views')
ax = sns.scatterplot(x=ts_stats['max'].replace(0, np.nan), y=ts_stats['mean'].replace(0, np.nan), ax=ax)
  • Mean and max of views are strongly correlated.
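The visual relationship can be quantified with a rank correlation, which is insensitive to the log scaling used in the plot; a sketch on a toy frame standing in for ts_stats:

```python
import pandas as pd

# Toy stand-in for ts_stats: a monotone relationship between mean and max.
stats = pd.DataFrame({
    "mean": [10.0, 100.0, 1000.0, 5.0],
    "max": [50.0, 900.0, 20000.0, 9.0],
})

# Spearman rank correlation is robust to the heavy right skew of views.
rho_rank = stats["max"].corr(stats["mean"], method="spearman")
# close to 1.0 for this perfectly monotone toy data
```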
In [159]:
meta.loc[ts_stats['max'].sort_values(ascending=False)[:30].index]
Out[159]:
name access agent lang site
38573 Main_Page all-access all-agents en wikipedia.org
9774 Main_Page desktop all-agents en wikipedia.org
99322 Заглавная_страница all-access all-agents ru wikipedia.org
103123 Заглавная_страница desktop all-agents ru wikipedia.org
39180 Special:Search all-access all-agents en wikipedia.org
10403 Special:Search desktop all-agents en wikipedia.org
33644 Main_Page all-access spider en wikipedia.org
74114 Main_Page mobile-web all-agents en wikipedia.org
39945 David_Bowie all-access all-agents en wikipedia.org
41072 Donald_Trump all-access all-agents en wikipedia.org
34257 Special:Search all-access spider en wikipedia.org
40563 Prince_(musician) all-access all-agents en wikipedia.org
40930 404.php all-access all-agents en wikipedia.org
12815 404.php desktop all-agents en wikipedia.org
41830 Web_scraping all-access all-agents en wikipedia.org
12124 Web_scraping desktop all-agents en wikipedia.org
38692 Muhammad_Ali all-access all-agents en wikipedia.org
40058 George_Michael all-access all-agents en wikipedia.org
37758 Debbie_Reynolds all-access all-agents en wikipedia.org
75440 David_Bowie mobile-web all-agents en wikipedia.org
139119 Wikipedia:Hauptseite all-access all-agents de wikipedia.org
92205 Wikipedia:Portada all-access all-agents es wikipedia.org
73348 Donald_Trump mobile-web all-agents en wikipedia.org
26993 Organisme_de_placement_collectif_en_valeurs_mo... all-access all-agents fr wikipedia.org
7213 Organisme_de_placement_collectif_en_valeurs_mo... desktop all-agents fr wikipedia.org
39726 Alan_Rickman all-access all-agents en wikipedia.org
76038 Prince_(musician) mobile-web all-agents en wikipedia.org
74241 Muhammad_Ali mobile-web all-agents en wikipedia.org
76563 George_Michael mobile-web all-agents en wikipedia.org
37600 Carrie_Fisher all-access all-agents en wikipedia.org
In [158]:
top_index = ts_stats[(ts_stats['max']-ts_stats['mean'] > 1e5) & (ts_stats['mean'] > 1e5)].index
pd.concat([meta.loc[top_index], ts_stats.loc[top_index]], axis=1).sort_values('max', ascending=False)[:30]
Out[158]:
name access agent lang site mean std max min
38573 Main_Page all-access all-agents en wikipedia.org 2.195061e+07 9.103455e+06 67264258.0 13658940.0
9774 Main_Page desktop all-agents en wikipedia.org 1.598356e+07 9.557580e+06 62288712.0 8091010.0
99322 Заглавная_страница all-access all-agents ru wikipedia.org 1.979114e+06 2.866695e+06 17846030.0 865616.0
103123 Заглавная_страница desktop all-agents ru wikipedia.org 1.355855e+06 2.889305e+06 17332270.0 343105.0
39180 Special:Search all-access all-agents en wikipedia.org 2.374866e+06 1.029369e+06 16991932.0 1421005.0
10403 Special:Search desktop all-agents en wikipedia.org 1.842565e+06 9.576987e+05 16592075.0 1030746.0
33644 Main_Page all-access spider en wikipedia.org 2.361394e+05 8.352532e+05 9162565.0 9538.0
74114 Main_Page mobile-web all-agents en wikipedia.org 5.717804e+06 1.179137e+06 8752306.0 3645577.0
41072 Donald_Trump all-access all-agents en wikipedia.org 1.649075e+05 3.985485e+05 6137438.0 26675.0
34257 Special:Search all-access spider en wikipedia.org 2.276688e+05 6.306632e+05 6008070.0 512.0
12815 404.php desktop all-agents en wikipedia.org 1.716180e+05 5.455065e+05 5783441.0 1.0
40930 404.php all-access all-agents en wikipedia.org 1.716196e+05 5.455115e+05 5783441.0 1.0
41830 Web_scraping all-access all-agents en wikipedia.org 1.066646e+05 3.758615e+05 4656065.0 351.0
12124 Web_scraping desktop all-agents en wikipedia.org 1.064729e+05 3.757935e+05 4655723.0 279.0
139119 Wikipedia:Hauptseite all-access all-agents de wikipedia.org 2.916477e+06 2.576392e+05 3907598.0 2299516.0
92205 Wikipedia:Portada all-access all-agents es wikipedia.org 1.363754e+06 2.323425e+05 3471430.0 921533.0
116196 Wikipedia:Hauptseite mobile-web all-agents de wikipedia.org 2.023058e+06 1.671310e+05 2384391.0 1682740.0
71199 Wikipedia:Portada desktop all-agents es wikipedia.org 3.006991e+05 1.117682e+05 2351834.0 168259.0
27330 Wikipédia:Accueil_principal all-access all-agents fr wikipedia.org 1.578461e+06 1.071834e+05 1845404.0 1025055.0
45056 Special:CreateAccount all-access all-agents wm wikimedia.org 2.598811e+05 2.227637e+05 1695069.0 7731.0
74690 Special:Search mobile-web all-agents en wikipedia.org 5.321748e+05 2.937616e+05 1615355.0 309483.0
67049 Wikipedia:Hauptseite desktop all-agents de wikipedia.org 7.767872e+05 1.746999e+05 1606295.0 421700.0
81644 Special:CreateAccount desktop all-agents wm wikimedia.org 2.321789e+05 1.971370e+05 1520249.0 6505.0
99537 Служебная:Поиск all-access all-agents ru wikipedia.org 1.887819e+05 7.791912e+04 1412292.0 85970.0
103349 Служебная:Поиск desktop all-agents ru wikipedia.org 1.795157e+05 7.756170e+04 1401653.0 77138.0
95855 Wikipedia:Portada mobile-web all-agents es wikipedia.org 1.025066e+06 1.645416e+05 1361277.0 669746.0
55104 Wikipédia:Accueil_principal mobile-web all-agents fr wikipedia.org 1.110382e+06 1.469008e+05 1312566.0 826001.0
39172 Special:Book all-access all-agents en wikipedia.org 2.445100e+05 1.269671e+05 1090022.0 89782.0
10399 Special:Book desktop all-agents en wikipedia.org 2.432174e+05 1.267965e+05 1088563.0 89492.0
40689 Special:RecentChanges all-access all-agents en wikipedia.org 1.166168e+05 1.285356e+05 1064778.0 14790.0

4 Correlation of views ~ day

4.1 Distribution of $\rho$

In [20]:
day_index = pd.Series(range(len(list(ts))))
rho = ts.apply(lambda row: row.corr(day_index), axis=1)
rho
/home/ning/apps/mambaforge/envs/ml/lib/python3.9/site-packages/numpy/lib/function_base.py:2551: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar)
/home/ning/apps/mambaforge/envs/ml/lib/python3.9/site-packages/numpy/lib/function_base.py:2480: RuntimeWarn
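The RuntimeWarnings above come from pages whose series is entirely NaN (covariance over zero observations); a hedged sketch of filtering such rows before computing ρ, on a toy table:

```python
import numpy as np
import pandas as pd

# Toy series table: rows are pages, columns are day indices.
ts_demo = pd.DataFrame([
    [1.0, 2.0, 3.0, 4.0],   # perfectly increasing -> rho = 1
    [4.0, 3.0, 2.0, 1.0],   # decreasing -> rho = -1
    [np.nan] * 4,           # all-NaN row triggers the warning above
])

day_index = pd.Series(range(ts_demo.shape[1]))

# Restricting to rows with at least 2 observations avoids the
# "Degrees of freedom <= 0" RuntimeWarning for all-NaN pages.
valid = ts_demo.notna().sum(axis=1) >= 2
rho_demo = ts_demo[valid].apply(lambda row: row.corr(day_index), axis=1)
```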