This dataset is released along with the paper “A Large Scale Benchmark for Uplift Modeling” by Eustache Diemert, Artem Betlei, Christophe Renaudin (Criteo AI Lab) and Massih-Reza Amini (LIG, Grenoble INP), published at the AdKDD 2018 Workshop, in conjunction with KDD 2018.
This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising.
Here is a detailed description of the fields of the original dataset:
f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
treatment: treatment group; flag indicating whether a company participates in the RTB auction for a particular user (binary: 1 = treated, 0 = control)
exposure: treatment effect; flag indicating whether the user has been effectively exposed, i.e. whether a company won the RTB auction for the user (binary)
conversion: whether a conversion occurred for this user (binary, label)
visit: whether a visit occurred for this user (binary, label)
import sys
# install uplift library scikit-uplift and other libraries
!{sys.executable} -m pip install scikit-uplift dill lightgbm
from sklift.datasets import fetch_criteo
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklift.models import TwoModels
import lightgbm as lgb
from sklift.metrics import qini_auc_score
from sklift.viz import plot_qini_curve
seed = 31
The dataset can be loaded with the fetch_criteo function from the sklift.datasets module. The function takes a few parameters that control, among other things, which target and treatment columns are returned.
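For example, a minimal sketch of loading with non-default parameters (the target_col and percent10 parameter names are taken from the sklift documentation; verify them against your installed version):
# hypothetical call: load the 'conversion' label and only a 10% sample
# (parameter names assumed from the sklift docs)
small_dataset = fetch_criteo(target_col='conversion', percent10=True)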
Let's load the dataset with default parameters (target = 'visit', treatment = 'treatment').
# returns sklearn Bunch object
# with data, target, treatment keys
# data features (pd.DataFrame), target (pd.Series), treatment (pd.Series) values
dataset = fetch_criteo()
print(f"Dataset type: {type(dataset)}\n")
print(f"Dataset features shape: {dataset.data.shape}")
print(f"Dataset target shape: {dataset.target.shape}")
print(f"Dataset treatment shape: {dataset.treatment.shape}")
Let's have a look at the data features.
dataset.data.info()
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
pd.concat([dataset.data.head(), dataset.data.tail()])
dataset.data.describe()
print('Number NA:', dataset.data.isna().sum().sum())
Note that all 12 features are dense float values, and the NA count above confirms there are no missing values. Let's also take a look at the target and treatment columns.
sns.countplot(x=dataset.treatment)
dataset.treatment.value_counts()
sns.countplot(x=dataset.target)
dataset.target.value_counts()
pd.crosstab(dataset.treatment, dataset.target, normalize='index')
Note that both the target and the treatment groups are noticeably imbalanced.
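As a quick sanity check before modeling, we can compute the naive uplift estimate: the difference in mean response between the treated and control groups (a sketch reusing the Bunch fields loaded above):
# naive uplift estimate: treated response rate minus control response rate
response_rates = dataset.target.groupby(dataset.treatment).mean()
print(f"Naive uplift estimate: {response_rates[1] - response_rates[0]:.4f}")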
To make the dataset friendlier to low-memory environments, let's downcast the numeric columns.
for c in dataset.data.columns:
dataset.data[c] = pd.to_numeric(dataset.data[c], downcast='float')
dataset.treatment = dataset.treatment.astype('int8')
dataset.target = dataset.target.astype('int8')
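We can verify that the downcasting actually reduced the footprint:
# total memory used by the feature matrix after downcasting
print(f"Features memory: {dataset.data.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB")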
In a standard binary classification problem we stratify the train/validation split on the 0/1 target column. In uplift modeling we stratify on two columns instead of one: the target and the treatment flag.
stratify_cols = pd.concat([dataset.treatment, dataset.target], axis=1)
X_train, X_val, trmnt_train, trmnt_val, y_train, y_val = train_test_split(
dataset.data,
dataset.treatment,
dataset.target,
stratify=stratify_cols,
test_size=0.3,
random_state=seed
)
print(f"Train shape: {X_train.shape}")
print(f"Validation shape: {X_val.shape}")
treatment_model = lgb.LGBMClassifier(random_state=seed)
control_model = lgb.LGBMClassifier(random_state=seed)
tm = TwoModels(estimator_trmnt=treatment_model, estimator_ctrl=control_model, method='vanilla')
tm = tm.fit(X_train, y_train, trmnt_train)
uplift_tm = tm.predict(X_val)
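Under the hood, the 'vanilla' method fits the two classifiers independently on the treated and control rows and scores uplift as the difference of their predicted probabilities. A minimal hand-rolled sketch of the same idea (model_t and model_c are illustrative names):
# fit separate models on treated and control rows
treated = trmnt_train == 1
model_t = lgb.LGBMClassifier(random_state=seed).fit(X_train[treated], y_train[treated])
model_c = lgb.LGBMClassifier(random_state=seed).fit(X_train[~treated], y_train[~treated])
# uplift = P(y=1 | treated) - P(y=1 | control)
uplift_manual = model_t.predict_proba(X_val)[:, 1] - model_c.predict_proba(X_val)[:, 1]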
# AUQC = area under Qini curve = Qini coefficient
auqc = qini_auc_score(y_val, uplift_tm, trmnt_val)
print(f"Qini coefficient on full data: {auqc:.4f}")
# perfect=True adds the ideal Qini curve (red line) to the plot
plot_qini_curve(y_val, uplift_tm, trmnt_val, perfect=True);
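The uplift curve can be drawn the same way with plot_uplift_curve from sklift.viz (assumed to mirror plot_qini_curve's signature; verify against your sklift version):
from sklift.viz import plot_uplift_curve

# uplift curve with the ideal curve shown for reference
plot_uplift_curve(y_val, uplift_tm, trmnt_val, perfect=True);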