Lenta is a Russian food retailer.
The Lenta dataset for uplift modeling contains data about Lenta customers' grocery shopping and related marketing campaigns.
The dataset was originally released for the BIGTARGET Hackathon by LENTA and Microsoft and is accessible from the sklift.datasets module via the fetch_lenta function.
Read more about the dataset in the API docs.
import sys
# install uplift library scikit-uplift and other libraries
!{sys.executable} -m pip install scikit-uplift catboost scikit-learn seaborn matplotlib pandas numpy
from sklift.datasets import fetch_lenta
from sklift.models import ClassTransformation
from sklift.metrics import uplift_at_k
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
%matplotlib inline
# fetch_lenta returns an sklearn Bunch object with data, target and treatment keys:
# data holds the features (pd.DataFrame), target and treatment hold the
# corresponding values (pd.Series)
dataset = fetch_lenta()
print(f"Dataset type: {type(dataset)}\n")
print(f"Dataset features shape: {dataset.data.shape}")
print(f"Dataset target shape: {dataset.target.shape}")
print(f"Dataset treatment shape: {dataset.treatment.shape}")
dataset.keys()
Dataset is a dictionary-like object with the following attributes:
data (DataFrame object): Dataset without target and treatment.
target (Series object): The target column values.
treatment (Series object): The treatment column values.
DESCR (str): Description of the Lenta dataset.
feature_names (list): Names of the features.
target_name (str): Name of the target.
treatment_name (str): Name of the treatment.
Major columns:
group (str): test/control group flag
response_att (binary): target
gender (str): customer gender
age (float): customer age
main_format (int): store type (1 - grocery store, 0 - superstore)
Detailed feature descriptions can be found here.
We can specify the path to the destination folder and the name of the folder where the dataset should be stored with the data_home and dest_subdir parameters. By default the path is /.
# data_home, dest_subdir = "/etc", "data"
# dataset = fetch_lenta(data_home=data_home, dest_subdir=dest_subdir)
We can load and return data, target, and treatment directly by setting the return_X_y_t parameter to True. By default return_X_y_t=False.
# data, target, treatment = fetch_lenta(return_X_y_t=True)
treatment / control
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(15, 4))
treatment = dataset["treatment"]
target = dataset["target"]
sns.countplot(x=treatment, ax=ax[0])
sns.countplot(x=target, ax=ax[1])
The current sample is unbalanced in terms of both treatment and target.
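To put a number on the imbalance rather than reading it off the bar charts, class shares can be computed with `value_counts(normalize=True)`. The sketch below uses a small made-up series (`toy_treatment` is a hypothetical stand-in, not the Lenta data); the same call should work on `dataset.treatment` and `dataset.target`.

```python
import pandas as pd

# Toy stand-in for the treatment column (hypothetical values, not the Lenta data)
toy_treatment = pd.Series(['test', 'test', 'test', 'control'])

# Share of each class; the shares sum to 1
shares = toy_treatment.value_counts(normalize=True)
print(shares)
```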
def crosstab_plot(treatment, target):
    # share of each target value within every treatment group (rows sum to 1)
    ct = pd.crosstab(treatment, target, normalize='index')
    sns.heatmap(ct, annot=True, fmt=".3f", linewidths=.5, square=True, cmap='Blues_r')
    plt.ylabel('Treatment')
    plt.xlabel('Target')
    plt.title("Treatment - Target", size=15)

crosstab_plot(dataset.treatment, dataset.target)
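Because the crosstab is normalised by row, each row holds the response rates within one group, so a naive overall uplift estimate is the difference between the two groups' response rates. A minimal sketch on made-up data (the `toy_*` names and values are assumptions for illustration, not taken from the Lenta dataset):

```python
import pandas as pd

# Toy data (hypothetical, not taken from the Lenta dataset)
toy_treatment = pd.Series(['test'] * 4 + ['control'] * 4)
toy_target = pd.Series([1, 1, 0, 0, 1, 0, 0, 0])

# Row-normalised crosstab: each row sums to 1
toy_ct = pd.crosstab(toy_treatment, toy_target, normalize='index')

# Column 1 holds the share of responders in each group
naive_uplift = toy_ct.loc['test', 1] - toy_ct.loc['control', 1]
print(f"Naive uplift: {naive_uplift:.2f}")
```

This is only a population-level average; the uplift model built later tries to estimate the same difference per customer.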
fig, ax = plt.subplots(1,2, figsize=(15,4))
test_index = dataset.treatment[dataset.treatment == 'test'].index
control_index = dataset.treatment[dataset.treatment == 'control'].index
sns.distplot(dataset.data.loc[test_index, 'response_sms'], label='test', ax=ax[0])
sns.distplot(dataset.data.loc[control_index, 'response_sms'], label='control', ax=ax[0])
ax[0].title.set_text('Test & Control response SMS Distribution')
ax[0].legend()
sns.distplot(dataset.data.loc[test_index, 'age'], label='test', ax=ax[1])
sns.distplot(dataset.data.loc[control_index, 'age'], label='control', ax=ax[1])
ax[1].title.set_text('Test & Control age Distribution')
ax[1].legend()
Clients from the test group tend to respond to SMS with a slightly greater probability than clients from the control group. The age distribution does not differ between the test and control groups.
dataset.data.info()
# DataFrame.append was removed in pandas 2.0, so concatenate head and tail instead
pd.concat([dataset.data.head(), dataset.data.tail()])
# check the number and ratio of NaN values per column
na_counts = dataset.data.isna().sum().sort_values(ascending=False)
pd.DataFrame({"Total": na_counts,
              "Percentage": (na_counts / len(dataset.data)).round(3)}).head(20)
print('Total missing data percentage:',
      round(100 * dataset.data.isna().sum().sum() / (dataset.data.shape[0] * dataset.data.shape[1]), 2), '%')
Transform the categorical columns gender and treatment into binary.
# make treatment binary
treat_dict = {
    'test': 1,
    'control': 0
}
dataset.treatment = dataset.treatment.map(treat_dict)
# make gender binary
gender_dict = {
    'M': 1,
    'Ж': 0
}
dataset.data.gender = dataset.data.gender.map(gender_dict)
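One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes NaN, so it can be worth counting how many new NaN values a mapping introduces. A minimal sketch with placeholder labels (`toy_labels` and `toy_map` are hypothetical and say nothing about the actual labels in the Lenta columns):

```python
import pandas as pd

# Placeholder labels (hypothetical); 'c' is deliberately missing from the mapping
toy_labels = pd.Series(['a', 'b', 'c', 'a'])
toy_map = {'a': 1, 'b': 0}

mapped = toy_labels.map(toy_map)

# NaN values created by the mapping itself, excluding NaN already in the input
n_unmapped = int(mapped.isna().sum() - toy_labels.isna().sum())
print(f"Values without a mapping: {n_unmapped}")
```

Any NaN values created this way would later be filled by whatever imputation step follows, so checking the count before imputing makes such cases visible.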
f = plt.figure(figsize=(19, 15))
plt.matshow(dataset.data.corr(), fignum=f.number)
cb = plt.colorbar()
cb.ax.tick_params(axis=u'both', which=u'both',length=0)
plt.title('Correlation Matrix', fontsize=16);
Intuition:
In a binary classification problem we stratify the train/validation split on the 0/1 target column. In uplift modeling we have two columns to stratify on instead of one: treatment and target.
stratify_cols = pd.concat([dataset.treatment, dataset.target], axis=1)
X_train, X_val, trmnt_train, trmnt_val, y_train, y_val = train_test_split(
    dataset.data,
    dataset.treatment,
    dataset.target,
    stratify=stratify_cols,
    test_size=0.3,
    random_state=42
)
print(f"Train shape: {X_train.shape}")
print(f"Validation shape: {X_val.shape}")
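To check that stratifying on both columns does what the intuition says, one can compare the share of each treatment/target combination in the two splits. The sketch below does this on a small synthetic frame (`toy` and `combo_shares` are hypothetical names used only for illustration); on the real data the same comparison would use `trmnt_train`/`y_train` and `trmnt_val`/`y_val`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with four treatment/target combinations (hypothetical data)
toy = pd.DataFrame({
    'treatment': [1] * 60 + [0] * 40,
    'target': [1] * 20 + [0] * 40 + [1] * 10 + [0] * 30,
})

# Stratify on both columns at once, as in the split above
toy_train, toy_val = train_test_split(
    toy, stratify=toy[['treatment', 'target']], test_size=0.3, random_state=42
)

def combo_shares(df):
    # Share of rows falling into each (treatment, target) combination
    return df.groupby(['treatment', 'target']).size() / len(df)

print(combo_shares(toy_train))
print(combo_shares(toy_val))
```

If the stratification worked, the two printed tables should be nearly identical.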
# fill missing values with the most frequent value of each column
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
estimator = CatBoostClassifier(verbose=100,
                               random_state=42,
                               thread_count=1)
ct_model = ClassTransformation(estimator=estimator)
my_pipeline = Pipeline([
    ('imputer', imp_mode),
    ('model', ct_model)
])
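For readers unfamiliar with the class transformation approach behind `ClassTransformation`, the idea can be sketched in a few lines: build a new binary label that is 1 when treatment and target agree (treated and responded, or control and did not respond), fit an ordinary classifier on it, and convert the predicted probability into an uplift score as 2 * P(Z = 1 | X) - 1. The snippet below is a minimal illustration under the usual assumption of binary treatment and target; the names `y`, `w`, `z` and `uplift_from_proba` are hypothetical, and this is not meant as a copy of the library's internal code.

```python
import numpy as np

# Toy target and treatment flags (hypothetical values)
y = np.array([1, 0, 1, 0])
w = np.array([1, 1, 0, 0])

# Transformed label: 1 when target and treatment agree, 0 otherwise
z = (y == w).astype(int)
print(z)

def uplift_from_proba(p_z):
    # Map P(Z = 1 | X) from a classifier trained on z to an uplift score in [-1, 1]
    return 2 * np.asarray(p_z, dtype=float) - 1

print(uplift_from_proba([0.5, 0.75]))
```

A probability of 0.5 maps to zero uplift, which is why the approach is usually described as relying on reasonably balanced treatment and control groups.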
This is the usual pipeline fit, but with an additional treatment parameter: model__treatment=trmnt_train.
my_pipeline = my_pipeline.fit(
    X=X_train,
    y=y_train,
    model__treatment=trmnt_train
)
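Predict uplift and calculate a basic uplift metric, uplift@30%: the uplift among the 30% of customers with the highest predicted uplift. Read more about the metric in the docs.
uplift_predictions = my_pipeline.predict(X_val)
# k=0.3 is the default share of top-ranked customers; it is spelled out here for clarity
uplift_30 = uplift_at_k(y_val, uplift_predictions, trmnt_val, strategy='overall', k=0.3)
print(f'uplift@30%: {uplift_30:.4f}')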
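To make the metric less of a black box, here is a rough sketch of the idea behind uplift at k with an "overall" ranking: sort all customers by predicted uplift, keep the top share k, and take the difference between the mean response of treated and control customers in that slice. The function `toy_uplift_at_k` and its inputs are hypothetical and written for illustration only; they are not the library's implementation and skip input validation.

```python
import numpy as np

def toy_uplift_at_k(y_true, uplift, treatment, k=0.3):
    # Illustrative sketch: assumes binary y_true/treatment and at least one
    # treated and one control customer inside the top-k slice
    y_true, uplift, treatment = map(np.asarray, (y_true, uplift, treatment))
    order = np.argsort(uplift, kind='mergesort')[::-1]  # highest predicted uplift first
    n_top = int(len(y_true) * k)
    top_y = y_true[order][:n_top]
    top_t = treatment[order][:n_top]
    # Response rate of treated minus response rate of control within the slice
    return top_y[top_t == 1].mean() - top_y[top_t == 0].mean()

# Made-up scores, treatment flags and responses for ten customers
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_trmnt = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
toy_y = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

toy_result = toy_uplift_at_k(toy_y, toy_scores, toy_trmnt, k=0.4)
print(f"toy uplift at 40%: {toy_result:.2f}")
```

A higher value means the model ranks customers who respond because of the treatment near the top of the list.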