Lenta Uplift Modeling Dataset¶

Lenta is a Russian food retailer.

The Lenta dataset for uplift modeling contains data about Lenta customers' grocery shopping and related marketing campaigns.

The dataset was originally released for the BIGTARGET Hackathon by LENTA and Microsoft and is accessible from the sklift.datasets module via the fetch_lenta function.

Read more about the dataset in the API docs.

Load Lenta dataset¶

In [ ]:

import sys

# install uplift library scikit-uplift and other libraries 
!{sys.executable} -m pip install scikit-uplift catboost scikit-learn seaborn matplotlib pandas numpy

In [2]:

from sklift.datasets import fetch_lenta
from sklift.models import ClassTransformation
from sklift.metrics import uplift_at_k
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)

%matplotlib inline

In [3]:

# returns sklearn Bunch object
# with data, target, treatment keys
# data features (pd.DataFrame), target (pd.Series), treatment (pd.Series) values 
dataset = fetch_lenta()

print(f"Dataset type: {type(dataset)}\n")
print(f"Dataset features shape: {dataset.data.shape}")
print(f"Dataset target shape: {dataset.target.shape}")
print(f"Dataset treatment shape: {dataset.treatment.shape}")

dataset.keys()

Dataset type: <class 'sklearn.utils.Bunch'>

Dataset features shape: (687029, 193)
Dataset target shape: (687029,)
Dataset treatment shape: (687029,)

Out[3]:

dict_keys(['data', 'target', 'treatment', 'DESCR', 'feature_names', 'target_name', 'treatment_name'])

The dataset is a dictionary-like object with the following attributes:

data (DataFrame object): Dataset without target and treatment.
target (Series object): Values of the target column.
treatment (Series object): Values of the treatment column.
DESCR (str): Description of the Lenta dataset.
feature_names (list): Names of the features.
target_name (str): Name of the target.
treatment_name (str): Name of the treatment.

Major columns:

treatment group (str): test/control group flag
target response_att (binary): target
data gender (str): customer gender
data age (float): customer age
data main_format (int): store type (1 - grocery store, 0 - superstore)

A detailed feature description can be found here.

We can specify the path to the destination folder and the name of the folder where the dataset should be stored with data_home and dest_subdir parameters. By default the path is /.

In [4]:

# data_home, dest_subdir = "/etc", "data"
# dataset = fetch_lenta(data_home=data_home, dest_subdir=dest_subdir)

We can load and return data, target, and treatment as separate objects by setting the return_X_y_t parameter to True. By default return_X_y_t=False.

In [5]:

# data, target, treatment = fetch_lenta(return_X_y_t=True)

Target share for `treatment / control`¶

In [6]:

fig, ax = plt.subplots(1,2, sharey=True, figsize=(15,4))

treatment = dataset["treatment"]
target = dataset["target"]

sns.countplot(x=treatment, ax=ax[0])
sns.countplot(x=target, ax=ax[1])

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d634c2400>

The current sample is unbalanced in terms of both treatment and target.

In [7]:

def crosstab_plot(treatment, target):
    ct = pd.crosstab(treatment, target, normalize='index')
    
    sns.heatmap(ct, annot=True, fmt=".3f", linewidths=.5, square=True, cmap='Blues_r')
    plt.ylabel('Treatment')
    plt.xlabel('Target')
    plt.title("Treatment - Target", size=15)
    
crosstab_plot(dataset.treatment, dataset.target)
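
Each row of the crosstab normalized by index is the share of target values within one treatment group. The following is a minimal sketch of the same calculation in plain Python on made-up toy data (not the Lenta data); the helper name response_rate_by_group is hypothetical. If treatment was assigned at random, the difference between the two response rates can serve as a rough estimate of the average uplift.

```python
from collections import defaultdict


def response_rate_by_group(treatment, target):
    """Share of target == 1 within each treatment group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, y in zip(treatment, target):
        totals[group] += 1
        positives[group] += y
    return {group: positives[group] / totals[group] for group in totals}


# Toy data for illustration only.
toy_treatment = ["test", "test", "test", "test",
                 "control", "control", "control", "control"]
toy_target = [1, 1, 0, 0, 1, 0, 0, 0]

rates = response_rate_by_group(toy_treatment, toy_target)
naive_uplift = rates["test"] - rates["control"]

print(rates)         # response rate per group
print(naive_uplift)  # difference between test and control response rates
```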

Distributions of some features by treatment¶

In [8]:

fig, ax = plt.subplots(1,2, figsize=(15,4))

test_index = dataset.treatment[dataset.treatment == 'test'].index
control_index = dataset.treatment[dataset.treatment == 'control'].index

sns.distplot(dataset.data.loc[test_index, 'response_sms'], label='test', ax=ax[0])
sns.distplot(dataset.data.loc[control_index, 'response_sms'], label='control', ax=ax[0])
ax[0].title.set_text('Test & Control response SMS Distribution')
ax[0].legend()

sns.distplot(dataset.data.loc[test_index, 'age'], label='test', ax=ax[1])
sns.distplot(dataset.data.loc[control_index, 'age'], label='control', ax=ax[1])
ax[1].title.set_text('Test & Control age Distribution')
ax[1].legend()

Out[8]:

<matplotlib.legend.Legend at 0x7f9d399d1100>

Clients from the test group tend to respond to SMS with a slightly higher probability than clients from the control group. Client age does not differ noticeably between the test and control groups.

Data analysis¶

In [9]:

dataset.data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687029 entries, 0 to 687028
Columns: 193 entries, age to stdev_discount_depth_1m
dtypes: float64(191), int64(1), object(1)
memory usage: 1011.6+ MB

In [10]:

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
pd.concat([dataset.data.head(), dataset.data.tail()])

Out[10]:

	age	cheque_count_12m_g20	cheque_count_12m_g21	cheque_count_12m_g25	cheque_count_12m_g32	cheque_count_12m_g33	cheque_count_12m_g38	cheque_count_12m_g39	cheque_count_12m_g41	cheque_count_12m_g42	cheque_count_12m_g45	cheque_count_12m_g46	cheque_count_12m_g48	cheque_count_12m_g52	cheque_count_12m_g56	cheque_count_12m_g57	cheque_count_12m_g58	cheque_count_12m_g79	cheque_count_3m_g20	cheque_count_3m_g21	cheque_count_3m_g25	cheque_count_3m_g42	cheque_count_3m_g45	cheque_count_3m_g52	cheque_count_3m_g56	cheque_count_3m_g57	cheque_count_3m_g79	cheque_count_6m_g20	cheque_count_6m_g21	cheque_count_6m_g25	cheque_count_6m_g32	cheque_count_6m_g33	cheque_count_6m_g38	cheque_count_6m_g39	cheque_count_6m_g40	cheque_count_6m_g41	cheque_count_6m_g42	cheque_count_6m_g45	cheque_count_6m_g46	cheque_count_6m_g48	cheque_count_6m_g52	cheque_count_6m_g56	cheque_count_6m_g57	cheque_count_6m_g58	cheque_count_6m_g79	children	crazy_purchases_cheque_count_12m	crazy_purchases_cheque_count_1m	crazy_purchases_cheque_count_3m	crazy_purchases_cheque_count_6m	crazy_purchases_goods_count_12m	crazy_purchases_goods_count_6m	disc_sum_6m_g34	food_share_15d	food_share_1m	gender	k_var_cheque_15d	k_var_cheque_3m	k_var_cheque_category_width_15d	k_var_cheque_group_width_15d	k_var_count_per_cheque_15d_g24	k_var_count_per_cheque_15d_g34	k_var_count_per_cheque_1m_g24	k_var_count_per_cheque_1m_g27	k_var_count_per_cheque_1m_g34	k_var_count_per_cheque_1m_g44	k_var_count_per_cheque_1m_g49	k_var_count_per_cheque_3m_g24	k_var_count_per_cheque_3m_g27	k_var_count_per_cheque_3m_g32	k_var_count_per_cheque_3m_g34	k_var_count_per_cheque_3m_g41	k_var_count_per_cheque_3m_g44	k_var_count_per_cheque_6m_g24	k_var_count_per_cheque_6m_g27	k_var_count_per_cheque_6m_g32	k_var_count_per_cheque_6m_g44	k_var_days_between_visits_15d	k_var_days_between_visits_1m	k_var_days_between_visits_3m	k_var_disc_per_cheque_15d	k_var_disc_share_12m_g32	k_var_disc_share_15d_g24	k_var_disc_share_15d_g34	k_var_disc_share_15d_g49	k_var_disc_share_1m_g24	
k_var_disc_share_1m_g27	k_var_disc_share_1m_g34	k_var_disc_share_1m_g40	k_var_disc_share_1m_g44	k_var_disc_share_1m_g49	k_var_disc_share_1m_g54	k_var_disc_share_3m_g24	k_var_disc_share_3m_g26	k_var_disc_share_3m_g27	k_var_disc_share_3m_g32	k_var_disc_share_3m_g33	k_var_disc_share_3m_g34	k_var_disc_share_3m_g38	k_var_disc_share_3m_g40	k_var_disc_share_3m_g41	k_var_disc_share_3m_g44	k_var_disc_share_3m_g46	k_var_disc_share_3m_g48	k_var_disc_share_3m_g49	k_var_disc_share_3m_g54	k_var_disc_share_6m_g24	k_var_disc_share_6m_g27	k_var_disc_share_6m_g32	k_var_disc_share_6m_g34	k_var_disc_share_6m_g44	k_var_disc_share_6m_g46	k_var_disc_share_6m_g49	k_var_disc_share_6m_g54	k_var_discount_depth_15d	k_var_discount_depth_1m	k_var_sku_per_cheque_15d	k_var_sku_price_12m_g32	k_var_sku_price_15d_g34	k_var_sku_price_15d_g49	k_var_sku_price_1m_g24	k_var_sku_price_1m_g26	k_var_sku_price_1m_g27	k_var_sku_price_1m_g34	k_var_sku_price_1m_g40	k_var_sku_price_1m_g44	k_var_sku_price_1m_g49	k_var_sku_price_1m_g54	k_var_sku_price_3m_g24	k_var_sku_price_3m_g26	k_var_sku_price_3m_g27	k_var_sku_price_3m_g32	k_var_sku_price_3m_g33	k_var_sku_price_3m_g34	k_var_sku_price_3m_g40	k_var_sku_price_3m_g41	k_var_sku_price_3m_g44	k_var_sku_price_3m_g46	k_var_sku_price_3m_g48	k_var_sku_price_3m_g49	k_var_sku_price_3m_g54	k_var_sku_price_6m_g24	k_var_sku_price_6m_g26	k_var_sku_price_6m_g27	k_var_sku_price_6m_g32	k_var_sku_price_6m_g41	k_var_sku_price_6m_g42	k_var_sku_price_6m_g44	k_var_sku_price_6m_g48	k_var_sku_price_6m_g49	main_format	mean_discount_depth_15d	months_from_register	perdelta_days_between_visits_15_30d	promo_share_15d	response_sms	response_viber	sale_count_12m_g32	sale_count_12m_g33	sale_count_12m_g49	sale_count_12m_g54	sale_count_12m_g57	sale_count_3m_g24	sale_count_3m_g33	sale_count_3m_g57	sale_count_6m_g24	sale_count_6m_g25	sale_count_6m_g32	sale_count_6m_g33	sale_count_6m_g44	sale_count_6m_g54	sale_count_6m_g57	sale_sum_12m_g24	sale_sum_12m_g25	sale_sum_12m_g26	sale_sum_12m_g27	
sale_sum_12m_g32	sale_sum_12m_g44	sale_sum_12m_g54	sale_sum_3m_g24	sale_sum_3m_g26	sale_sum_3m_g32	sale_sum_3m_g33	sale_sum_6m_g24	sale_sum_6m_g25	sale_sum_6m_g26	sale_sum_6m_g32	sale_sum_6m_g33	sale_sum_6m_g44	sale_sum_6m_g54	stdev_days_between_visits_15d	stdev_discount_depth_15d	stdev_discount_depth_1m
0	47.0	3.0	22.0	19.0	3.0	28.0	8.0	7.0	6.0	1.0	13.0	12.0	16.0	3.0	15.0	11.0	0.0	4.0	0.0	7.0	8.0	0.0	5.0	1.0	6.0	6.0	1.0	0.0	12.0	9.0	1.0	6.0	4.0	2.0	5.0	1.0	0.0	5.0	5.0	6.0	1.0	6.0	9.0	0.0	1.0	0.0	13.0	3.0	5.0	8.0	16.0	11.0	153.09	0.6488	0.3254	Ж	0.7288	1.8741	0.5263	0.7692	NaN	NaN	0.2917	NaN	0.6682	0.5592	0.400	0.5871	0.4654	NaN	0.6055	0.0000	0.5590	0.6183	0.4845	NaN	0.5471	0.4554	0.6479	0.8240	1.4055	1.4080	NaN	NaN	NaN	0.5208	NaN	0.5462	NaN	0.1559	0.0449	0.0000	0.8300	0.0115	0.3846	NaN	0.7418	0.5004	1.2014	1.3485	0.0000	1.2304	0.7229	0.5943	1.5156	0.0147	0.8036	0.6366	NaN	0.7793	1.2143	1.0723	1.3947	0.0123	0.4621	0.4864	0.7067	0.0589	NaN	NaN	0.5946	0.0823	NaN	0.1414	NaN	0.8669	0.3707	0.0000	0.7177	0.0866	1.3485	NaN	0.4640	0.3956	0.1930	0.0000	0.8019	0.1895	0.6128	2.1596	0.6810	0.6546	0.1300	1.2374	NaN	NaN	0.0000	0.8756	0.6718	2.0876	0	0.6055	18.0	1.3393	0.5821	0.923077	0.071429	10.0	84.314	98.0	16.0	11.0	137.282	28.776	6.0	169.658	10.680	7.0	28.776	21.0	8.0	9.0	4469.86	658.85	1286.32	7736.05	418.80	3233.31	811.73	2321.61	182.82	283.84	3648.23	3141.25	356.67	237.25	283.84	3648.23	1195.37	535.42	1.7078	0.2798	0.3008
1	57.0	1.0	0.0	2.0	1.0	1.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	2.0	0.0	1.0	0.0	0.0	0.0	1.0	1.0	0.0	2.0	1.0	1.0	1.0	0.0	3.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	55.99	0.0000	1.0000	Ж	0.0000	0.9630	0.0000	0.0000	NaN	NaN	0.0000	0.0	0.0000	0.0000	0.000	0.0000	0.0000	NaN	1.0102	0.0000	0.0000	NaN	NaN	NaN	0.0000	0.0000	0.0000	1.0027	0.0000	NaN	NaN	NaN	NaN	0.0000	0.0	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	0.1094	0.0000	NaN	NaN	1.1289	0.0000	0.6188	0.0000	0.0000	0.0000	NaN	0.4981	0.6382	NaN	NaN	NaN	1.1289	0.0000	0.0000	0.4981	0.6382	0.0000	0.0000	0.0000	NaN	NaN	NaN	0.0000	0.0000	0.0	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	0.2072	0.0000	NaN	NaN	0.3993	0.8333	0.0000	0.0000	0.0000	NaN	0.6192	0.5405	NaN	0.2072	NaN	NaN	NaN	0.0000	0.0000	NaN	0.6192	1	0.0000	4.0	0.0000	0.0000	1.000000	0.000000	1.0	1.000	2.0	2.0	0.0	0.000	1.000	0.0	1.744	2.000	1.0	1.000	0.0	2.0	0.0	113.39	62.69	58.71	93.35	87.01	0.00	122.98	0.00	58.71	87.01	179.83	113.39	62.69	58.71	87.01	179.83	0.00	122.98	0.0000	0.0000	0.0000
2	38.0	7.0	0.0	15.0	4.0	9.0	5.0	9.0	14.0	7.0	6.0	10.0	14.0	5.0	11.0	0.0	3.0	2.0	2.0	0.0	3.0	2.0	1.0	1.0	0.0	0.0	2.0	6.0	0.0	9.0	2.0	5.0	1.0	7.0	7.0	8.0	3.0	2.0	6.0	6.0	3.0	4.0	0.0	0.0	2.0	3.0	0.0	0.0	0.0	0.0	0.0	0.0	290.00	0.3739	0.4768	М	NaN	0.3295	NaN	NaN	0.0	NaN	0.0000	0.0	0.4159	0.8485	0.000	0.0000	0.0302	0.0	0.6009	0.6205	1.0035	0.5712	0.5762	0.4714	0.9830	0.0000	NaN	0.5559	NaN	0.9780	0.0	NaN	NaN	0.0000	0.0	0.0078	NaN	0.8362	1.3183	0.8560	0.0000	0.0000	0.6077	0.0	0.7665	1.2056	NaN	1.0002	0.0541	0.8461	0.5812	0.6945	1.7252	0.7579	0.6608	0.8560	0.9266	1.1554	0.7782	0.7471	2.0674	0.8871	NaN	0.1201	NaN	0.6629	NaN	NaN	0.0000	0.0000	0.0	0.0666	NaN	0.4668	1.3422	0.3536	0.0000	0.0000	0.0100	0.0	0.0457	0.2615	0.5856	0.2870	0.5238	0.2017	0.2840	1.8758	0.6338	0.2654	0.0000	0.4481	0.7673	0.2393	0.2851	0.5170	0.2407	2.5227	0	0.7256	34.0	0.0000	0.7256	1.000000	0.250000	5.0	21.102	50.0	109.0	0.0	0.000	7.594	0.0	25.294	11.084	3.0	11.158	31.0	59.0	0.0	1564.91	971.09	177.93	3257.49	975.21	2555.27	6351.29	0.00	0.00	0.00	783.87	1239.19	533.46	83.37	593.13	1217.43	1336.83	3709.82	0.0000	NaN	0.0803
3	65.0	6.0	3.0	25.0	2.0	10.0	14.0	11.0	8.0	1.0	0.0	2.0	6.0	7.0	2.0	0.0	0.0	0.0	1.0	0.0	5.0	0.0	0.0	1.0	0.0	0.0	0.0	2.0	1.0	11.0	2.0	3.0	5.0	5.0	4.0	2.0	1.0	0.0	1.0	3.0	1.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	3.0	0.0	51.81	0.0000	1.0000	Ж	0.0000	1.4933	0.0000	0.0000	NaN	NaN	0.0000	0.0	0.0000	0.0000	0.000	0.0000	NaN	NaN	0.0000	0.0000	NaN	NaN	0.3295	0.0000	0.0000	0.0000	0.0000	0.7432	0.0000	0.1315	NaN	NaN	NaN	0.0000	0.0	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	NaN	NaN	NaN	0.7904	0.0050	NaN	NaN	0.0000	NaN	0.0000	NaN	0.0166	0.5362	NaN	0.5780	0.1315	0.3219	1.1290	NaN	1.7975	1.2530	0.0000	0.0000	0.0000	0.2354	NaN	NaN	0.0000	0.0000	0.0	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	NaN	NaN	NaN	0.1655	0.6560	NaN	0.0000	NaN	0.0000	NaN	0.1326	0.1477	NaN	0.7469	0.2352	0.2354	0.6846	NaN	0.2671	0.1028	3.0736	1	0.0000	40.0	0.0000	0.0000	0.909091	0.000000	2.0	12.544	49.0	39.0	0.0	0.000	2.778	0.0	2.000	34.212	2.0	3.778	2.0	13.0	0.0	358.22	3798.18	680.93	1425.07	175.73	602.81	3544.76	0.00	119.99	73.24	346.74	139.68	1849.91	360.40	175.73	496.73	172.58	1246.21	0.0000	0.0000	0.0000
4	61.0	0.0	1.0	2.0	0.0	2.0	1.0	0.0	3.0	2.0	1.0	1.0	5.0	5.0	0.0	0.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	2.0	0.0	0.0	1.0	0.0	1.0	2.0	0.0	2.0	1.0	0.0	8.0	2.0	2.0	1.0	1.0	4.0	3.0	0.0	0.0	0.0	1.0	2.0	4.0	1.0	1.0	2.0	4.0	2.0	161.12	0.2882	0.2882	Ж	0.9301	0.9014	0.8165	0.7542	0.0	NaN	0.0000	0.0	NaN	0.0000	0.000	0.0000	0.6682	0.0	0.7781	0.0000	0.0000	0.4826	0.7526	0.0000	NaN	0.4714	0.4714	0.9980	1.3497	0.0000	0.0	NaN	0.0000	0.0000	0.0	NaN	NaN	0.0000	0.0000	1.2273	0.0000	0.0249	0.9172	0.0	NaN	0.6549	0.0000	0.6684	0.4566	0.0000	0.0000	0.0138	1.2899	1.6070	1.0710	0.9058	0.0000	0.5955	NaN	NaN	1.6232	1.3194	0.4903	0.4903	0.9423	0.0000	NaN	0.0000	0.0000	0.0000	0.0	NaN	NaN	0.0000	0.0000	0.2284	0.0000	0.1325	0.2146	0.0	NaN	0.1934	0.5189	0.0058	0.0000	0.0000	0.3868	2.0293	0.6860	0.6113	1.0926	0.2211	0.0000	0.0058	0.2496	NaN	0.2195	1.4917	0	0.7128	20.0	0.0000	0.7865	1.000000	0.100000	0.0	1.454	25.0	25.0	0.0	0.000	0.454	0.0	3.036	12.000	0.0	1.454	8.0	23.0	0.0	226.98	168.05	960.37	1560.21	0.00	342.45	1039.85	0.00	66.18	0.00	87.94	226.98	168.05	461.37	0.00	237.93	225.51	995.27	1.4142	0.3495	0.3495
687024	35.0	0.0	0.0	4.0	0.0	2.0	0.0	1.0	0.0	3.0	2.0	2.0	3.0	2.0	1.0	0.0	1.0	0.0	0.0	0.0	3.0	2.0	1.0	2.0	1.0	0.0	0.0	0.0	0.0	3.0	0.0	2.0	0.0	0.0	5.0	0.0	2.0	2.0	2.0	2.0	2.0	1.0	0.0	1.0	0.0	3.0	0.0	0.0	0.0	0.0	0.0	0.0	130.90	0.6214	0.6924	Ж	1.0931	1.1344	0.8240	0.8281	NaN	0.4714	0.4359	NaN	0.4714	NaN	0.441	0.4359	NaN	0.0	0.4714	0.0000	NaN	0.4359	NaN	0.0000	NaN	0.6614	0.5162	0.5162	1.0710	0.0000	NaN	1.1557	1.1622	0.0058	NaN	1.1557	1.0531	NaN	1.3709	0.0000	0.0058	NaN	NaN	0.0	0.2821	1.1557	0.0000	1.0531	0.0000	NaN	0.8648	1.2152	1.3709	0.0000	0.0058	NaN	0.0000	0.7422	NaN	0.8648	1.4424	NaN	0.6335	0.6177	0.9331	0.0000	0.0437	1.3395	0.0223	NaN	NaN	0.0437	0.5187	NaN	1.6498	0.0000	0.0223	NaN	NaN	0.0	0.0444	0.0437	0.5187	0.0000	NaN	0.4438	1.0347	1.6498	0.0000	0.0223	NaN	NaN	0.0000	0.0000	0.1156	NaN	1.0347	1.7847	0	0.5756	59.0	1.3333	0.4002	0.000000	0.166667	0.0	3.000	14.0	2.0	0.0	19.856	3.000	0.0	19.856	29.000	0.0	3.000	15.0	1.0	0.0	550.09	695.32	111.87	114.21	0.00	1173.84	147.68	550.09	111.87	0.00	330.96	550.09	669.33	111.87	0.00	330.96	1173.84	119.99	2.6458	0.3646	0.3282
687025	33.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.00	0.0000	0.0000	М	0.0000	0.0000	0.0000	0.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	NaN	NaN	0.0000	0.0000	0.0000	0.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.2382	NaN	0	0.0000	4.0	0.0000	0.0000	1.000000	0.000000	0.0	0.000	1.0	1.0	0.0	NaN	NaN	NaN	0.000	0.000	0.0	0.000	0.0	1.0	0.0	0.00	0.00	0.00	0.00	0.00	0.00	28.01	NaN	NaN	NaN	NaN	0.00	0.00	0.00	0.00	0.00	0.00	28.01	0.0000	0.0000	0.0000
687026	36.0	0.0	0.0	3.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.00	0.9847	0.9847	М	NaN	NaN	NaN	NaN	0.0	0.0000	0.0000	0.0	0.0000	NaN	NaN	0.0000	0.0000	0.0	0.0000	0.0000	NaN	0.0000	0.0000	0.0000	NaN	0.0000	0.0000	0.0000	NaN	0.0000	0.0	0.0000	NaN	0.0000	0.0	0.0000	0.0000	NaN	NaN	0.0000	0.0000	0.0000	0.0000	0.0	0.0000	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	NaN	0.0000	NaN	NaN	NaN	0.0000	0.0000	NaN	0.0000	0.0000	0.0	0.0000	0.0000	NaN	NaN	0.0000	0.0000	0.0000	0.0000	0.0	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	NaN	1	0.9847	66.0	0.0000	0.9847	1.000000	0.000000	0.0	0.000	5.0	3.0	0.0	0.000	0.000	0.0	0.000	0.000	0.0	0.000	15.0	0.0	0.0	0.00	155.97	23.99	41.51	0.00	615.77	87.47	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	449.01	0.00	0.0000	NaN	NaN
687027	37.0	0.0	1.0	2.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.00	0.3545	0.7263	М	NaN	0.2269	NaN	NaN	0.0	0.0000	0.0000	0.0	0.0000	0.0000	NaN	0.0000	0.0000	0.0	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	NaN	NaN	NaN	0.0000	0.0	0.0000	0.0000	0.0000	0.0	0.0000	NaN	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	0.0	0.0000	0.0000	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	NaN	NaN	NaN	0.0000	0.0000	0.0000	0.0000	0.0000	0.0	0.0000	NaN	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	0.0	0.0000	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	NaN	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	NaN	0	0.8318	9.0	0.0000	0.8318	1.000000	0.000000	0.0	0.000	1.0	0.0	0.0	0.000	0.000	0.0	0.000	0.476	0.0	0.000	0.0	0.0	0.0	0.00	81.90	29.82	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	46.72	0.00	0.00	0.00	0.00	0.00	0.0000	NaN	NaN
687028	40.0	0.0	1.0	0.0	0.0	2.0	0.0	0.0	2.0	2.0	2.0	2.0	3.0	1.0	1.0	2.0	1.0	4.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	3.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	2.0	2.0	0.0	1.0	1.0	0.0	3.0	3.0	1.0	0.0	0.0	0.0	1.0	0.0	0.00	0.0000	0.0000	Ж	0.0000	0.8408	0.0000	0.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.9895	0.9970	0.0	0.0000	0.0000	0.6667	0.9895	0.9970	0.0000	0.6667	0.0000	0.0000	0.3536	0.0000	0.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.5264	0.0000	0.2564	0.0	NaN	0.0000	0.0000	0.0000	0.0000	1.1059	0.5745	0.2423	0.6481	0.3924	0.5264	0.2564	0.0000	0.0000	1.1059	0.5745	0.6481	0.3924	0.0000	0.0000	0.0000	0.0000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.5179	0.0000	0.2672	0.0	NaN	0.0000	0.0000	0.0000	0.5834	0.1777	0.2541	1.2352	0.3381	0.5179	0.0000	0.2672	0.0000	0.0000	NaN	0.5834	0.2541	1.2352	0	0.0000	13.0	0.0000	0.0000	1.000000	0.100000	0.0	6.452	25.0	17.0	3.0	6.660	1.344	1.0	6.660	0.000	0.0	1.344	18.0	4.0	1.0	531.25	0.00	0.00	916.44	0.00	2407.56	1304.03	290.01	0.00	0.00	228.47	290.01	0.00	0.00	0.00	228.47	752.32	596.86	0.0000	0.0000	0.0000

There are 193 columns in the dataset.
The dataset contains:
- basic information about clients (age, number of children)
- information about some groups of goods
- statistical information (variation of discounts, prices)

Missing values¶

In [11]:

# check NaN values ratio
na_total = dataset.data.isna().sum().sort_values(ascending=False)
pd.DataFrame({"Total": na_total,
              "Percentage": round(na_total / len(dataset.data), 3)}).head(20)

Out[11]:

	Total	Percentage
k_var_sku_price_15d_g49	496259	0.722
k_var_disc_share_15d_g49	496159	0.722
k_var_count_per_cheque_15d_g34	468551	0.682
k_var_sku_price_15d_g34	468551	0.682
k_var_disc_share_15d_g34	468467	0.682
k_var_count_per_cheque_15d_g24	442121	0.644
k_var_disc_share_15d_g24	442054	0.643
k_var_sku_price_1m_g49	414473	0.603
k_var_count_per_cheque_1m_g49	414473	0.603
k_var_disc_share_1m_g49	414369	0.603
k_var_sku_price_1m_g54	388217	0.565
k_var_disc_share_1m_g54	388139	0.565
k_var_sku_price_1m_g34	385078	0.560
k_var_count_per_cheque_1m_g34	385078	0.560
k_var_disc_share_1m_g34	384997	0.560
k_var_sku_price_1m_g44	383315	0.558
k_var_count_per_cheque_1m_g44	383315	0.558
k_var_disc_share_1m_g44	383219	0.558
k_var_sku_price_1m_g40	380641	0.554
k_var_disc_share_1m_g40	380559	0.554

In [12]:

print('Total missing data percentage:', 
      round(100*dataset.data.isna().sum().sum()/(dataset.data.shape[0]*dataset.data.shape[1]), 2), '%')

Total missing data percentage: 19.34 %

Data transformation¶

Transform categorical columns gender and treatment into binary.

In [13]:

# make treatment binary
treat_dict = {
    'test': 1,
    'control': 0
}
dataset.treatment = dataset.treatment.map(treat_dict)

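# make gender binary
# note: the keys must be the Cyrillic letters used in the data ('М' - male, 'Ж' - female);
# a Latin 'M' key would not match and would leave NaN after map
gender_dict = {
    'М': 1,
    'Ж': 0
}
dataset.data.gender = dataset.data.gender.map(gender_dict)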
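
When a column is mapped through a dictionary with pandas Series.map, values without a matching key become NaN without any error. It is therefore worth checking which values were not covered by the mapping. Below is a minimal plain-Python sketch of such a check on a made-up toy column; the helper map_with_check is hypothetical, and None stands in for NaN. The example uses a look-alike pitfall: the Latin letter "M" and the Cyrillic letter "М" are different characters.

```python
def map_with_check(values, mapping):
    """Map values through a dict and report distinct values that had no key."""
    mapped = [mapping.get(value) for value in values]  # None stands in for NaN
    unmapped = sorted({value for value in values if value not in mapping})
    return mapped, unmapped


# Toy column: "\u041c" is Cyrillic "М", "\u0416" is Cyrillic "Ж".
toy_gender = ["\u041c", "\u0416", "\u0416", "\u041c"]

# A mapping keyed by the Latin "M" (U+004D) silently misses the Cyrillic letter.
latin_mapping = {"\u004d": 1, "\u0416": 0}
mapped, unmapped = map_with_check(toy_gender, latin_mapping)

print(mapped)    # entries for the Cyrillic "М" are None
print(unmapped)  # the values that were not covered by the mapping
```

With pandas, a similar check could compare the number of missing values before and after the mapping.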

Feature correlation¶

In [14]:

f = plt.figure(figsize=(19, 15))
plt.matshow(dataset.data.corr(), fignum=f.number)
cb = plt.colorbar()
cb.ax.tick_params(axis=u'both', which=u'both',length=0)
plt.title('Correlation Matrix', fontsize=16);

Train/test split¶

Stratify by two columns: treatment and target.

Intuition: in a binary classification problem we stratify the split by the 0/1 target column so that both parts keep the same class balance. In uplift modeling there are two such columns instead of one, so we stratify by their combination.

In [15]:

stratify_cols = pd.concat([dataset.treatment, dataset.target], axis=1)

X_train, X_val, trmnt_train, trmnt_val, y_train, y_val = train_test_split(
    dataset.data,
    dataset.treatment,
    dataset.target,
    stratify=stratify_cols,
    test_size=0.3,
    random_state=42
)

print(f"Train shape: {X_train.shape}")
print(f"Validation shape: {X_val.shape}")

Train shape: (480920, 193)
Validation shape: (206109, 193)
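
To see what stratifying by two columns means, here is a minimal plain-Python sketch on made-up toy data. It illustrates the idea only and is not scikit-learn's implementation; the helper stratified_split is hypothetical. Every row gets a stratum equal to the pair (treatment, target), and each of the four strata is split in the same proportion.

```python
import random
from collections import Counter, defaultdict


def stratified_split(strata, test_size, seed=42):
    """Split row indices so each stratum keeps about the same share in both parts."""
    rng = random.Random(seed)
    rows_by_stratum = defaultdict(list)
    for row, stratum in enumerate(strata):
        rows_by_stratum[stratum].append(row)

    train_rows, val_rows = [], []
    for rows in rows_by_stratum.values():
        rng.shuffle(rows)
        n_val = round(len(rows) * test_size)
        val_rows.extend(rows[:n_val])
        train_rows.extend(rows[n_val:])
    return sorted(train_rows), sorted(val_rows)


# Toy data: 100 rows, unbalanced in both treatment and target.
toy_treatment = [1] * 60 + [0] * 40
toy_target = [1] * 20 + [0] * 40 + [1] * 10 + [0] * 30

# The stratum of a row is the pair (treatment, target).
strata = list(zip(toy_treatment, toy_target))
train_rows, val_rows = stratified_split(strata, test_size=0.3)

train_counts = Counter(strata[row] for row in train_rows)
val_counts = Counter(strata[row] for row in val_rows)
print(train_counts)
print(val_counts)
```

Each (treatment, target) combination ends up with a 70/30 split, so the validation part has the same joint distribution of treatment and target as the training part.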

Pipeline with CatBoostClassifier¶

In [16]:

imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
estimator = CatBoostClassifier(verbose=100,
                               random_state=42,
                               thread_count=1)
ct_model = ClassTransformation(estimator=estimator)

my_pipeline = Pipeline([
    ('imputer', imp_mode),
    ('model', ct_model)
])
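
For intuition, the class transformation approach can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that ClassTransformation follows the standard class transformation method, not a copy of the library code; both helper names are hypothetical. The classifier is trained on a new binary target Z that equals 1 when a treated client responded or a control client did not, and the uplift is then recovered as 2 * P(Z=1|x) - 1. This identity is exact only when treatment and control are equally likely.

```python
def class_transformation_target(y, treatment):
    """New target Z: 1 if (treated and responded) or (control and did not respond)."""
    return [int(y_i == t_i) for y_i, t_i in zip(y, treatment)]


def uplift_from_proba(p_z):
    """Turn predicted P(Z=1|x) into an uplift score in [-1, 1]."""
    return [2 * p - 1 for p in p_z]


# Toy data for illustration only.
toy_y = [1, 0, 1, 0]
toy_treatment = [1, 1, 0, 0]

z = class_transformation_target(toy_y, toy_treatment)
scores = uplift_from_proba([0.5, 0.75, 0.25])

print(z)
print(scores)
```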

Fit the pipeline as usual, but pass the additional treatment parameter model__treatment=trmnt_train.

In [17]:

my_pipeline = my_pipeline.fit(
    X=X_train, 
    y=y_train, 
    model__treatment=trmnt_train
)

Learning rate set to 0.143939
0:	learn: 0.6695107	total: 421ms	remaining: 7m
100:	learn: 0.5950043	total: 34.2s	remaining: 5m 4s
200:	learn: 0.5908539	total: 1m 5s	remaining: 4m 21s
300:	learn: 0.5870115	total: 1m 39s	remaining: 3m 51s
400:	learn: 0.5835003	total: 2m 13s	remaining: 3m 19s
500:	learn: 0.5800551	total: 2m 47s	remaining: 2m 46s
600:	learn: 0.5768127	total: 3m 21s	remaining: 2m 13s
700:	learn: 0.5736896	total: 3m 54s	remaining: 1m 39s
800:	learn: 0.5706878	total: 4m 27s	remaining: 1m 6s
900:	learn: 0.5676374	total: 5m 2s	remaining: 33.2s
999:	learn: 0.5647908	total: 5m 35s	remaining: 0us
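
The model__treatment keyword follows the scikit-learn Pipeline naming convention: the part before the double underscore is the step name, and the part after it is the parameter passed to that step's fit method. The sketch below is a simplified plain-Python illustration of this convention, not scikit-learn's actual code; the function route_fit_params is hypothetical.

```python
def route_fit_params(step_names, **fit_params):
    """Group 'step__param' keyword arguments by pipeline step name."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        step, separator, param = key.partition("__")
        if not separator or step not in routed:
            raise ValueError(f"Cannot route fit parameter {key!r} to a pipeline step")
        routed[step][param] = value
    return routed


# Step names match the pipeline above; the treatment values are toy data.
routed = route_fit_params(["imputer", "model"], model__treatment=[1, 0, 1])
print(routed)
```

Here the imputer step receives no extra fit parameters, while the model step receives treatment.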

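Predict uplift and calculate a basic uplift metric, uplift@30%: the difference in response rate between treated and control clients among the 30% of clients with the highest predicted uplift. Read more about the metric in the docs.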

In [18]:

uplift_predictions = my_pipeline.predict(X_val)

# k=0.3 is the default share; it is passed explicitly here to match the uplift@30% label
uplift_30 = uplift_at_k(y_val, uplift_predictions, trmnt_val, strategy='overall', k=0.3)
print(f'uplift@30%: {uplift_30:.4f}')

uplift@30%: 0.0504
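
To make the metric concrete, here is a minimal plain-Python sketch of uplift@k in the spirit of strategy='overall': sort all clients by predicted uplift, keep the top share k, and subtract the control response rate from the treatment response rate within that top part. It is a simplified illustration on made-up toy data; tie handling and input validation are omitted, and the function name uplift_at_k_overall is hypothetical.

```python
def uplift_at_k_overall(y_true, uplift, treatment, k=0.3):
    """Response rate of treated minus control among the top-k share by predicted uplift."""
    order = sorted(range(len(uplift)), key=lambda i: uplift[i], reverse=True)
    top = order[:int(len(order) * k)]
    treated = [y_true[i] for i in top if treatment[i] == 1]
    control = [y_true[i] for i in top if treatment[i] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data for illustration only: 10 clients, no ties in the predicted uplift.
toy_uplift = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_treatment = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
toy_y = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

score = uplift_at_k_overall(toy_y, toy_uplift, toy_treatment, k=0.4)
print(score)
```

In the top four toy clients, both treated clients responded and one of the two control clients responded, so the sketch returns 1.0 - 0.5 = 0.5.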