Lenta is a Russian food retailer.
The Lenta dataset for uplift modeling contains data on Lenta customers' grocery shopping and the related marketing campaigns.
The dataset was originally released for the BIGTARGET Hackathon by LENTA and Microsoft and is accessible from the sklift.datasets module via the fetch_lenta function.
Read more about the dataset in the API docs.
import sys
# install uplift library scikit-uplift and other libraries
!{sys.executable} -m pip install scikit-uplift catboost scikit-learn seaborn matplotlib pandas numpy
from sklift.datasets import fetch_lenta
from sklift.models import ClassTransformation
from sklift.metrics import uplift_at_k
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
%matplotlib inline
# returns sklearn Bunch object
# with data, target, treatment keys
# data features (pd.DataFrame), target (pd.Series), treatment (pd.Series) values
dataset = fetch_lenta()
print(f"Dataset type: {type(dataset)}\n")
print(f"Dataset features shape: {dataset.data.shape}")
print(f"Dataset target shape: {dataset.target.shape}")
print(f"Dataset treatment shape: {dataset.treatment.shape}")
dataset.keys()
Dataset type: <class 'sklearn.utils.Bunch'>

Dataset features shape: (687029, 193)
Dataset target shape: (687029,)
Dataset treatment shape: (687029,)
dict_keys(['data', 'target', 'treatment', 'DESCR', 'feature_names', 'target_name', 'treatment_name'])
Dataset is a dictionary-like object with the following attributes:
- data (DataFrame object): Dataset without target and treatment.
- target (Series object): Target column values.
- treatment (Series object): Treatment column values.
- DESCR (str): Description of the Lenta dataset.
- feature_names (list): Names of the features.
- target_name (str): Name of the target.
- treatment_name (str): Name of the treatment.

Major columns:
- group (str): test/control group flag
- response_att (binary): target
- gender (str): customer gender
- age (float): customer age
- main_format (int): store type (1 - grocery store, 0 - superstore)

A detailed feature description can be found here.
We can specify the path to the destination folder and the name of the folder where the dataset should be stored with the data_home and dest_subdir parameters. By default the path is /.
# data_home, dest_subdir = "/etc", "data"
# dataset = fetch_lenta(data_home=data_home, dest_subdir=dest_subdir)
We can load and return data, target, and treatment directly by setting the return_X_y_t parameter to True. By default return_X_y_t=False.
# data, target, treatment = fetch_lenta(return_X_y_t=True)
treatment / control
fig, ax = plt.subplots(1,2, sharey=True, figsize=(15,4))
treatment = dataset["treatment"]
target = dataset["target"]
sns.countplot(x=treatment, ax=ax[0])
sns.countplot(x=target, ax=ax[1])
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d634c2400>
The current sample is unbalanced in terms of both treatment and target.
def crosstab_plot(treatment, target):
    # share of each target value within every treatment group (rows sum to 1)
    ct = pd.crosstab(treatment, target, normalize='index')
    sns.heatmap(ct, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r')
    plt.ylabel('Treatment')
    plt.xlabel('Target')
    plt.title("Treatment - Target", size = 15)

crosstab_plot(dataset.treatment, dataset.target)
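The row-normalized crosstab holds the response rate within each group, and subtracting the control rate from the test rate gives a naive estimate of the average campaign effect. A minimal sketch of that arithmetic on made-up toy labels (not Lenta data), using only the standard library; `response_rate` is a hypothetical helper introduced for illustration:

```python
# Toy labels for illustration only; these are NOT values from the Lenta dataset.
treatment = ["test", "test", "test", "test", "control", "control", "control", "control"]
target = [1, 1, 0, 1, 1, 0, 0, 0]


def response_rate(treatment, target, group):
    """Share of positive targets among rows whose treatment equals `group`."""
    outcomes = [y for t, y in zip(treatment, target) if t == group]
    return sum(outcomes) / len(outcomes)


rate_test = response_rate(treatment, target, "test")
rate_control = response_rate(treatment, target, "control")

# Difference of the group response rates: a naive average treatment effect.
naive_uplift = rate_test - rate_control
print(f"test: {rate_test:.3f}, control: {rate_control:.3f}, difference: {naive_uplift:.3f}")
```

On the real data the same two rates are the values in the right-hand column of the heatmap.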
fig, ax = plt.subplots(1,2, figsize=(15,4))
test_index = dataset.treatment[dataset.treatment == 'test'].index
control_index = dataset.treatment[dataset.treatment == 'control'].index
sns.distplot(dataset.data.loc[test_index, 'response_sms'], label='test', ax=ax[0])
sns.distplot(dataset.data.loc[control_index, 'response_sms'], label='control', ax=ax[0])
ax[0].title.set_text('Test & Control response SMS Distribution')
ax[0].legend()
sns.distplot(dataset.data.loc[test_index, 'age'], label='test', ax=ax[1])
sns.distplot(dataset.data.loc[control_index, 'age'], label='control', ax=ax[1])
ax[1].title.set_text('Test & Control age Distribution')
ax[1].legend()
<matplotlib.legend.Legend at 0x7f9d399d1100>
Clients from the test group tend to respond to SMS with a slightly greater probability than clients from the control group. The test and control groups do not differ in the clients' age distribution.
dataset.data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 687029 entries, 0 to 687028
Columns: 193 entries, age to stdev_discount_depth_1m
dtypes: float64(191), int64(1), object(1)
memory usage: 1011.6+ MB
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same head + tail preview
pd.concat([dataset.data.head(), dataset.data.tail()])
| | age | cheque_count_12m_g20 | cheque_count_12m_g21 | cheque_count_12m_g25 | cheque_count_12m_g32 | cheque_count_12m_g33 | cheque_count_12m_g38 | cheque_count_12m_g39 | cheque_count_12m_g41 | cheque_count_12m_g42 | cheque_count_12m_g45 | cheque_count_12m_g46 | cheque_count_12m_g48 | cheque_count_12m_g52 | cheque_count_12m_g56 | cheque_count_12m_g57 | cheque_count_12m_g58 | cheque_count_12m_g79 | cheque_count_3m_g20 | cheque_count_3m_g21 | cheque_count_3m_g25 | cheque_count_3m_g42 | cheque_count_3m_g45 | cheque_count_3m_g52 | cheque_count_3m_g56 | cheque_count_3m_g57 | cheque_count_3m_g79 | cheque_count_6m_g20 | cheque_count_6m_g21 | cheque_count_6m_g25 | cheque_count_6m_g32 | cheque_count_6m_g33 | cheque_count_6m_g38 | cheque_count_6m_g39 | cheque_count_6m_g40 | cheque_count_6m_g41 | cheque_count_6m_g42 | cheque_count_6m_g45 | cheque_count_6m_g46 | cheque_count_6m_g48 | cheque_count_6m_g52 | cheque_count_6m_g56 | cheque_count_6m_g57 | cheque_count_6m_g58 | cheque_count_6m_g79 | children | crazy_purchases_cheque_count_12m | crazy_purchases_cheque_count_1m | crazy_purchases_cheque_count_3m | crazy_purchases_cheque_count_6m | crazy_purchases_goods_count_12m | crazy_purchases_goods_count_6m | disc_sum_6m_g34 | food_share_15d | food_share_1m | gender | k_var_cheque_15d | k_var_cheque_3m | k_var_cheque_category_width_15d | k_var_cheque_group_width_15d | k_var_count_per_cheque_15d_g24 | k_var_count_per_cheque_15d_g34 | k_var_count_per_cheque_1m_g24 | k_var_count_per_cheque_1m_g27 | k_var_count_per_cheque_1m_g34 | k_var_count_per_cheque_1m_g44 | k_var_count_per_cheque_1m_g49 | k_var_count_per_cheque_3m_g24 | k_var_count_per_cheque_3m_g27 | k_var_count_per_cheque_3m_g32 | k_var_count_per_cheque_3m_g34 | k_var_count_per_cheque_3m_g41 | k_var_count_per_cheque_3m_g44 | k_var_count_per_cheque_6m_g24 | k_var_count_per_cheque_6m_g27 | k_var_count_per_cheque_6m_g32 | k_var_count_per_cheque_6m_g44 | k_var_days_between_visits_15d | k_var_days_between_visits_1m | k_var_days_between_visits_3m | k_var_disc_per_cheque_15d | k_var_disc_share_12m_g32 | k_var_disc_share_15d_g24 | k_var_disc_share_15d_g34 | k_var_disc_share_15d_g49 | k_var_disc_share_1m_g24 | k_var_disc_share_1m_g27 | k_var_disc_share_1m_g34 | k_var_disc_share_1m_g40 | k_var_disc_share_1m_g44 | k_var_disc_share_1m_g49 | k_var_disc_share_1m_g54 | k_var_disc_share_3m_g24 | k_var_disc_share_3m_g26 | k_var_disc_share_3m_g27 | k_var_disc_share_3m_g32 | k_var_disc_share_3m_g33 | k_var_disc_share_3m_g34 | k_var_disc_share_3m_g38 | k_var_disc_share_3m_g40 | k_var_disc_share_3m_g41 | k_var_disc_share_3m_g44 | k_var_disc_share_3m_g46 | k_var_disc_share_3m_g48 | k_var_disc_share_3m_g49 | k_var_disc_share_3m_g54 | k_var_disc_share_6m_g24 | k_var_disc_share_6m_g27 | k_var_disc_share_6m_g32 | k_var_disc_share_6m_g34 | k_var_disc_share_6m_g44 | k_var_disc_share_6m_g46 | k_var_disc_share_6m_g49 | k_var_disc_share_6m_g54 | k_var_discount_depth_15d | k_var_discount_depth_1m | k_var_sku_per_cheque_15d | k_var_sku_price_12m_g32 | k_var_sku_price_15d_g34 | k_var_sku_price_15d_g49 | k_var_sku_price_1m_g24 | k_var_sku_price_1m_g26 | k_var_sku_price_1m_g27 | k_var_sku_price_1m_g34 | k_var_sku_price_1m_g40 | k_var_sku_price_1m_g44 | k_var_sku_price_1m_g49 | k_var_sku_price_1m_g54 | k_var_sku_price_3m_g24 | k_var_sku_price_3m_g26 | k_var_sku_price_3m_g27 | k_var_sku_price_3m_g32 | k_var_sku_price_3m_g33 | k_var_sku_price_3m_g34 | k_var_sku_price_3m_g40 | k_var_sku_price_3m_g41 | k_var_sku_price_3m_g44 | k_var_sku_price_3m_g46 | k_var_sku_price_3m_g48 | k_var_sku_price_3m_g49 | k_var_sku_price_3m_g54 | k_var_sku_price_6m_g24 | k_var_sku_price_6m_g26 | k_var_sku_price_6m_g27 | k_var_sku_price_6m_g32 | k_var_sku_price_6m_g41 | k_var_sku_price_6m_g42 | k_var_sku_price_6m_g44 | k_var_sku_price_6m_g48 | k_var_sku_price_6m_g49 | main_format | mean_discount_depth_15d | months_from_register | perdelta_days_between_visits_15_30d | promo_share_15d | response_sms | response_viber | sale_count_12m_g32 | sale_count_12m_g33 | sale_count_12m_g49 | sale_count_12m_g54 | sale_count_12m_g57 | sale_count_3m_g24 | sale_count_3m_g33 | sale_count_3m_g57 | sale_count_6m_g24 | sale_count_6m_g25 | sale_count_6m_g32 | sale_count_6m_g33 | sale_count_6m_g44 | sale_count_6m_g54 | sale_count_6m_g57 | sale_sum_12m_g24 | sale_sum_12m_g25 | sale_sum_12m_g26 | sale_sum_12m_g27 | sale_sum_12m_g32 | sale_sum_12m_g44 | sale_sum_12m_g54 | sale_sum_3m_g24 | sale_sum_3m_g26 | sale_sum_3m_g32 | sale_sum_3m_g33 | sale_sum_6m_g24 | sale_sum_6m_g25 | sale_sum_6m_g26 | sale_sum_6m_g32 | sale_sum_6m_g33 | sale_sum_6m_g44 | sale_sum_6m_g54 | stdev_days_between_visits_15d | stdev_discount_depth_15d | stdev_discount_depth_1m |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 47.0 | 3.0 | 22.0 | 19.0 | 3.0 | 28.0 | 8.0 | 7.0 | 6.0 | 1.0 | 13.0 | 12.0 | 16.0 | 3.0 | 15.0 | 11.0 | 0.0 | 4.0 | 0.0 | 7.0 | 8.0 | 0.0 | 5.0 | 1.0 | 6.0 | 6.0 | 1.0 | 0.0 | 12.0 | 9.0 | 1.0 | 6.0 | 4.0 | 2.0 | 5.0 | 1.0 | 0.0 | 5.0 | 5.0 | 6.0 | 1.0 | 6.0 | 9.0 | 0.0 | 1.0 | 0.0 | 13.0 | 3.0 | 5.0 | 8.0 | 16.0 | 11.0 | 153.09 | 0.6488 | 0.3254 | Ж | 0.7288 | 1.8741 | 0.5263 | 0.7692 | NaN | NaN | 0.2917 | NaN | 0.6682 | 0.5592 | 0.400 | 0.5871 | 0.4654 | NaN | 0.6055 | 0.0000 | 0.5590 | 0.6183 | 0.4845 | NaN | 0.5471 | 0.4554 | 0.6479 | 0.8240 | 1.4055 | 1.4080 | NaN | NaN | NaN | 0.5208 | NaN | 0.5462 | NaN | 0.1559 | 0.0449 | 0.0000 | 0.8300 | 0.0115 | 0.3846 | NaN | 0.7418 | 0.5004 | 1.2014 | 1.3485 | 0.0000 | 1.2304 | 0.7229 | 0.5943 | 1.5156 | 0.0147 | 0.8036 | 0.6366 | NaN | 0.7793 | 1.2143 | 1.0723 | 1.3947 | 0.0123 | 0.4621 | 0.4864 | 0.7067 | 0.0589 | NaN | NaN | 0.5946 | 0.0823 | NaN | 0.1414 | NaN | 0.8669 | 0.3707 | 0.0000 | 0.7177 | 0.0866 | 1.3485 | NaN | 0.4640 | 0.3956 | 0.1930 | 0.0000 | 0.8019 | 0.1895 | 0.6128 | 2.1596 | 0.6810 | 0.6546 | 0.1300 | 1.2374 | NaN | NaN | 0.0000 | 0.8756 | 0.6718 | 2.0876 | 0 | 0.6055 | 18.0 | 1.3393 | 0.5821 | 0.923077 | 0.071429 | 10.0 | 84.314 | 98.0 | 16.0 | 11.0 | 137.282 | 28.776 | 6.0 | 169.658 | 10.680 | 7.0 | 28.776 | 21.0 | 8.0 | 9.0 | 4469.86 | 658.85 | 1286.32 | 7736.05 | 418.80 | 3233.31 | 811.73 | 2321.61 | 182.82 | 283.84 | 3648.23 | 3141.25 | 356.67 | 237.25 | 283.84 | 3648.23 | 1195.37 | 535.42 | 1.7078 | 0.2798 | 0.3008 |
1 | 57.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 55.99 | 0.0000 | 1.0000 | Ж | 0.0000 | 0.9630 | 0.0000 | 0.0000 | NaN | NaN | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.000 | 0.0000 | 0.0000 | NaN | 1.0102 | 0.0000 | 0.0000 | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 1.0027 | 0.0000 | NaN | NaN | NaN | NaN | 0.0000 | 0.0 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1094 | 0.0000 | NaN | NaN | 1.1289 | 0.0000 | 0.6188 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.4981 | 0.6382 | NaN | NaN | NaN | 1.1289 | 0.0000 | 0.0000 | 0.4981 | 0.6382 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.2072 | 0.0000 | NaN | NaN | 0.3993 | 0.8333 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.6192 | 0.5405 | NaN | 0.2072 | NaN | NaN | NaN | 0.0000 | 0.0000 | NaN | 0.6192 | 1 | 0.0000 | 4.0 | 0.0000 | 0.0000 | 1.000000 | 0.000000 | 1.0 | 1.000 | 2.0 | 2.0 | 0.0 | 0.000 | 1.000 | 0.0 | 1.744 | 2.000 | 1.0 | 1.000 | 0.0 | 2.0 | 0.0 | 113.39 | 62.69 | 58.71 | 93.35 | 87.01 | 0.00 | 122.98 | 0.00 | 58.71 | 87.01 | 179.83 | 113.39 | 62.69 | 58.71 | 87.01 | 179.83 | 0.00 | 122.98 | 0.0000 | 0.0000 | 0.0000 |
2 | 38.0 | 7.0 | 0.0 | 15.0 | 4.0 | 9.0 | 5.0 | 9.0 | 14.0 | 7.0 | 6.0 | 10.0 | 14.0 | 5.0 | 11.0 | 0.0 | 3.0 | 2.0 | 2.0 | 0.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 6.0 | 0.0 | 9.0 | 2.0 | 5.0 | 1.0 | 7.0 | 7.0 | 8.0 | 3.0 | 2.0 | 6.0 | 6.0 | 3.0 | 4.0 | 0.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 290.00 | 0.3739 | 0.4768 | М | NaN | 0.3295 | NaN | NaN | 0.0 | NaN | 0.0000 | 0.0 | 0.4159 | 0.8485 | 0.000 | 0.0000 | 0.0302 | 0.0 | 0.6009 | 0.6205 | 1.0035 | 0.5712 | 0.5762 | 0.4714 | 0.9830 | 0.0000 | NaN | 0.5559 | NaN | 0.9780 | 0.0 | NaN | NaN | 0.0000 | 0.0 | 0.0078 | NaN | 0.8362 | 1.3183 | 0.8560 | 0.0000 | 0.0000 | 0.6077 | 0.0 | 0.7665 | 1.2056 | NaN | 1.0002 | 0.0541 | 0.8461 | 0.5812 | 0.6945 | 1.7252 | 0.7579 | 0.6608 | 0.8560 | 0.9266 | 1.1554 | 0.7782 | 0.7471 | 2.0674 | 0.8871 | NaN | 0.1201 | NaN | 0.6629 | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0666 | NaN | 0.4668 | 1.3422 | 0.3536 | 0.0000 | 0.0000 | 0.0100 | 0.0 | 0.0457 | 0.2615 | 0.5856 | 0.2870 | 0.5238 | 0.2017 | 0.2840 | 1.8758 | 0.6338 | 0.2654 | 0.0000 | 0.4481 | 0.7673 | 0.2393 | 0.2851 | 0.5170 | 0.2407 | 2.5227 | 0 | 0.7256 | 34.0 | 0.0000 | 0.7256 | 1.000000 | 0.250000 | 5.0 | 21.102 | 50.0 | 109.0 | 0.0 | 0.000 | 7.594 | 0.0 | 25.294 | 11.084 | 3.0 | 11.158 | 31.0 | 59.0 | 0.0 | 1564.91 | 971.09 | 177.93 | 3257.49 | 975.21 | 2555.27 | 6351.29 | 0.00 | 0.00 | 0.00 | 783.87 | 1239.19 | 533.46 | 83.37 | 593.13 | 1217.43 | 1336.83 | 3709.82 | 0.0000 | NaN | 0.0803 |
3 | 65.0 | 6.0 | 3.0 | 25.0 | 2.0 | 10.0 | 14.0 | 11.0 | 8.0 | 1.0 | 0.0 | 2.0 | 6.0 | 7.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 11.0 | 2.0 | 3.0 | 5.0 | 5.0 | 4.0 | 2.0 | 1.0 | 0.0 | 1.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 51.81 | 0.0000 | 1.0000 | Ж | 0.0000 | 1.4933 | 0.0000 | 0.0000 | NaN | NaN | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.000 | 0.0000 | NaN | NaN | 0.0000 | 0.0000 | NaN | NaN | 0.3295 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.7432 | 0.0000 | 0.1315 | NaN | NaN | NaN | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | NaN | NaN | NaN | 0.7904 | 0.0050 | NaN | NaN | 0.0000 | NaN | 0.0000 | NaN | 0.0166 | 0.5362 | NaN | 0.5780 | 0.1315 | 0.3219 | 1.1290 | NaN | 1.7975 | 1.2530 | 0.0000 | 0.0000 | 0.0000 | 0.2354 | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | NaN | NaN | NaN | 0.1655 | 0.6560 | NaN | 0.0000 | NaN | 0.0000 | NaN | 0.1326 | 0.1477 | NaN | 0.7469 | 0.2352 | 0.2354 | 0.6846 | NaN | 0.2671 | 0.1028 | 3.0736 | 1 | 0.0000 | 40.0 | 0.0000 | 0.0000 | 0.909091 | 0.000000 | 2.0 | 12.544 | 49.0 | 39.0 | 0.0 | 0.000 | 2.778 | 0.0 | 2.000 | 34.212 | 2.0 | 3.778 | 2.0 | 13.0 | 0.0 | 358.22 | 3798.18 | 680.93 | 1425.07 | 175.73 | 602.81 | 3544.76 | 0.00 | 119.99 | 73.24 | 346.74 | 139.68 | 1849.91 | 360.40 | 175.73 | 496.73 | 172.58 | 1246.21 | 0.0000 | 0.0000 | 0.0000 |
4 | 61.0 | 0.0 | 1.0 | 2.0 | 0.0 | 2.0 | 1.0 | 0.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 2.0 | 1.0 | 0.0 | 8.0 | 2.0 | 2.0 | 1.0 | 1.0 | 4.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 4.0 | 1.0 | 1.0 | 2.0 | 4.0 | 2.0 | 161.12 | 0.2882 | 0.2882 | Ж | 0.9301 | 0.9014 | 0.8165 | 0.7542 | 0.0 | NaN | 0.0000 | 0.0 | NaN | 0.0000 | 0.000 | 0.0000 | 0.6682 | 0.0 | 0.7781 | 0.0000 | 0.0000 | 0.4826 | 0.7526 | 0.0000 | NaN | 0.4714 | 0.4714 | 0.9980 | 1.3497 | 0.0000 | 0.0 | NaN | 0.0000 | 0.0000 | 0.0 | NaN | NaN | 0.0000 | 0.0000 | 1.2273 | 0.0000 | 0.0249 | 0.9172 | 0.0 | NaN | 0.6549 | 0.0000 | 0.6684 | 0.4566 | 0.0000 | 0.0000 | 0.0138 | 1.2899 | 1.6070 | 1.0710 | 0.9058 | 0.0000 | 0.5955 | NaN | NaN | 1.6232 | 1.3194 | 0.4903 | 0.4903 | 0.9423 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0 | NaN | NaN | 0.0000 | 0.0000 | 0.2284 | 0.0000 | 0.1325 | 0.2146 | 0.0 | NaN | 0.1934 | 0.5189 | 0.0058 | 0.0000 | 0.0000 | 0.3868 | 2.0293 | 0.6860 | 0.6113 | 1.0926 | 0.2211 | 0.0000 | 0.0058 | 0.2496 | NaN | 0.2195 | 1.4917 | 0 | 0.7128 | 20.0 | 0.0000 | 0.7865 | 1.000000 | 0.100000 | 0.0 | 1.454 | 25.0 | 25.0 | 0.0 | 0.000 | 0.454 | 0.0 | 3.036 | 12.000 | 0.0 | 1.454 | 8.0 | 23.0 | 0.0 | 226.98 | 168.05 | 960.37 | 1560.21 | 0.00 | 342.45 | 1039.85 | 0.00 | 66.18 | 0.00 | 87.94 | 226.98 | 168.05 | 461.37 | 0.00 | 237.93 | 225.51 | 995.27 | 1.4142 | 0.3495 | 0.3495 |
687024 | 35.0 | 0.0 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 1.0 | 0.0 | 3.0 | 2.0 | 2.0 | 3.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 2.0 | 0.0 | 0.0 | 5.0 | 0.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 130.90 | 0.6214 | 0.6924 | Ж | 1.0931 | 1.1344 | 0.8240 | 0.8281 | NaN | 0.4714 | 0.4359 | NaN | 0.4714 | NaN | 0.441 | 0.4359 | NaN | 0.0 | 0.4714 | 0.0000 | NaN | 0.4359 | NaN | 0.0000 | NaN | 0.6614 | 0.5162 | 0.5162 | 1.0710 | 0.0000 | NaN | 1.1557 | 1.1622 | 0.0058 | NaN | 1.1557 | 1.0531 | NaN | 1.3709 | 0.0000 | 0.0058 | NaN | NaN | 0.0 | 0.2821 | 1.1557 | 0.0000 | 1.0531 | 0.0000 | NaN | 0.8648 | 1.2152 | 1.3709 | 0.0000 | 0.0058 | NaN | 0.0000 | 0.7422 | NaN | 0.8648 | 1.4424 | NaN | 0.6335 | 0.6177 | 0.9331 | 0.0000 | 0.0437 | 1.3395 | 0.0223 | NaN | NaN | 0.0437 | 0.5187 | NaN | 1.6498 | 0.0000 | 0.0223 | NaN | NaN | 0.0 | 0.0444 | 0.0437 | 0.5187 | 0.0000 | NaN | 0.4438 | 1.0347 | 1.6498 | 0.0000 | 0.0223 | NaN | NaN | 0.0000 | 0.0000 | 0.1156 | NaN | 1.0347 | 1.7847 | 0 | 0.5756 | 59.0 | 1.3333 | 0.4002 | 0.000000 | 0.166667 | 0.0 | 3.000 | 14.0 | 2.0 | 0.0 | 19.856 | 3.000 | 0.0 | 19.856 | 29.000 | 0.0 | 3.000 | 15.0 | 1.0 | 0.0 | 550.09 | 695.32 | 111.87 | 114.21 | 0.00 | 1173.84 | 147.68 | 550.09 | 111.87 | 0.00 | 330.96 | 550.09 | 669.33 | 111.87 | 0.00 | 330.96 | 1173.84 | 119.99 | 2.6458 | 0.3646 | 0.3282 |
687025 | 33.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0000 | 0.0000 | М | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.2382 | NaN | 0 | 0.0000 | 4.0 | 0.0000 | 0.0000 | 1.000000 | 0.000000 | 0.0 | 0.000 | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | 0.000 | 0.000 | 0.0 | 0.000 | 0.0 | 1.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 28.01 | NaN | NaN | NaN | NaN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 28.01 | 0.0000 | 0.0000 | 0.0000 |
687026 | 36.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.9847 | 0.9847 | М | NaN | NaN | NaN | NaN | 0.0 | 0.0000 | 0.0000 | 0.0 | 0.0000 | NaN | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0 | 0.0000 | NaN | 0.0000 | 0.0 | 0.0000 | 0.0000 | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | NaN | 0.0000 | NaN | NaN | NaN | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | NaN | 1 | 0.9847 | 66.0 | 0.0000 | 0.9847 | 1.000000 | 0.000000 | 0.0 | 0.000 | 5.0 | 3.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 | 15.0 | 0.0 | 0.0 | 0.00 | 155.97 | 23.99 | 41.51 | 0.00 | 615.77 | 87.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 449.01 | 0.00 | 0.0000 | NaN | NaN |
687027 | 37.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.3545 | 0.7263 | М | NaN | 0.2269 | NaN | NaN | 0.0 | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN | NaN | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.0000 | NaN | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | NaN | NaN | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.0000 | NaN | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | 0 | 0.8318 | 9.0 | 0.0000 | 0.8318 | 1.000000 | 0.000000 | 0.0 | 0.000 | 1.0 | 0.0 | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 | 0.476 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 | 0.00 | 81.90 | 29.82 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 46.72 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0000 | NaN | NaN |
687028 | 40.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 2.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 | 1.0 | 4.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.00 | 0.0000 | 0.0000 | Ж | 0.0000 | 0.8408 | 0.0000 | 0.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.9895 | 0.9970 | 0.0 | 0.0000 | 0.0000 | 0.6667 | 0.9895 | 0.9970 | 0.0000 | 0.6667 | 0.0000 | 0.0000 | 0.3536 | 0.0000 | 0.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.5264 | 0.0000 | 0.2564 | 0.0 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.1059 | 0.5745 | 0.2423 | 0.6481 | 0.3924 | 0.5264 | 0.2564 | 0.0000 | 0.0000 | 1.1059 | 0.5745 | 0.6481 | 0.3924 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.5179 | 0.0000 | 0.2672 | 0.0 | NaN | 0.0000 | 0.0000 | 0.0000 | 0.5834 | 0.1777 | 0.2541 | 1.2352 | 0.3381 | 0.5179 | 0.0000 | 0.2672 | 0.0000 | 0.0000 | NaN | 0.5834 | 0.2541 | 1.2352 | 0 | 0.0000 | 13.0 | 0.0000 | 0.0000 | 1.000000 | 0.100000 | 0.0 | 6.452 | 25.0 | 17.0 | 3.0 | 6.660 | 1.344 | 1.0 | 6.660 | 0.000 | 0.0 | 1.344 | 18.0 | 4.0 | 1.0 | 531.25 | 0.00 | 0.00 | 916.44 | 0.00 | 2407.56 | 1304.03 | 290.01 | 0.00 | 0.00 | 228.47 | 290.01 | 0.00 | 0.00 | 0.00 | 228.47 | 752.32 | 596.86 | 0.0000 | 0.0000 | 0.0000 |
# check NaN values ratio
pd.DataFrame({"Total" : dataset.data.isna().sum().sort_values(ascending = False),
              "Percentage" : round(dataset.data.isna().sum().sort_values(ascending = False) / len(dataset.data), 3)}).head(20)
| | Total | Percentage |
---|---|---|
k_var_sku_price_15d_g49 | 496259 | 0.722 |
k_var_disc_share_15d_g49 | 496159 | 0.722 |
k_var_count_per_cheque_15d_g34 | 468551 | 0.682 |
k_var_sku_price_15d_g34 | 468551 | 0.682 |
k_var_disc_share_15d_g34 | 468467 | 0.682 |
k_var_count_per_cheque_15d_g24 | 442121 | 0.644 |
k_var_disc_share_15d_g24 | 442054 | 0.643 |
k_var_sku_price_1m_g49 | 414473 | 0.603 |
k_var_count_per_cheque_1m_g49 | 414473 | 0.603 |
k_var_disc_share_1m_g49 | 414369 | 0.603 |
k_var_sku_price_1m_g54 | 388217 | 0.565 |
k_var_disc_share_1m_g54 | 388139 | 0.565 |
k_var_sku_price_1m_g34 | 385078 | 0.560 |
k_var_count_per_cheque_1m_g34 | 385078 | 0.560 |
k_var_disc_share_1m_g34 | 384997 | 0.560 |
k_var_sku_price_1m_g44 | 383315 | 0.558 |
k_var_count_per_cheque_1m_g44 | 383315 | 0.558 |
k_var_disc_share_1m_g44 | 383219 | 0.558 |
k_var_sku_price_1m_g40 | 380641 | 0.554 |
k_var_disc_share_1m_g40 | 380559 | 0.554 |
print('Total missed data percentage:',
      round(100*dataset.data.isna().sum().sum()/(dataset.data.shape[0]*dataset.data.shape[1]), 2), '%')
Total missed data percentage: 19.34 %
Transform the categorical columns gender and treatment into binary.
# make treatment binary
treat_dict = {
    'test': 1,
    'control': 0
}
dataset.treatment = dataset.treatment.map(treat_dict)
# make gender binary
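gender_dict = {
    '\u041c': 1,  # Cyrillic "М" (male), as shown in the data preview
    'M': 1,       # Latin "M", kept in case the value is stored that way
    'Ж': 0        # Cyrillic "Ж" (female)
}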
dataset.data.gender = dataset.data.gender.map(gender_dict)
f = plt.figure(figsize=(19, 15))
plt.matshow(dataset.data.corr(), fignum=f.number)
cb = plt.colorbar()
cb.ax.tick_params(axis=u'both', which=u'both',length=0)
plt.title('Correlation Matrix', fontsize=16);
Intuition: in a binary classification problem we stratify the train/validation split on the 0/1 target column. In uplift modeling there are two such columns, treatment and target, so we stratify on both of them.
stratify_cols = pd.concat([dataset.treatment, dataset.target], axis=1)
X_train, X_val, trmnt_train, trmnt_val, y_train, y_val = train_test_split(
    dataset.data,
    dataset.treatment,
    dataset.target,
    stratify=stratify_cols,
    test_size=0.3,
    random_state=42
)
print(f"Train shape: {X_train.shape}")
print(f"Validation shape: {X_val.shape}")
Train shape: (480920, 193)
Validation shape: (206109, 193)
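One way to see what stratifying on two columns does: each (treatment, target) pair acts as a single stratum label, and the split keeps the share of every pair roughly equal in both parts. The sketch below uses toy data and a hand-written `stratified_split` helper (hypothetical, standard library only). It illustrates the idea and is not how `train_test_split` is implemented.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) keeping each label's share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 100 rows with unbalanced treatment and target (not Lenta values).
treatment = [1] * 70 + [0] * 30
target = ([1] * 10 + [0] * 60) + ([1] * 10 + [0] * 20)

# Each (treatment, target) pair is one stratum label.
strata = list(zip(treatment, target))
train_idx, test_idx = stratified_split(strata, test_size=0.3)

print("train:", Counter(strata[i] for i in train_idx))
print("test: ", Counter(strata[i] for i in test_idx))
```

Every one of the four pairs appears in the test part with a 30% share of its rows, which is what stratifying on target alone would not guarantee for the treatment flag.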
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
estimator = CatBoostClassifier(verbose=100,
                               random_state=42,
                               thread_count=1)
ct_model = ClassTransformation(estimator=estimator)
my_pipeline = Pipeline([
    ('imputer', imp_mode),
    ('model', ct_model)
])
This is the usual pipeline fit, but with an additional treatment parameter, model__treatment=trmnt_train, which the pipeline forwards to the model step.
my_pipeline = my_pipeline.fit(
    X=X_train,
    y=y_train,
    model__treatment=trmnt_train
)
Learning rate set to 0.143939
0: learn: 0.6695107 total: 421ms remaining: 7m
100: learn: 0.5950043 total: 34.2s remaining: 5m 4s
200: learn: 0.5908539 total: 1m 5s remaining: 4m 21s
300: learn: 0.5870115 total: 1m 39s remaining: 3m 51s
400: learn: 0.5835003 total: 2m 13s remaining: 3m 19s
500: learn: 0.5800551 total: 2m 47s remaining: 2m 46s
600: learn: 0.5768127 total: 3m 21s remaining: 2m 13s
700: learn: 0.5736896 total: 3m 54s remaining: 1m 39s
800: learn: 0.5706878 total: 4m 27s remaining: 1m 6s
900: learn: 0.5676374 total: 5m 2s remaining: 33.2s
999: learn: 0.5647908 total: 5m 35s remaining: 0us
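For orientation, the class transformation approach wrapped by ClassTransformation trains an ordinary classifier on a new label that equals 1 when the target and the treatment flag agree (treated responders and untreated non-responders), then converts the predicted probability of that label into an uplift score as `2 * p - 1`. The sketch below shows only this bookkeeping on toy values. The helper names are hypothetical, the probabilities are made up, and the conversion assumes treatment and control groups of roughly equal size.

```python
def transform_target(y, treatment):
    """New label Z: 1 when the outcome and the treatment flag agree, else 0."""
    return [int(y_i == t_i) for y_i, t_i in zip(y, treatment)]


def uplift_from_proba(proba_z):
    """Map P(Z = 1 | x) to an uplift score in [-1, 1]."""
    return [2 * p - 1 for p in proba_z]


# Toy labels covering all four outcome/treatment combinations (not Lenta data).
y = [1, 0, 1, 0]
treatment = [1, 1, 0, 0]

z = transform_target(y, treatment)
print(z)  # [1, 0, 0, 1]

proba_z = [0.5, 0.75, 0.25]  # made-up classifier outputs
print(uplift_from_proba(proba_z))  # [0.0, 0.5, -0.5]
```

A probability of 0.5 for the transformed label therefore corresponds to zero predicted uplift, which is why the treatment column has to be passed at fit time.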
Predict uplift and calculate a basic uplift metric, uplift@30%: the uplift measured on the 30% of validation clients with the highest predicted uplift. Read more about the metric in the docs.
uplift_predictions = my_pipeline.predict(X_val)
uplift_30 = uplift_at_k(y_val, uplift_predictions, trmnt_val, strategy='overall', k=0.3)
print(f'uplift@30%: {uplift_30:.4f}')
uplift@30%: 0.0504
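To make the metric concrete: with strategy='overall', uplift@k can be read as ranking all validation rows by predicted uplift, keeping the top `k` share, and subtracting the control response rate from the treatment response rate within that slice. The function below is a plain-Python sketch of that reading on toy values, not the sklift implementation; `uplift_at_k_overall` is a hypothetical name.

```python
def uplift_at_k_overall(y_true, uplift, treatment, k=0.3):
    """Treated minus control response rate among the top-k share of rows by score."""
    order = sorted(range(len(uplift)), key=lambda i: uplift[i], reverse=True)
    top = order[: int(len(order) * k)]

    treated = [y_true[i] for i in top if treatment[i] == 1]
    control = [y_true[i] for i in top if treatment[i] == 0]
    if not treated or not control:
        raise ValueError("top-k slice must contain both treated and control rows")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy values (not Lenta data): 10 rows with distinct scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
treatment = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
uplift = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

# Top 40% = 4 rows: treated rate 2/2, control rate 1/2.
print(uplift_at_k_overall(y_true, uplift, treatment, k=0.4))  # 0.5
```

Read this way, the reported uplift@30% of 0.0504 means that among the 30% of validation clients ranked highest by the model, the response rate in the test group is about 5 percentage points higher than in the control group.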