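This is a simple example of how to use sklift.models with sklearn.pipeline.
The data is taken from the MineThatData E-Mail Analytics And Data Mining Challenge dataset by Kevin Hillstrom.
This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test: one third were randomly chosen to receive an e-mail campaign featuring Mens merchandise, one third to receive a campaign featuring Womens merchandise, and one third to receive no e-mail campaign.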
During a period of two weeks following the e-mail campaign, results were tracked. The task is to tell the world if the Mens or Womens e-mail campaign was successful.
The full description of the dataset can be found at the link.
First, install the necessary libraries:
!pip install scikit-uplift==0.1.2 xgboost==1.0.2 category_encoders==2.1.0
Next, load the data:
import urllib.request
import pandas as pd
csv_path = '/content/Hilstorm.csv'
url = 'http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv'
urllib.request.urlretrieve(url, csv_path)
('./content/Hilstorm.csv', <http.client.HTTPMessage at 0x117de0438>)
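To keep the example simple, we keep only two user segments: customers who received the Womens E-Mail (treatment = 1) and customers who received no e-mail (treatment = 0).
We will use the visit variable as the target variable.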
import pandas as pd
%matplotlib inline
dataset = pd.read_csv(csv_path)
print(f'Shape of the dataset before processing: {dataset.shape}')
dataset = dataset[dataset['segment'] != 'Mens E-Mail'].copy()
dataset['treatment'] = dataset['segment'].map({
'Womens E-Mail': 1,
'No E-Mail': 0
})
dataset = dataset.drop(['segment', 'conversion', 'spend'], axis=1)
print(f'Shape of the dataset after processing: {dataset.shape}')
dataset.head()
Shape of the dataset before processing: (64000, 12)
Shape of the dataset after processing: (42693, 10)
| | recency | history_segment | history | mens | womens | zip_code | newbie | channel | visit | treatment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 2) $100 - $200 | 142.44 | 1 | 0 | Surburban | 0 | Phone | 0 | 1 |
| 1 | 6 | 3) $200 - $350 | 329.08 | 1 | 1 | Rural | 1 | Web | 0 | 0 |
| 2 | 7 | 2) $100 - $200 | 180.65 | 0 | 1 | Surburban | 1 | Web | 0 | 1 |
| 4 | 2 | 1) $0 - $100 | 45.34 | 1 | 0 | Urban | 0 | Web | 0 | 1 |
| 5 | 6 | 2) $100 - $200 | 134.83 | 0 | 1 | Surburban | 0 | Phone | 1 | 1 |
Split the data into training and validation samples of equal size:
from sklearn.model_selection import train_test_split
Xyt_tr, Xyt_val = train_test_split(dataset, test_size=0.5, random_state=42)
X_tr = Xyt_tr.drop(['visit', 'treatment'], axis=1)
y_tr = Xyt_tr['visit']
treat_tr = Xyt_tr['treatment']
X_val = Xyt_val.drop(['visit', 'treatment'], axis=1)
y_val = Xyt_val['visit']
treat_val = Xyt_val['treatment']
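Before fitting, it can be useful to check how the treatment flag is distributed in each sample, since the split above is random and not stratified by treatment. The snippet below is a minimal sketch on toy data; `treatment_share` is a hypothetical helper, not part of sklift. With the notebook's data, one could pass `treat_tr` or `treat_val` instead of the toy list, or call `treat_tr.value_counts(normalize=True)` directly in pandas.

```python
from collections import Counter


def treatment_share(treatment):
    """Return the fraction of observations in each treatment group."""
    counts = Counter(treatment)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


# Toy treatment flags (1 = treated, 0 = control); not taken from the dataset.
toy_treatment = [1, 0, 1, 1, 0, 1, 0, 1]
shares = treatment_share(toy_treatment)
print(shares)  # {1: 0.625, 0: 0.375}
```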
Select categorical features:
cat_cols = X_tr.select_dtypes(include='object').columns.tolist()
print(cat_cols)
['history_segment', 'zip_code', 'channel']
Create the necessary objects and combine them into a pipeline:
from sklearn.pipeline import Pipeline
from category_encoders import CatBoostEncoder
from sklift.models import ClassTransformation
from xgboost import XGBClassifier
encoder = CatBoostEncoder(cols=cat_cols)
estimator = XGBClassifier(max_depth=2, random_state=42)
ct = ClassTransformation(estimator=estimator)
my_pipeline = Pipeline([
('encoder', encoder),
('model', ct)
])
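For background, the class transformation approach wrapped by `ClassTransformation` is usually described as follows: build a new binary target `z` that equals 1 when the outcome and the treatment flag agree (treated and responded, or not treated and did not respond), fit an ordinary classifier on `z`, and, when the treatment groups are balanced, estimate uplift as `2 * P(z = 1 | x) - 1`. The sketch below illustrates only this arithmetic on toy values; the function names are hypothetical and it is not sklift's implementation.

```python
def transform_target(y, treatment):
    """Return z = 1 where outcome and treatment flag agree, else 0."""
    return [int(y_i == t_i) for y_i, t_i in zip(y, treatment)]


def uplift_from_proba(p_z):
    """Convert P(z = 1 | x) into an uplift estimate via 2 * p - 1."""
    return [2 * p - 1 for p in p_z]


# Toy outcomes and treatment flags; not taken from the dataset.
y = [1, 0, 1, 0]
treatment = [1, 1, 0, 0]

z = transform_target(y, treatment)
print(z)  # [1, 0, 0, 1]

# Probabilities that a hypothetical classifier might output for z = 1.
print(uplift_from_proba([0.5, 0.75, 0.25]))  # [0.0, 0.5, -0.5]
```

A probability of 0.5 for `z = 1` therefore corresponds to zero estimated uplift, which is why the approach assumes roughly equal treatment and control group sizes.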
Train the pipeline as usual, but pass the treatment column to the model step through the fit parameter model__treatment.
my_pipeline = my_pipeline.fit(
X=X_tr,
y=y_tr,
model__treatment=treat_tr
)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py:354: UserWarning: It is recommended to use this approach on treatment balanced data. Current sample size is unbalanced.
  self._final_estimator.fit(Xt, y, **fit_params)
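The `model__treatment` argument in the fit call relies on scikit-learn's naming convention for pipelines: a fit parameter written as `<step name>__<parameter>` is forwarded to the `fit` method of that step. The sketch below shows this routing with a toy final step; `RecordingModel` is a hypothetical stand-in for `ClassTransformation` and simply stores the extra argument it receives.

```python
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingModel(BaseEstimator):
    """Toy final step that stores the extra `treatment` fit argument."""

    def fit(self, X, y, treatment=None):
        self.treatment_ = list(treatment)
        return self


pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RecordingModel()),
])

# 'model__treatment' is routed to RecordingModel.fit as `treatment`.
pipe.fit([[0.0], [1.0], [2.0]], [0, 1, 0], model__treatment=[1, 0, 1])
print(pipe.named_steps['model'].treatment_)  # [1, 0, 1]
```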
Predict the uplift on the validation sample and calculate uplift@30% (the k argument of uplift_at_k defaults to 0.3):
from sklift.metrics import uplift_at_k
uplift_predictions = my_pipeline.predict(X_val)
uplift_30 = uplift_at_k(y_val, uplift_predictions, treat_val, strategy='overall')
print(f'uplift@30%: {uplift_30:.4f}')
uplift@30%: 0.0661
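To make the metric concrete: with the 'overall' strategy, uplift@k is commonly described as sorting all observations by predicted uplift, keeping the top fraction k, and taking the difference between the response rates of the treated and control observations in that slice. The function below is a minimal pure-Python sketch of that description on toy data; `uplift_at_top_k` is a hypothetical name, and details such as rounding of the cut-off or tie handling may differ from `sklift.metrics.uplift_at_k`.

```python
def uplift_at_top_k(y_true, uplift, treatment, k=0.3):
    """Response-rate difference (treated minus control) in the top-k slice."""
    # Indices ordered from the highest to the lowest predicted uplift.
    order = sorted(range(len(uplift)), key=lambda i: uplift[i], reverse=True)
    top = order[:int(len(order) * k)]

    treated = [y_true[i] for i in top if treatment[i] == 1]
    control = [y_true[i] for i in top if treatment[i] == 0]
    if not treated or not control:
        raise ValueError('Top-k slice must contain both groups.')

    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data; not taken from the dataset.
scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.0, -0.1, -0.2]
treat = [1, 0, 1, 0, 1, 0, 1, 0]
visits = [1, 0, 1, 1, 0, 1, 0, 0]

result = uplift_at_top_k(visits, scores, treat, k=0.5)
print(result)  # 0.5
```

In this toy case the top half holds two treated observations (both visited) and two control observations (one visited), so the difference is 1.0 - 0.5 = 0.5.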