Pipelines are just a series of steps you perform on data in sklearn. (The sklearn guide to them is here.)
A "typical" pipeline in ML projects
{tip}
You can set up pipelines with `make_pipeline`.
For example, here is a simple pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
# transform_output="pandas" might slow sklearn down a bit, but use it during development to
# facilitate EDA/ABCD, because we can see the transformed data with variable names
from sklearn import set_config
set_config(transform_output="pandas")
ridge_pipe = make_pipeline(SimpleImputer(),Ridge(1.0))
You put a series of steps inside `make_pipeline`, separated by commas.
The pipeline object (printed out below) is a list of steps, where each step has a name (e.g. "simpleimputer") and an estimator associated with that name (e.g. SimpleImputer()).
ridge_pipe
Pipeline(steps=[('simpleimputer', SimpleImputer()), ('ridge', Ridge())])
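As a side note, those auto-generated step names can be used to pull an individual step back out of the pipeline. A minimal sketch, using the `ridge_pipe` defined above:

```python
# steps are stored under the lowercased class names that make_pipeline assigns
ridge_pipe.named_steps['simpleimputer']   # the SimpleImputer() step
ridge_pipe.named_steps['ridge']           # the Ridge(alpha=1.0) step
```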
{tip}
You can `.fit()` and `.predict()` pipelines like any model, and they can be used in `cross_validate` too!
Using a pipeline is the same as using any estimator! After I load the data we've been using on the last two pages (code below), we can fit and predict just like on the "one model intro" page:
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_validate
url = 'https://github.com/LeDataSciFi/data/blob/main/Fannie%20Mae/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip').dropna()
y = fannie_mae.Original_Interest_Rate
fannie_mae = (fannie_mae
              .assign(l_credscore=np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                      l_LTV=np.log(fannie_mae['Original_LTV_(OLTV)']))
              .iloc[:, -11:]
              )
rng = np.random.RandomState(0) # this helps us control the randomness so we can reproduce results exactly
X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)
ridge_pipe.fit(X_train,y_train)
ridge_pipe.predict(X_test)
array([5.95256433, 4.20060942, 3.9205946 , ..., 4.06401663, 5.30024985, 7.32600213])
Those are the same numbers as before - good!
We can use this pipeline in our cross validation in place of the estimator:
cross_validate(ridge_pipe, X_train, y_train,
               cv=KFold(5), scoring='r2')['test_score'].mean()
0.9030537085469961
{warning}
(Virtually) all preprocessing should be done inside the pipeline! If you preprocess outside the pipeline, the transformations get fit on data that includes your validation folds (or even the test set), which leaks information and inflates your scores.
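To make that warning concrete, here is a minimal sketch (the `StandardScaler` step and the `leak_free_pipe` name are just for illustration, not code used later on this page) of the leak-free way to add a transformer: keep it inside the pipeline, so `cross_validate` re-fits it on each training fold.

```python
from sklearn.preprocessing import StandardScaler

# leak-prone: fitting a scaler on all of X_train before cross-validation
# means every fold's "test" rows already influenced the scaling
# X_scaled = StandardScaler().fit_transform(X_train)

# leak-free: the scaler lives inside the pipeline and is re-fit on each training fold
leak_free_pipe = make_pipeline(SimpleImputer(), StandardScaler(), Ridge(1.0))
```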
This is the link you should start with to see how you might clean and preprocess data. Key preprocessing steps include imputing missing values, standardizing or scaling numerical variables, and encoding categorical variables.
With real-world data, you'll have many data types. So the preprocessing steps you apply to one column won't necessarily be what the next column needs.
I use `ColumnTransformer` to assemble the preprocessing portion of my full pipeline, because it allows me to process different variables differently.
The generic steps to preprocess in a pipeline:

1. `ColumnTransformer()` is a function, so it needs the parentheses "()".
2. Its first argument is a list, so: `ColumnTransformer([])`.
3. Inside that list, `ColumnTransformer([<here!>])`, you put one tuple for each group of variables that gets its own treatment. Each tuple contains a name, a transformer (or pipeline of transformers), and the variables to apply it to.
4. The resulting `ColumnTransformer` is set as the first step inside your glorious estimation pipeline.

So, let me put this together:
{tip}
This is good pseudocode to copy!
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
#############
# Step 1: how to deal with numerical vars
# pro-tip: you might set up several numeric pipelines, because
# some variables might need very different treatment!
#############
numer_pipe = make_pipeline(SimpleImputer())
# SimpleImputer() fills in missing values (with the column mean, by default)
# you might also standardize the vars in this numer_pipe (e.g., StandardScaler)
#############
# Step 2: how to deal with categorical vars
#############
cat_pipe = make_pipeline(OneHotEncoder(drop='first', sparse_output=False))
# notes on this cat pipe:
# OneHotEncoder is just one way to deal with categorical vars
# drop='first' avoids perfect collinearity (the dummy variable trap) in regression models
# sparse_output=False might slow down sklearn, BUT IT IS REQUIRED when using
# set_config(transform_output="pandas") !!!
#############
# Step 3: combine the subparts
#############
preproc_pipe = ColumnTransformer(
    [  # arg 1 of ColumnTransformer is a list, so this starts the list
        # a tuple for the numerical vars: name, pipe, which vars to apply to
        ("num_impute", numer_pipe, ['l_credscore', 'TCMR']),
        # a tuple for the categorical vars: name, pipe, which vars to apply to
        ("cat_trans", cat_pipe, ['Property_state'])
    ],
    remainder='drop'  # you either drop or passthrough any vars not modified above
)
#############
# Step 4: put the preprocessing into an estimation pipeline
#############
new_ridge_pipe = make_pipeline(preproc_pipe,Ridge(1.0))
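One aside: `make_column_selector` was imported above but not used. Instead of hard-coding column names, it can pick columns by dtype (or by a name pattern). A minimal sketch, reusing the pipes defined above (the `preproc_pipe_by_dtype` name is just for illustration):

```python
# select columns by dtype instead of listing them by name
preproc_pipe_by_dtype = ColumnTransformer(
    [
        ("num_impute", numer_pipe, make_column_selector(dtype_include=np.number)),
        ("cat_trans", cat_pipe, make_column_selector(dtype_include=object)),
    ],
    remainder='drop',
)
```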
The data loaded above has no categorical variables, so I'm going to reload the data and keep new variables to illustrate what we can do:

- `'TCMR'` and `'l_credscore'` are numerical
- `'Property_state'` is categorical
- `'l_LTV'` is in the data, but should be dropped (because of `remainder='drop'`)

So here is the raw data:
url = 'https://github.com/LeDataSciFi/data/blob/main/Fannie%20Mae/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip').dropna()
y = fannie_mae.Original_Interest_Rate
fannie_mae = (fannie_mae
              .assign(l_credscore=np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                      l_LTV=np.log(fannie_mae['Original_LTV_(OLTV)']))
              [['TCMR', 'Property_state', 'l_credscore', 'l_LTV']]
              )
rng = np.random.RandomState(0) # this helps us control the randomness so we can reproduce results exactly
X_train, X_test, y_train, y_test = train_test_split(fannie_mae, y, random_state=rng)
display(X_train.head())
display(X_train.describe().T.round(2))
| | TCMR | Property_state | l_credscore | l_LTV |
|---|---|---|---|---|
| 4326 | 4.651500 | IL | 6.670766 | 4.499810 |
| 15833 | 4.084211 | TN | 6.652863 | 4.442651 |
| 66753 | 3.675000 | MO | 6.635947 | 4.442651 |
| 23440 | 3.998182 | MO | 6.548219 | 4.553877 |
| 4155 | 4.651500 | CO | 6.602588 | 4.442651 |
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| TCMR | 7938.0 | 3.36 | 1.29 | 1.50 | 2.21 | 3.00 | 4.45 | 6.66 |
| l_credscore | 7938.0 | 6.60 | 0.07 | 6.27 | 6.55 | 6.61 | 6.66 | 6.72 |
| l_LTV | 7938.0 | 4.51 | 0.05 | 4.25 | 4.49 | 4.50 | 4.55 | 4.57 |
We could `.fit()` and `.transform()` using the `preproc_pipe` from step 3 (or just `.fit_transform()` to do it in one command) to see how it transforms the data.
transformed_Xtrain = preproc_pipe.fit_transform(X_train)
transformed_Xtrain
num_impute__l_credscore | num_impute__TCMR | cat_trans__Property_state_AL | cat_trans__Property_state_AR | cat_trans__Property_state_AZ | cat_trans__Property_state_CA | cat_trans__Property_state_CO | cat_trans__Property_state_CT | cat_trans__Property_state_DC | cat_trans__Property_state_DE | ... | cat_trans__Property_state_SD | cat_trans__Property_state_TN | cat_trans__Property_state_TX | cat_trans__Property_state_UT | cat_trans__Property_state_VA | cat_trans__Property_state_VT | cat_trans__Property_state_WA | cat_trans__Property_state_WI | cat_trans__Property_state_WV | cat_trans__Property_state_WY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4326 | 6.670766 | 4.651500 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
15833 | 6.652863 | 4.084211 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
66753 | 6.635947 | 3.675000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
23440 | 6.548219 | 3.998182 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4155 | 6.602588 | 4.651500 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
123118 | 6.650279 | 1.556522 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
69842 | 6.647688 | 2.416364 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
51872 | 6.507278 | 6.054000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
128800 | 6.618739 | 2.303636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
46240 | 6.639876 | 4.971304 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7938 rows × 53 columns
Notice that the `l_LTV` column is gone!

display(transformed_Xtrain
        .describe().T.round(2)
        .iloc[:7, :])  # only show a few variables for space...
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| num_impute__l_credscore | 7938.0 | 6.60 | 0.07 | 6.27 | 6.55 | 6.61 | 6.66 | 6.72 |
| num_impute__TCMR | 7938.0 | 3.36 | 1.29 | 1.50 | 2.21 | 3.00 | 4.45 | 6.66 |
| cat_trans__Property_state_AL | 7938.0 | 0.02 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| cat_trans__Property_state_AR | 7938.0 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| cat_trans__Property_state_AZ | 7938.0 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| cat_trans__Property_state_CA | 7938.0 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| cat_trans__Property_state_CO | 7938.0 | 0.03 | 0.16 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
And this new pipeline (`new_ridge_pipe`) can be used just like the first one: `.fit()` and `.predict()`, put into cross-validation, and so on.
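For instance, here is a quick sketch of re-running the earlier cross-validation, but with the new pipeline (output omitted here):

```python
# the preprocessing inside new_ridge_pipe is re-fit on each training fold,
# so nothing leaks across folds
cross_validate(new_ridge_pipe, X_train, y_train,
               cv=KFold(5), scoring='r2')['test_score'].mean()
```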