The lale.lib.lale.AutoPipeline
operator automatically creates a pipeline for a dataset.
It is designed for simplicity, requiring minimum configuration to get started.
You can use it for an initial experiment with new data. See also the
API documentation.
import lale.datasets.openml
import lale.helpers
(orig_train_X, train_y), (test_X, test_y) = \
lale.datasets.openml.fetch('credit-g', 'classification', preprocess=False)
train_X = lale.helpers.add_missing_values(orig_train_X, seed=42)
print(f'train_X.shape {train_X.shape}')
train_X.shape (670, 20)
Printing the last few samples of the training data reveals
credit_amount=NaN
for sample number 763.
import pandas as pd
pd.options.display.max_columns = None
pd.concat([train_y.tail(6), train_X.tail(6)], axis=1)
class | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
763 | bad | no checking | 21.0 | critical/other existing credit | new car | NaN | no known savings | >=7 | 4.0 | male single | none | 4.0 | no known property | 30.0 | none | for free | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes |
835 | bad | <0 | 12.0 | no credits/all paid | new car | 1082.0 | <100 | 1<=X<4 | 4.0 | male single | none | 4.0 | car | 48.0 | bank | own | 2.0 | skilled | 1.0 | none | yes |
192 | bad | 0<=X<200 | 27.0 | existing paid | business | 3915.0 | <100 | 1<=X<4 | 4.0 | male single | none | 2.0 | car | 36.0 | none | own | 1.0 | skilled | 2.0 | yes | yes |
629 | good | no checking | 9.0 | existing paid | education | 3832.0 | no known savings | >=7 | 1.0 | male single | none | 4.0 | real estate | 64.0 | none | own | 1.0 | unskilled resident | 1.0 | none | yes |
559 | bad | 0<=X<200 | 18.0 | critical/other existing credit | furniture/equipment | 1928.0 | <100 | <1 | 2.0 | male single | none | 2.0 | real estate | 31.0 | none | own | 2.0 | unskilled resident | 1.0 | none | yes |
684 | good | 0<=X<200 | 36.0 | delayed previously | business | 9857.0 | 100<=X<500 | 4<=X<7 | 1.0 | male single | none | 3.0 | life insurance | 31.0 | none | own | 2.0 | unskilled resident | 2.0 | yes | yes |
We designed the AutoPipeline operator to follow sklearn's init-fit-predict conventions to make it easy to use for anyone familiar with sklearn.
prediction_type='classification'
; AutoPipeline also supports
'regression'
. The max_opt_time
is the timeout in seconds for finding a
pipeline for the dataset.fit
initiates training, which tries out various pipelines
on the dataset.predict
makes predictions using the best found
pipeline.from lale.lib.lale import AutoPipeline
trainable = AutoPipeline(prediction_type='classification', max_opt_time=90)
%%time
trained = trainable.fit(train_X, train_y)
CPU times: user 29.9 s, sys: 1.59 s, total: 31.5 s Wall time: 1min 39s
predicted = trained.predict(test_X)
The result of predict
can then be either used directly (e.g., printed),
or passed to other sklearn functions (e.g., for scoring).
from sklearn.metrics import accuracy_score
print(f'last 10 predictions: {predicted[-6:]}')
print(f'accuracy {accuracy_score(predicted, test_y):.1%}')
last 10 predictions: ['bad' 'bad' 'good' 'bad' 'good' 'bad'] accuracy 73.9%
After training, you can look at a leaderboard of all the pipelines tried
during the search by calling the summary
method.
trained.summary()
loss | time | log_loss | status | tid | |
---|---|---|---|---|---|
name | |||||
gbt_all | -0.705970 | 2.426831 | 0.721055 | ok | NaN |
p2 | -0.704478 | 2.649456 | 0.561390 | ok | 2.0 |
p1 | -0.698507 | 1.591414 | 0.570718 | ok | 1.0 |
baseline | -0.695522 | 0.077895 | NaN | ok | NaN |
p4 | -0.695522 | 2.758111 | NaN | ok | 4.0 |
p0 | -0.686567 | 1.758013 | 0.616135 | ok | 0.0 |
p3 | -0.683582 | 1.777791 | 0.765601 | ok | 3.0 |
gbt_num | -0.641791 | 1.130019 | 0.768635 | ok | NaN |
p5 | NaN | NaN | NaN | new | 5.0 |
The get_pipeline
method lets you retrieve any pipeline by name.
By default, when no name is specified, it returns the best pipeline
found, in other words, the pipeline with the lowest loss in the
leaderboard.
You can call predict
on that pipeline in typical sklearn fashion.
Furthermore, you can inspect that pipeline by calling visualize
.
best_found = trained.get_pipeline()
best_found.visualize()
The visualization reveals the operators from which the pipeline is
composed. Since our example dataset contains both numeric features and
categorical features, the pipeline contains two preprocessing paths,
using Project
to keep only the relevant columns.
When you hover the mouse pointer over an operator in the visualization,
a tooltip shows how it is configured. Furthermore, if you click on an
operator in the visualization, you get to a documentation page
for that operator.
You can also pretty-print any pipeline found during the search back as
Python code to better understand how its operators were configured.
best_found.pretty_print(ipython_display=True)
from lale.lib.lale import Project
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from lale.lib.lale import ConcatFeatures
from xgboost import XGBClassifier
import lale
lale.wrap_imported_operators()
project_0 = Project(
columns={"type": "number"},
drop_columns=lale.lib.lale.categorical(max_values=5),
)
project_1 = Project(columns=lale.lib.lale.categorical(max_values=5))
simple_imputer_1 = SimpleImputer(strategy="most_frequent")
one_hot_encoder = OneHotEncoder(handle_unknown="ignore")
pipeline = (
(
(project_0 >> SimpleImputer())
& (project_1 >> simple_imputer_1 >> one_hot_encoder)
)
>> ConcatFeatures
>> XGBClassifier()
)
You can try out the AutoPipeline operator on your own data. Alternatively, if you want more control, you also plan your own pipelines, then use Lale to do automated algorithm selection and hyperparameter tuning on them. Check out the other example notebooks for Lale, and in particular, the introductory guide.