Lale comes with several library operators, so you do not need to write your own. But if you want to contribute new operators, this tutorial is for you. First, let us review some basic concepts in Lale from the point of view of adding new operators (estimators and transformers). Lale is a library for semi-automated data science, designed for three goals: automation, correctness checks, and interoperability.
To enable the above properties for your operators with Lale, you need to:

1. Write an operator implementation class with methods `__init__`, `fit`, and `predict` or `transform`. If you have a custom estimator or transformer as per scikit-learn, you can skip this step, as that is already a valid Lale operator.
2. Register the class as a Lale operator using `lale.operators.make_operator`. This step automatically creates a JSON schema skeleton for your operator.
3. Customize the operator's schemas, for instance, declaring hyperparameter ranges or cross-cutting constraints such as "solver `abc` only supports penalty `xyz`".
4. Test and use the new operator, for instance, for manual training and prediction or for automated hyperparameter optimization.
5. Optionally, contribute the operator to the Lale open-source project.

The next sections illustrate these five steps using an example. After the example-driven sections, this document concludes with a reference covering features from the example and beyond. This document focuses on individual operators. Pipelines that compose multiple operators are documented elsewhere.
This section can be skipped if you already have a scikit-learn compatible estimator or transformer class with methods `__init__`, `fit`, and `predict` or `transform`. Any other compatibility with scikit-learn, such as `get_params` or `set_params`, is optional, and so is extending from `sklearn.base.BaseEstimator`.
This section illustrates how to implement this class with the help of an example. The running example in this document is a simple custom operator that just wraps the `LogisticRegression` estimator from scikit-learn. Of course, you can write a similar class to wrap your own operators, which do not need to come from scikit-learn. The following code defines a class `_MyLRImpl`.
import sklearn.linear_model

class _MyLRImpl:
    def __init__(self, **hyperparams):
        self._wrapped_model = sklearn.linear_model.LogisticRegression(
            **hyperparams)

    def fit(self, X, y):
        self._wrapped_model.fit(X, y)
        return self

    def predict(self, X, **kwargs):
        return self._wrapped_model.predict(X, **kwargs)
This code first imports the relevant scikit-learn package. Then, it declares a new class for wrapping it. Currently, Lale only supports Python, but eventually, it will also support other programming languages. Therefore, the Lale approach for wrapping new operators carefully avoids depending too much on the Python language or any particular Python library. Hence, the `_MyLRImpl` class does not need to inherit from anything, but it does need to follow certain conventions:
- It has a constructor, `__init__`, whose arguments are the hyperparameters; it creates an instance of the scikit-learn `LogisticRegression` operator configured with them.
- It has a training method, `fit`, with an argument `X` containing the training examples and, in the case of supervised models, an argument `y` containing labels. The `fit` method trains the wrapped operator and returns the wrapper object.
- It has a prediction method, `predict` for an estimator or `transform` for a transformer. The method has an argument `X` containing the test examples and returns the labels for `predict` or the transformed data for `transform`.
These conventions are designed to be similar to those of scikit-learn. However, they avoid a code dependency upon scikit-learn.
Note that in a simple example like this, the underlying `sklearn.linear_model.LogisticRegression` class could be used directly, without needing the `_MyLRImpl` wrapper. However, creating such a wrapper is useful for more complicated examples.
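The same conventions carry over to transformers, whose prediction method is `transform` instead of `predict`. The following is a minimal sketch (the class name `_MyScalerImpl` and the choice of wrapping scikit-learn's `StandardScaler` are just for illustration, not part of the running example):

import sklearn.preprocessing

class _MyScalerImpl:
    def __init__(self, **hyperparams):
        # Wrap a scikit-learn transformer, configured with the hyperparameters.
        self._wrapped_model = sklearn.preprocessing.StandardScaler(**hyperparams)

    def fit(self, X, y=None):
        # Train the wrapped transformer and return the wrapper object.
        self._wrapped_model.fit(X, y)
        return self

    def transform(self, X):
        # Return the transformed data rather than predicted labels.
        return self._wrapped_model.transform(X)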
We can now register `_MyLRImpl` as a new Lale operator `MyLR`.
import lale.operators
MyLR = lale.operators.make_operator(_MyLRImpl)
The call to `make_operator` automatically creates a skeleton JSON schema for the hyperparameters and the operator methods, and attaches it to `MyLR`.
from lale.pretty_print import ipython_display
ipython_display(MyLR._schemas)
{
"$schema": "http://json-schema.org/draft-04/schema#",
"description": "Schema for <class 'type'> auto-generated by lale.type_checking.get_default_schema().",
"type": "object",
"tags": {"pre": [], "op": ["estimator"], "post": []},
"properties": {
"hyperparams": {
"allOf": [
{
"type": "object",
"properties": {},
"relevantToOptimizer": [],
}
]
},
"input_fit": {
"type": "object",
"properties": {
"X": {"laleType": "Any"},
"y": {"laleType": "Any"},
},
"additionalProperties": false,
"required": ["X", "y"],
},
"input_predict": {
"type": "object",
"properties": {"X": {"laleType": "Any"}},
"additionalProperties": false,
"required": ["X"],
},
"output_predict": {"laleType": "Any"},
},
}
Lale requires schemas both for error-checking and for generating search spaces for hyperparameter optimization. The schemas of a Lale operator specify the space of valid values for hyperparameters, for the arguments to `fit` and `predict` or `transform`, and for the output of `predict` or `transform`. To keep the schemas independent of the Python programming language, they are expressed as JSON Schema. JSON Schema is currently a draft standard and is already widely adopted and implemented, for instance, as part of specifying Swagger APIs.
The schema of a Lale operator can be incrementally customized using the `customize_schema` method, which returns a copy of the operator with the customized schema. The `customize_schema` method also validates the new schema for early error reporting. Instead of manually writing the schemas, which can be error-prone, we provide a dedicated API to help with authoring operator schemas.
The running example chooses hyperparameters of scikit-learn LogisticRegression that illustrate all the interesting cases. More complete and elaborate examples can be found in the Lale standard library. The following specifies each hyperparameter one at a time, omitting cross-cutting constraints.
from lale.schemas import Null, Enum, Int, Float, Object, Array, Not, AnyOf

MyLR = MyLR.customize_schema(
    relevantToOptimizer=['solver', 'penalty', 'C'],
    solver=Enum(desc='Algorithm for optimization problem.',
                values=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                default='liblinear'),
    penalty=Enum(desc='Norm used in the penalization.',
                 values=['l1', 'l2'],
                 default='l2'),
    C=Float(desc='Inverse regularization strength. '
                 'Smaller values specify stronger regularization.',
            minimum=0.0, exclusiveMinimum=True,
            minimumForOptimizer=0.03125, maximumForOptimizer=32768,
            distribution='loguniform',
            default=1.0))
Here, `solver` and `penalty` are categorical hyperparameters and `C` is a continuous hyperparameter. For all three hyperparameters, the schema includes a description, used for interactive documentation, and a default value, used when no explicit value is specified. The categorical hyperparameters are then specified as enumerations of their legal values. In contrast, the continuous hyperparameter is a number, and the schema includes additional information such as its distribution, minimum, and maximum. In the example, `C` has `'minimum': 0.0` with an exclusive minimum, indicating that only positive values are valid. Furthermore, `C` has `'minimumForOptimizer': 0.03125` and `'maximumForOptimizer': 32768`, guiding the optimizer to limit its search space.
Besides specifying hyperparameters one at a time, users may also want to specify cross-cutting constraints to further restrict the hyperparameter schema. This part is an advanced use case and can be skipped by novice users.
MyLR = MyLR.customize_schema(
    constraint=AnyOf([Object(solver=Not(Enum(['newton-cg', 'sag', 'lbfgs']))),
                      Object(penalty=Enum(['l2']))]))
In JSON Schema, `allOf` is a logical "and", `anyOf` is a logical "or", and `not` is a logical negation. Thus, the `anyOf` part of the example can be read as

assert not (solver in ['newton-cg', 'sag', 'lbfgs']) or penalty == 'l2'
By standard Boolean rules, this is equivalent to a logical implication:
if solver in ['newton-cg', 'sag', 'lbfgs']:
    assert penalty == 'l2'
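Once this constraint is attached, configuring `MyLR` with a forbidden combination raises a validation error, in the same way as the error-checking example shown later in this document. A quick sketch:

import jsonschema, sys

try:
    # Violates the constraint: solver 'sag' only supports penalty 'l2'.
    MyLR(solver='sag', penalty='l1')
except jsonschema.ValidationError as e:
    print(e.message, file=sys.stderr)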
The complete hyperparameters schema simply combines the ranges with the constraints:
ipython_display(MyLR.hyperparam_schema())
{
"allOf": [
{
"type": "object",
"properties": {
"solver": {
"default": "liblinear",
"description": "Algorithm for optimization problem.",
"enum": [
"newton-cg", "lbfgs", "liblinear", "sag", "saga",
],
},
"penalty": {
"default": "l2",
"description": "Norm used in the penalization.",
"enum": ["l1", "l2"],
},
"C": {
"default": 1.0,
"description": "Inverse regularization strength. Smaller values specify stronger regularization.",
"type": "number",
"minimum": 0.0,
"exclusiveMinimum": true,
"minimumForOptimizer": 0.03125,
"maximumForOptimizer": 32768,
"distribution": "loguniform",
},
},
"relevantToOptimizer": ["solver", "penalty", "C"],
},
{
"anyOf": [
{
"type": "object",
"properties": {
"solver": {
"not": {"enum": ["newton-cg", "sag", "lbfgs"]}
}
},
},
{
"type": "object",
"properties": {"penalty": {"enum": ["l2"]}},
},
]
},
]
}
The next step is to specify the expected input and output types of the methods `fit` and `predict` or `transform`.
The `fit` method of `MyLR` takes two arguments, `X` and `y`. The `X` argument is an array of arrays of numbers. The outer array is over samples (rows) of a dataset. The inner array is over features (columns) of a sample. The `y` argument is an array of non-negative numbers. Each element of `y` is a label for the corresponding sample in `X`.
MyLR = MyLR.customize_schema(
    input_fit=Object(
        required=['X', 'y'],
        additionalProperties=False,
        X=Array(items=Array(items=Float())),
        y=Array(items=Float())))
The schema for the arguments of the `predict` method is similar, just omitting `y`:
MyLR = MyLR.customize_schema(
    input_predict=Object(
        required=['X'],
        X=Array(items=Array(items=Float())),
        additionalProperties=False))
The output schema indicates that the `predict` method returns an array of labels with the same schema as `y`:
MyLR = MyLR.customize_schema(output_predict=Array(items=Float()))
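Because these are plain JSON schemas, you can sanity-check them against example data with any JSON Schema validator; the snippet below uses the `jsonschema` package purely for illustration (Lale itself also checks data against these schemas):

import jsonschema

# The same shape as the declared schema for X: an array of arrays of numbers.
X_schema = {'type': 'array', 'items': {'type': 'array', 'items': {'type': 'number'}}}
X_example = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
jsonschema.validate(X_example, X_schema)  # passes silently; raises ValidationError on mismatch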
Schemas can request transparent forwarding of specified methods and properties to the implementation. For example, to add support for calling `intercept_` directly on a `MyLR` object:
MyLR = MyLR.customize_schema(forwards=["intercept_"])
This is supported only for methods and properties whose names do not conflict with names that Lale uses in its operator wrapper classes. Methods and properties can still always be called via `MyLR.impl.method()`.
Finally, we can add tags for discovery and documentation.
MyLR = MyLR.customize_schema(
    tags={'pre': ['~categoricals'],
          'op': ['estimator', 'classifier', 'interpretable'],
          'post': ['probabilities']})
We now have complete JSON schemas for our `MyLR` operator:
ipython_display(MyLR._schemas)
{
"$schema": "http://json-schema.org/draft-04/schema#",
"description": "Schema for <class 'type'> auto-generated by lale.type_checking.get_default_schema().",
"type": "object",
"tags": {
"pre": ["~categoricals"],
"op": ["estimator", "classifier", "interpretable"],
"post": ["probabilities"],
},
"properties": {
"hyperparams": {
"allOf": [
{
"type": "object",
"properties": {
"solver": {
"default": "liblinear",
"description": "Algorithm for optimization problem.",
"enum": [
"newton-cg", "lbfgs", "liblinear", "sag",
"saga",
],
},
"penalty": {
"default": "l2",
"description": "Norm used in the penalization.",
"enum": ["l1", "l2"],
},
"C": {
"default": 1.0,
"description": "Inverse regularization strength. Smaller values specify stronger regularization.",
"type": "number",
"minimum": 0.0,
"exclusiveMinimum": true,
"minimumForOptimizer": 0.03125,
"maximumForOptimizer": 32768,
"distribution": "loguniform",
},
},
"relevantToOptimizer": ["solver", "penalty", "C"],
},
{
"anyOf": [
{
"type": "object",
"properties": {
"solver": {
"not": {
"enum": ["newton-cg", "sag", "lbfgs"]
}
}
},
},
{
"type": "object",
"properties": {"penalty": {"enum": ["l2"]}},
},
]
},
]
},
"input_fit": {
"type": "object",
"required": ["X", "y"],
"additionalProperties": false,
"properties": {
"X": {
"type": "array",
"items": {"type": "array", "items": {"type": "number"}},
},
"y": {"type": "array", "items": {"type": "number"}},
},
},
"input_predict": {
"type": "object",
"required": ["X"],
"additionalProperties": false,
"properties": {
"X": {
"type": "array",
"items": {"type": "array", "items": {"type": "number"}},
}
},
},
"output_predict": {"type": "array", "items": {"type": "number"}},
},
}
Once your operator implementation and schema definitions are ready, you can test and use the operator with Lale as follows. First, you will need to install Lale, as described in the installation instructions.
Before demonstrating the new `MyLR` operator, the following code loads the Iris dataset, which comes out-of-the-box with scikit-learn.
import sklearn.datasets
import sklearn.utils
iris = sklearn.datasets.load_iris()
X_all, y_all = sklearn.utils.shuffle(iris.data, iris.target, random_state=42)
holdout_size = 30
X_train, y_train = X_all[holdout_size:], y_all[holdout_size:]
X_test, y_test = X_all[:holdout_size], y_all[:holdout_size]
print('expected {}'.format(y_test))
expected [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
Now that the data is in place, the following code sets the hyperparameters, calls `fit` to train, and calls `predict` to make predictions. This code looks almost like what people would usually write with scikit-learn, except that it uses an enumeration `MyLR.enum.solver` that is implicitly defined by Lale, so users do not have to pass in error-prone strings for categorical hyperparameters.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
trainable = MyLR(MyLR.enum.solver.lbfgs, C=0.1)
trained = trainable.fit(X_train, y_train)
predictions = trained.predict(X_test)
print('actual {}'.format(predictions))
actual [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
To illustrate interactive documentation, the following code retrieves the specification of the `C` hyperparameter.
MyLR.hyperparam_schema('C')
{'default': 1.0, 'description': 'Inverse regularization strength. Smaller values specify stronger regularization.', 'type': 'number', 'minimum': 0.0, 'exclusiveMinimum': True, 'minimumForOptimizer': 0.03125, 'maximumForOptimizer': 32768, 'distribution': 'loguniform'}
Similarly, operator tags are reflected via Python methods on the operator:
print(MyLR.has_tag('interpretable'))
print(MyLR.get_tags())
True {'pre': ['~categoricals'], 'op': ['estimator', 'classifier', 'interpretable'], 'post': ['probabilities']}
To illustrate error-checking, the following code showcases an invalid hyperparameter caught by JSON schema validation.
import jsonschema, sys

try:
    MyLR(solver='adam')
except jsonschema.ValidationError as e:
    print(e.message, file=sys.stderr)
Invalid configuration for MyLR(solver='adam') due to invalid value solver=adam. Schema of argument solver: { "default": "liblinear", "description": "Algorithm for optimization problem.", "enum": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"], } Value: adam
Finally, to illustrate hyperparameter optimization, the following code uses hyperopt. We will document the hyperparameter optimization use-case in more detail elsewhere. Here, we only demonstrate that Lale with `MyLR` supports it.
from lale.search.op2hp import hyperopt_search_space
from hyperopt import STATUS_OK, Trials, fmin, tpe, space_eval
from sklearn.metrics import accuracy_score
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings("ignore", category=ConvergenceWarning)

def objective(hyperparams):
    del hyperparams['name']
    trainable = MyLR(**hyperparams)
    trained = trainable.fit(X_train, y_train)
    predictions = trained.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    return {'loss': -accuracy, 'status': STATUS_OK}

# The following line is enabled by the hyperparameter schema.
search_space = hyperopt_search_space(MyLR)

trials = Trials()
fmin(objective, search_space, algo=tpe.suggest, max_evals=10, trials=trials)
best_hyperparams = space_eval(search_space, trials.argmin)
print('best hyperparameter combination {}'.format(best_hyperparams))
100%|██████████| 10/10 [00:00<00:00, 54.54trial/s, best loss: -1.0] best hyperparameter combination {'C': 18098.51542502289, 'name': '__main__.MyLR', 'penalty': 'l2', 'solver': 'saga'}
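Alternatively, Lale ships a higher-level optimizer wrapper that consumes operators and their schemas directly, avoiding the hand-written objective function. A sketch, assuming the `lale.lib.lale.Hyperopt` operator with `estimator` and `max_evals` arguments is available in your version of Lale:

from lale.lib.lale import Hyperopt

# Search MyLR's hyperparameter space for 10 evaluations, then keep the best model.
optimizer = Hyperopt(estimator=MyLR, max_evals=10)
trained = optimizer.fit(X_train, y_train)
predictions = trained.predict(X_test)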
This concludes the running example. To summarize, we have learned how to write an operator implementation class and JSON schemas; how to register the Lale operator; and how to use the Lale operator for manual as well as automated machine learning.
Besides `X` and `y`, the `fit` method in scikit-learn sometimes has additional arguments. Lale also supports such additional arguments.
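For example, a hypothetical variant of `_MyLRImpl` could accept scikit-learn's optional `sample_weight` argument and forward it to the wrapped model (a sketch; `_MyLRImplWithWeights` is illustrative and not part of the running example):

class _MyLRImplWithWeights(_MyLRImpl):
    def fit(self, X, y, sample_weight=None):
        # Forward the additional argument to the wrapped scikit-learn model.
        self._wrapped_model.fit(X, y, sample_weight=sample_weight)
        return self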
In addition to the `__init__`, `fit`, and `predict` methods, many scikit-learn estimators also have a `predict_proba` method. Lale will support that with its own metadata schema.
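At the implementation level, delegating such a method is straightforward; here is a hypothetical sketch (`_MyLRImplWithProba` is illustrative, not part of the running example):

class _MyLRImplWithProba(_MyLRImpl):
    def predict_proba(self, X):
        # Delegate to the wrapped scikit-learn model; returns per-class probabilities.
        return self._wrapped_model.predict_proba(X)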
We encourage you to add your new operator to Lale. Take a look at the "Developer's Certificate of Origin", which can be found in DCO1.1.txt. Can you agree to the terms in the DCO, as well as to the license? If yes, check out the "For Developers" part at the end of the Installation instructions, and follow the steps to create a pull request.
This section documents features of JSON Schema that Lale uses, as well as extensions that Lale adds to JSON Schema for information specific to the machine-learning domain. For a more comprehensive introduction to JSON Schema, refer to its reference documentation.
The following table lists kinds of schemas in JSON Schema:

| Kind of schema | Corresponding type in Python/Lale |
| --- | --- |
| `null` | `NoneType`, value `None` |
| `boolean` | `bool`, values `True` or `False` |
| `string` | `str` |
| `enum` | see discussion below |
| `number` | `float`, e.g., `0.1` |
| `integer` | `int`, e.g., `42` |
| `array` | see discussion below |
| `object` | `dict` with string keys |
| `anyOf`, `allOf`, `not` | see discussion below |
The use of `null`, `boolean`, and `string` is fairly straightforward. The following paragraphs discuss the other kinds of schemas one by one.
In JSON Schema, an enum can contain assorted values including strings, numbers, or even `null`. Lale uses enums of strings for categorical hyperparameters, such as `'penalty': {'enum': ['l1', 'l2']}` in the earlier example. In that case, Lale also automatically declares a corresponding Python `enum`. When Lale uses enums of other types, it is usually to restrict a hyperparameter to a single value, such as `'enum': [None]`.
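For example, the `penalty` hyperparameter declared earlier can be configured through its automatically declared Python enum, equivalent to passing the string `'l2'`:

# Same pattern as MyLR.enum.solver.lbfgs used earlier in this document.
trainable = MyLR(penalty=MyLR.enum.penalty.l2, C=0.1)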
In schemas with `type` set to `number` or `integer`, JSON Schema lets users specify `minimum`, `maximum`, `exclusiveMinimum`, and `exclusiveMaximum`. Lale further extends JSON Schema with `minimumForOptimizer`, `maximumForOptimizer`, and `distribution`. Possible values for the `distribution` are `'uniform'` (the default) and `'loguniform'`. In the case of integers, Lale quantizes the distributions accordingly.
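For instance, a hypothetical integer hyperparameter could combine standard and Lale-specific keywords as follows, using the same `lale.schemas` helpers as before (this assumes `Int` accepts the same optimizer-related keywords as `Float`; the name `n_neighbors_sch` is illustrative):

from lale.schemas import Int

n_neighbors_sch = Int(
    desc='Number of neighbors to use.',
    minimum=1,                    # standard JSON Schema keyword
    minimumForOptimizer=1,        # Lale extension: lower bound of the search space
    maximumForOptimizer=100,      # Lale extension: upper bound of the search space
    distribution='uniform',       # Lale extension: sampling distribution
    default=5)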
Lale schemas for input and output data make heavy use of the JSON Schema `array` type. In this case, Lale schemas are intended to capture logical schemas, not physical representations, similarly to how relational databases hide physical representations behind a well-formalized abstraction layer. Therefore, Lale uses arrays from JSON Schema for several types in Python. The most obvious one is a Python `list`. Another common one is a numpy `ndarray`, where Lale uses nested arrays to represent each of the dimensions of a multi-dimensional array. Lale also has support for `pandas.DataFrame` and `pandas.Series`, for which it again uses JSON Schema arrays.
For arrays, JSON Schema lets users specify `items`, `minItems`, and `maxItems`. Lale further extends JSON Schema with `minItemsForOptimizer` and `maxItemsForOptimizer`. Furthermore, Lale supports a `laleType`, which can be `'Any'` to locally disable a subschema check, or `'tuple'` to support cases where the Python code requires a tuple instead of a list.
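As an illustration, here is a hypothetical raw schema for a hyperparameter that must be a pair of numbers; the `'laleType': 'tuple'` keyword tells Lale to pass a Python tuple rather than a list:

pair_schema = {
    'type': 'array',
    'laleType': 'tuple',
    'items': {'type': 'number'},
    'minItems': 2,
    'maxItems': 2,
}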
For objects, JSON Schema lets users specify a list `required` of properties that must be present, a dictionary `properties` of sub-schemas, and a flag `additionalProperties` to indicate whether the object can have additional properties beyond the keys of the `properties` dictionary. Lale further extends JSON Schema with a `relevantToOptimizer` list of properties that hyperparameter optimizers should search over.
For individual properties, Lale supports a `default`, which is inspired by and consistent with web API specification practice. It also supports a `forOptimizer` flag, which defaults to `True` but can be set to `False` to hide a particular subschema from the hyperparameter optimizer. For example, the number of components for PCA in scikit-learn can be specified as an integer or a floating point number, but an optimizer should only explore one of these choices. Lale also supports a Boolean flag `transient` that, if true, elides a hyperparameter during pretty-printing, visualization, or in JSON.
As discussed before, in JSON Schema, `allOf` is a logical "and", `anyOf` is a logical "or", and `not` is a logical negation. The running example from earlier already illustrated how to use these for implementing cross-cutting constraints. Another use-case that takes advantage of `anyOf` is expressing union types, which arise frequently in scikit-learn. For example, here is the schema for `n_components` from PCA:
n_components_sch = AnyOf(
    [Enum([None], desc='If not set, keep all components.'),
     Enum(['mle'], desc="Use Minka's MLE to guess the dimension."),
     Float(minimum=0.0, exclusiveMinimum=True,
           maximum=1.0, exclusiveMaximum=True,
           desc='Select the number of components such that the amount of variance '
                'that needs to be explained is greater than the specified percentage.'),
     Int(minimum=1, forOptimizer=False, desc='Number of components to keep.')],
    default=None)
ipython_display(n_components_sch.schema)
{
"default": null,
"anyOf": [
{"description": "If not set, keep all components.", "enum": [null]},
{
"description": "Use Minka's MLE to guess the dimension.",
"enum": ["mle"],
},
{
"description": "Select the number of components such that the amount of variance that needs to be explained is greater than the specified percentage.",
"type": "number",
"minimum": 0.0,
"exclusiveMinimum": true,
"maximum": 1.0,
"exclusiveMaximum": true,
},
{
"description": "Number of components to keep.",
"forOptimizer": false,
"type": "integer",
"minimum": 1,
},
],
}
We encourage users to make their schemas more readable by also including common JSON Schema metadata such as `$schema` and `description`. As seen in the examples in this document, Lale also extends JSON Schema with `tags` and `documentation_url`. Finally, in some cases, schema-internal duplication can be avoided by cross-references and linking. This is supported by off-the-shelf features of JSON Schema without requiring Lale-specific extensions.
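For example, standard JSON Schema `definitions` and `$ref` can factor out a shared subschema; this is a generic JSON Schema illustration rather than a Lale extension:

# Both the rows of 'X' and 'y' reference the same 'floatArray' definition.
schema_with_refs = {
    'definitions': {
        'floatArray': {'type': 'array', 'items': {'type': 'number'}},
    },
    'type': 'object',
    'properties': {
        'X': {'type': 'array', 'items': {'$ref': '#/definitions/floatArray'}},
        'y': {'$ref': '#/definitions/floatArray'},
    },
}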