# !/usr/bin/env python
# coding: utf-8

# # Wrapping New Individual Operators
#
# Lale comes with several library operators, so you do not need to write
# your own. But if you want to contribute new operators, this tutorial is
# for you. First let us review some basic concepts in Lale from the
# point of view of adding new operators (estimators and
# transformers). Lale is a library for semi-automated data science, and
# is designed for the following goals:
#
# * Automation: easy search and tuning of pipelines
# * Usability: scikit-learn compatible, plus types
# * Interoperability: support for Python building blocks and beyond
#
# To enable the above properties for your operators with Lale, you need to:
#
# 1. Write an operator implementation class with methods `__init__`,
#    `fit`, and `predict` or `transform`. If you have a custom estimator
#    or transformer as per scikit-learn, you can skip this step, as that
#    is already a valid Lale operator.
# 2. Register the operator implementation from Step 1 via
#    `lale.operators.make_operator`. This step automatically creates a
#    JSON schema skeleton for your operator.
# 3. Customize the hyperparameter JSON schema to indicate which
#    hyperparameters are expected by an operator, and to specify their
#    types, default values, and recommended minimum/maximum values for
#    automatic tuning. The hyperparameter schema can also encode
#    constraints indicating dependencies between hyperparameter values,
#    such as solver `abc` only supports penalty `xyz`.
# 4. Optionally, customize the schemas for input and output datasets,
#    and enable transparent access to custom methods and properties of
#    your implementation.
# 5. Test and use the new operator, for instance, for training or
#    hyperparameter optimization.
# 6. Consider contributing your new operator to the Lale open-source
#    project.
#
# The next sections illustrate these six steps using an example. After
# the example-driven sections, this document concludes with a reference
# covering features from the example and beyond. This document focuses
# on individual operators. Pipelines that compose multiple operators are
# documented elsewhere.

# ## 1. Create a New Operator
#
# You can skip this section if you already have a scikit-learn compatible
# estimator or transformer class with methods `__init__`, `fit`, and
# `predict` or `transform`. Any other compatibility with scikit-learn,
# such as `get_params` or `set_params`, is optional, and so is extending
# from `sklearn.base.BaseEstimator`.
#
# This section illustrates how to implement this class with the help of
# an example. The running example in this document is a simple custom
# operator that just wraps the `LogisticRegression` estimator from
# scikit-learn. Of course, you can write a similar class to wrap your own
# operators, which do not need to come from scikit-learn. The following
# code defines a class `_MyLRImpl`.

# In[1]:

import sklearn.linear_model

class _MyLRImpl:
    def __init__(self, **hyperparams):
        self._wrapped_model = sklearn.linear_model.LogisticRegression(
            **hyperparams)

    def fit(self, X, y):
        self._wrapped_model.fit(X, y)
        return self

    def predict(self, X, **kwargs):
        return self._wrapped_model.predict(X, **kwargs)

# This code first imports the relevant scikit-learn package. Then, it declares
# a new class for wrapping it. Currently, Lale only supports Python, but
# eventually, it will also support other programming languages. Therefore, the
# Lale approach for wrapping new operators carefully avoids depending too much
# on the Python language or any particular Python library. Hence, the
# `_MyLRImpl` class does not need to inherit from anything, but it does need
# to follow certain conventions:
#
# * It has a constructor, `__init__`, whose arguments are the
#   hyperparameters.
#
# * It has a training method, `fit`, with an argument `X` containing the
#   training examples and, in the case of supervised models, an argument `y`
#   containing labels. The `fit` method trains the wrapped scikit-learn
#   `LogisticRegression` operator, which the constructor created, and returns
#   the wrapper object.
#
# * It has a prediction method, `predict` for an estimator or `transform` for
#   a transformer. The method has an argument `X` containing the test examples
#   and returns the labels for `predict` or the transformed data for
#   `transform`.
#
# These conventions are designed to be similar to those of scikit-learn.
# However, they avoid a code dependency upon scikit-learn.
#
# Note that in a simple example like this, the underlying
# `sklearn.linear_model.LogisticRegression` class could be used directly,
# without needing the `_MyLRImpl` wrapper. However, creating such a wrapper
# is useful for more complicated examples.
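# Because `_MyLRImpl` follows the scikit-learn conventions, it can already be
# exercised stand-alone, before any Lale registration. The following is a
# minimal sanity check; the toy dataset is made up purely for illustration.

# In[ ]:

# Stand-alone sanity check of _MyLRImpl on made-up toy data.
toy_X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
toy_y = [0, 0, 1, 1]
impl = _MyLRImpl(solver='liblinear')  # hyperparams flow to LogisticRegression
print(impl.fit(toy_X, toy_y).predict(toy_X))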
# ## 2. Register a New Lale Operator
#
# We can now register `_MyLRImpl` as a new Lale operator `MyLR`.

# In[2]:

import lale.operators

MyLR = lale.operators.make_operator(_MyLRImpl)

# The call to `make_operator` automatically creates a skeleton JSON schema for
# the hyperparameters and the operator methods and attaches it to `MyLR`.

# In[3]:

from lale.pretty_print import ipython_display
ipython_display(MyLR._schemas)

# ## 3. Customize Hyperparameter Schema
#
# Lale requires schemas both for error-checking and for generating search
# spaces for hyperparameter optimization.
# The schemas of a Lale operator specify the space of valid values for
# hyperparameters, for the arguments to `fit` and `predict` or `transform`,
# and for the output of `predict` or `transform`. To keep the schemas
# independent of the Python programming language, they are expressed as
# [JSON Schema](https://json-schema.org/understanding-json-schema/reference/).
# JSON Schema is currently a draft standard that is already widely
# adopted and implemented, for instance, as part of specifying
# [Swagger APIs](https://www.openapis.org/).
#
# The schema of a Lale operator can be incrementally customized using the
# `customize_schema` method, which returns a copy of the operator with the
# customized schema. The `customize_schema` method also validates the new
# schema for early error reporting.
#
# Instead of manually writing the schemas, which can be error-prone, we
# provide a dedicated API to help with authoring operator schemas.
#
# The running example chooses hyperparameters of scikit-learn
# LogisticRegression that illustrate all the interesting cases. More complete
# and elaborate examples can be found in the Lale standard library. The
# following specifies each hyperparameter one at a time, omitting
# cross-cutting constraints.

# In[4]:

from lale.schemas import Null, Enum, Int, Float, Object, Array, Not, AnyOf

MyLR = MyLR.customize_schema(
    relevantToOptimizer=['solver', 'penalty', 'C'],
    solver=Enum(desc='Algorithm for optimization problem.',
                values=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                default='liblinear'),
    penalty=Enum(desc='Norm used in the penalization.',
                 values=['l1', 'l2'],
                 default='l2'),
    C=Float(desc='Inverse regularization strength. '
                 'Smaller values specify stronger regularization.',
            minimum=0.0, exclusiveMinimum=True,
            minimumForOptimizer=0.03125, maximumForOptimizer=32768,
            distribution='loguniform',
            default=1.0))

# Here, `solver` and `penalty` are categorical hyperparameters and `C` is a
# continuous hyperparameter. For all three hyperparameters, the schema
# includes a description, used for interactive documentation, and a
# default value, used when no explicit value is specified. The categorical
# hyperparameters are then specified as enumerations of their legal values.
# In contrast, the continuous hyperparameter is a number, and the schema
# includes additional information such as its distribution, minimum, and
# maximum. In the example, `C` has `'minimum': 0.0`, indicating that only
# positive values are valid. Furthermore, `C` has a
# `'minimumForOptimizer': 0.03125` and `'maximumForOptimizer': 32768`,
# guiding the optimizer to limit its search space.
#
# ### Constraints
#
# Besides specifying hyperparameters one at a time, users may also want to
# specify cross-cutting constraints to further restrict the hyperparameter
# schema. This part is an advanced use case and can be skipped by novice
# users.

# In[5]:

MyLR = MyLR.customize_schema(
    constraint=AnyOf([
        Object(solver=Not(Enum(['newton-cg', 'sag', 'lbfgs']))),
        Object(penalty=Enum(['l2']))]))

# In JSON schema, `allOf` is a logical "and", `anyOf` is a logical "or", and
# `not` is a logical negation. Thus, the `anyOf` part of the example can be
# read as
#
# ```python
# assert not (solver in ['newton-cg', 'sag', 'lbfgs']) or penalty == 'l2'
# ```
#
# By standard Boolean rules, this is equivalent to a logical implication:
#
# ```python
# if solver in ['newton-cg', 'sag', 'lbfgs']:
#     assert penalty == 'l2'
# ```
#
# The complete hyperparameter schema simply combines the ranges with the
# constraints:

# In[6]:

ipython_display(MyLR.hyperparam_schema())
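# To see the constraint in action, the following sketch deliberately violates
# it; assuming hyperparameters are validated at construction time (as
# demonstrated in Section 5 below), this combination is rejected.

# In[ ]:

import jsonschema, sys
try:
    MyLR(solver='sag', penalty='l1')  # violates the cross-cutting constraint
except jsonschema.ValidationError as e:
    print(e.message, file=sys.stderr)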
# ## 4. Customize Fit and Predict Schemas
#
# The next step is to specify the expected input and output types of the
# methods `fit`, and `predict` or `transform`.
#
# The `fit` method of `MyLR` takes two arguments, `X` and `y`. The `X`
# argument is an array of arrays of numbers. The outer array is over samples
# (rows) of a dataset. The inner array is over features (columns) of a
# sample. The `y` argument is an array of non-negative numbers. Each element
# of `y` is a label for the corresponding sample in `X`.

# In[7]:

MyLR = MyLR.customize_schema(
    input_fit=Object(
        required=['X', 'y'],
        additionalProperties=False,
        X=Array(items=Array(items=Float())),
        y=Array(items=Float())))

# The schema for the arguments of the `predict` method is similar, just
# omitting `y`:

# In[8]:

MyLR = MyLR.customize_schema(
    input_predict=Object(
        required=['X'],
        X=Array(items=Array(items=Float())),
        additionalProperties=False))

# The output schema indicates that the `predict` method returns an array of
# labels with the same schema as `y`:

# In[9]:

MyLR = MyLR.customize_schema(output_predict=Array(items=Float()))

# ## Transparent Method/Property Forwarding
#
# Schemas can request transparent forwarding of specified methods and
# properties to the implementation.
# For example, to add support for calling `intercept_` directly on a `MyLR`
# object:

# In[ ]:

MyLR = MyLR.customize_schema(
    forwards=["intercept_"])

# This is supported only for methods and properties whose names do not
# conflict with names that Lale uses in its operator wrapper classes.
# Methods and properties can still always be called via `MyLR.impl.method()`.
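# A quick illustrative check of the forwarding declared above, on a tiny
# made-up dataset; this assumes the wrapped `LogisticRegression` exposes
# `intercept_` after training, which the forwarding declaration surfaces on
# the Lale operator.

# In[ ]:

# Train on toy data, then read the learned intercept through the wrapper.
fwd_trained = MyLR().fit([[0.0], [1.0], [2.0], [3.0]], [0.0, 0.0, 1.0, 1.0])
print(fwd_trained.intercept_)  # forwarded, per the schema above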
# ## Tags
#
# Finally, we can add tags for discovery and documentation.

# In[10]:

MyLR = MyLR.customize_schema(
    tags={'pre': ['~categoricals'],
          'op': ['estimator', 'classifier', 'interpretable'],
          'post': ['probabilities']})

# We now have complete JSON schemas for our `MyLR` operator.

# In[11]:

ipython_display(MyLR._schemas)

# ## 5. Testing and Using the New Operator
#
# Once your operator implementation and schema definitions are ready,
# you can test the operator with Lale as follows. First, you will need to
# install Lale, as described in the
# [installation](../../master/docs/installation.md) instructions.

# ### Use the New Operator
#
# Before demonstrating the new `MyLR` operator, the following code loads the
# Iris dataset, which comes out-of-the-box with scikit-learn.

# In[12]:

import sklearn.datasets
import sklearn.utils

iris = sklearn.datasets.load_iris()
X_all, y_all = sklearn.utils.shuffle(
    iris.data, iris.target, random_state=42)
holdout_size = 30
X_train, y_train = X_all[holdout_size:], y_all[holdout_size:]
X_test, y_test = X_all[:holdout_size], y_all[:holdout_size]
print('expected {}'.format(y_test))

# Now that the data is in place, the following code sets the hyperparameters,
# calls `fit` to train, and calls `predict` to make predictions. This code
# looks almost like what people would usually write with scikit-learn, except
# that it uses an enumeration `MyLR.enum.solver` that is implicitly defined
# by Lale, so users do not have to pass in error-prone strings for
# categorical hyperparameters.

# In[13]:

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

trainable = MyLR(MyLR.enum.solver.lbfgs, C=0.1)
trained = trainable.fit(X_train, y_train)
predictions = trained.predict(X_test)
print('actual {}'.format(predictions))

# To illustrate interactive documentation, the following code retrieves the
# specification of the `C` hyperparameter.

# In[14]:

MyLR.hyperparam_schema('C')

# Similarly, operator tags are reflected via Python methods on the operator:

# In[15]:

print(MyLR.has_tag('interpretable'))
print(MyLR.get_tags())

# To illustrate error-checking, the following code showcases an invalid
# hyperparameter caught by JSON schema validation.

# In[16]:

import jsonschema, sys
try:
    MyLR(solver='adam')
except jsonschema.ValidationError as e:
    print(e.message, file=sys.stderr)

# Finally, to illustrate hyperparameter optimization, the following code uses
# [hyperopt](http://hyperopt.github.io/hyperopt/). We will document the
# hyperparameter optimization use-case in more detail elsewhere. Here we only
# demonstrate that Lale with `MyLR` supports it.

# In[17]:

from lale.search.op2hp import hyperopt_search_space
from hyperopt import STATUS_OK, Trials, fmin, tpe, space_eval
from sklearn.metrics import accuracy_score
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

def objective(hyperparams):
    del hyperparams['name']
    trainable = MyLR(**hyperparams)
    trained = trainable.fit(X_train, y_train)
    predictions = trained.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    return {'loss': -accuracy, 'status': STATUS_OK}

# The following line is enabled by the hyperparameter schema.
search_space = hyperopt_search_space(MyLR)

trials = Trials()
fmin(objective, search_space, algo=tpe.suggest, max_evals=10, trials=trials)
best_hyperparams = space_eval(search_space, trials.argmin)
print('best hyperparameter combination {}'.format(best_hyperparams))
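# As an illustrative follow-up (reusing the names from the cell above), the
# best hyperparameters found by hyperopt can be used to retrain the operator
# and evaluate it on the holdout set.

# In[ ]:

best = dict(best_hyperparams)
best.pop('name', None)  # drop the operator-name key, as in `objective`
retrained = MyLR(**best).fit(X_train, y_train)
print('holdout accuracy {:.3f}'.format(
    accuracy_score(y_test, retrained.predict(X_test))))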
# This concludes the running example. To summarize, we have learned how to
# write an operator implementation class and JSON schemas; how to register
# the Lale operator; and how to use the Lale operator for manual as well as
# automated machine learning.
#
# ### Additional Wrapper Class Features
#
# Besides `X` and `y`, the `fit` method in scikit-learn sometimes has
# additional arguments. Lale also supports such additional arguments.
#
# In addition to the `__init__`, `fit`, and `predict` methods, many
# scikit-learn estimators also have a `predict_proba` method. Lale will
# support that with its own metadata schema.
#
# ## 6. Consider Contributing to the Lale Open-Source Project
#
# We encourage you to add your new operator to Lale. Take a look at the
# "Developer's Certificate of Origin", which can be found in
# [DCO1.1.txt](https://github.com/IBM/lale/blob/master/DCO1.1.txt).
# Can you agree to the terms in the DCO, as well as to the
# [license](https://github.com/IBM/lale/blob/master/LICENSE.txt)?
# If yes, check out the "For Developers" part at the end of the
# [Installation instructions](https://github.com/IBM/lale/blob/master/docs/installation.rst),
# and follow the steps to create a pull request.
#
# ## 7. Reference
#
# This section documents features of JSON Schema that Lale uses, as well as
# extensions that Lale adds to JSON Schema for information specific to the
# machine-learning domain. For a more comprehensive introduction to JSON
# Schema, refer to its
# [reference](https://json-schema.org/understanding-json-schema/reference/).
#
# The following table lists kinds of schemas in JSON Schema:
#
# | Kind of schema          | Corresponding type in Python/Lale |
# | ----------------------- | --------------------------------- |
# | `null`                  | `NoneType`, value `None`          |
# | `boolean`               | `bool`, values `True` or `False`  |
# | `string`                | `str`                             |
# | `enum`                  | See discussion below.             |
# | `number`                | `float`, e.g., `0.1`              |
# | `integer`               | `int`, e.g., `42`                 |
# | `array`                 | See discussion below.             |
# | `object`                | `dict` with string keys           |
# | `anyOf`, `allOf`, `not` | See discussion below.             |
#
# The use of `null`, `boolean`, and `string` is fairly straightforward. The
# following paragraphs discuss the other kinds of schemas one by one.
#
# ### 7.1. enum
#
# In JSON Schema, an enum can contain assorted values including strings,
# numbers, or even `null`. Lale uses enums of strings for categorical
# hyperparameters, such as `'penalty': {'enum': ['l1', 'l2']}` in the earlier
# example. In that case, Lale also automatically declares a corresponding
# Python `enum`.
# When Lale uses enums of other types, it is usually to restrict a
# hyperparameter to a single value, such as `'enum': [None]`.
#
# ### 7.2. number, integer
#
# In schemas with `type` set to `number` or `integer`, JSON schema lets users
# specify `minimum`, `maximum`, `exclusiveMinimum`, and `exclusiveMaximum`.
# Lale further extends JSON schema with `minimumForOptimizer`,
# `maximumForOptimizer`, and `distribution`. Possible values for the
# `distribution` are `'uniform'` (the default) and `'loguniform'`. In the
# case of integers, Lale quantizes the distributions accordingly.
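# The following sketch shows these keywords on an integer hyperparameter,
# using the same `lale.schemas` helpers as in Section 3; the name and bounds
# are made up purely for illustration.

# In[ ]:

max_iter_sch = Int(desc='Maximum number of iterations (illustrative).',
                   minimum=1,
                   minimumForOptimizer=10, maximumForOptimizer=1000,
                   distribution='uniform', default=100)
ipython_display(max_iter_sch.schema)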
# ### 7.3. array
#
# Lale schemas for input and output data make heavy use of the JSON Schema
# `array` type. In this case, Lale schemas are intended to capture logical
# schemas, not physical representations, similarly to how relational
# databases hide physical representations behind a well-formalized
# abstraction layer. Therefore, Lale uses arrays from JSON Schema for several
# types in Python. The most obvious one is a Python `list`. Another common
# one is a numpy
# [ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html),
# where Lale uses nested arrays to represent each of the dimensions of a
# multi-dimensional array. Lale also has support for `pandas.DataFrame` and
# `pandas.Series`, for which it again uses JSON Schema arrays.
#
# For arrays, JSON schema lets users specify `items`, `minItems`, and
# `maxItems`. Lale further extends JSON schema with `minItemsForOptimizer`
# and `maxItemsForOptimizer`. Furthermore, Lale supports a `laleType`, which
# can be `'Any'` to locally disable a subschema check, or `'tuple'` to
# support cases where the Python code requires a tuple instead of a list.
#
# ### 7.4. object
#
# For objects, JSON schema lets users specify a list `required` of properties
# that must be present, a dictionary `properties` of sub-schemas, and a flag
# `additionalProperties` to indicate whether the object can have additional
# properties beyond the keys of the `properties` dictionary. Lale further
# extends JSON schema with a `relevantToOptimizer` list of properties that
# hyperparameter optimizers should search over.
#
# For individual properties, Lale supports a `default`, which is inspired by
# and consistent with web API specification practice. It also supports a
# `forOptimizer` flag, which defaults to `True` but can be set to `False` to
# hide a particular subschema from the hyperparameter optimizer. For example,
# the number of components for PCA in scikit-learn can be specified as an
# integer or a floating point number, but an optimizer should only explore
# one of these choices. Lale also supports a Boolean flag `transient` that,
# if true, elides a hyperparameter during pretty-printing, visualization, or
# serialization to JSON.
#
# ### 7.5. allOf, anyOf, not
#
# As discussed before, in JSON schema, `allOf` is a logical "and", `anyOf` is
# a logical "or", and `not` is a logical negation. The running example from
# earlier already illustrated how to use these for implementing cross-cutting
# constraints. Another use-case that takes advantage of `anyOf` is for
# expressing union types, which arise frequently in scikit-learn. For
# example, here is the schema for `n_components` from PCA:

# In[18]:

n_components_sch = AnyOf(
    [Enum([None], desc='If not set, keep all components.'),
     Enum(['mle'], desc="Use Minka's MLE to guess the dimension."),
     Float(minimum=0.0, exclusiveMinimum=True,
           maximum=1.0, exclusiveMaximum=True,
           desc='Select the number of components such that the amount of '
                'variance that needs to be explained is greater than the '
                'specified percentage.'),
     Int(minimum=1, forOptimizer=False,
         desc='Number of components to keep.')],
    default=None)

ipython_display(n_components_sch.schema)
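# Applying the same idea to the running example: scikit-learn's
# LogisticRegression also accepts a `class_weight` hyperparameter that is
# either `None`, the string `'balanced'`, or a dict. The following sketch
# covers the first two cases, omitting the dict case for brevity.

# In[ ]:

MyLR = MyLR.customize_schema(
    class_weight=AnyOf(
        [Enum([None], desc='All classes have weight one.'),
         Enum(['balanced'], desc='Adjust weights inversely proportional to '
                                 'class frequencies in the input data.')],
        default=None))
ipython_display(MyLR.hyperparam_schema('class_weight'))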
# ### 7.6. Schema Metadata
#
# We encourage users to make their schemas more readable by also including
# common JSON schema metadata such as `$schema` and `description`. As seen in
# the examples in this document, Lale also extends JSON schema with `tags`
# and `documentation_url`. Finally, in some cases, schema-internal
# duplication can be avoided by cross-references and linking. This is
# supported by off-the-shelf features of JSON schema without requiring
# Lale-specific extensions.
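# For instance, standard JSON schema `$ref` cross-references can factor out a
# repeated sub-schema; the following sketch (with made-up names) defines a
# `row` schema once and links to it twice.

# In[ ]:

linked_schema = {
    'definitions': {
        'row': {'type': 'array', 'items': {'type': 'number'}}},
    'type': 'object',
    'properties': {
        'X': {'type': 'array', 'items': {'$ref': '#/definitions/row'}},
        'coefficients': {'$ref': '#/definitions/row'}}}
ipython_display(linked_schema)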