# !/usr/bin/env python
# coding: utf-8

# # Wrapping New Individual Operators
#
# Lale comes with several library operators, so you do not need to write
# your own. But if you want to contribute new operators, this tutorial is
# for you. First let us review some basic concepts in Lale from the
# point of view of adding new operators (estimators and
# transformers). Lale is a library for semi-automated data science, and
# is designed for the following goals:
#
# * Automation: easy search and tuning of pipelines
# * Usability: scikit-learn compatible, plus types
# * Interoperability: support for Python building blocks and beyond
#
# To enable the above properties for your operators with Lale, you need to:
#
# 1. Write an operator implementation class with methods `__init__`,
#    `fit`, and `predict` or `transform`. If you have a custom estimator
#    or transformer as per scikit-learn, you can skip this step, as that
#    is already a valid Lale operator.
# 2. Register the operator implementation from Step 1 via
#    `lale.operators.make_operator`. This step automatically creates a
#    JSON schema skeleton for your operator.
# 3. Customize the hyperparameter JSON schema to indicate which
#    hyperparameters are expected by an operator, and to specify their
#    types, default values, and recommended minimum/maximum values for
#    automatic tuning. The hyperparameter schema can also encode
#    constraints indicating dependencies between hyperparameter values,
#    such as solver `abc` only supports penalty `xyz`.
# 4. Optionally, customize the schemas for input and output datasets,
#    and enable transparent access to custom methods and properties of
#    your implementation.
# 5. Test and use the new operator, for instance, for training or
#    hyperparameter optimization.
# 6. Consider contributing your new operator to the Lale open-source
#    project.
#
# The next sections illustrate these six steps using an example. After
# the example-driven sections, this document concludes with a reference
# covering features from the example and beyond. This document focuses
# on individual operators. Pipelines that compose multiple operators are
# documented elsewhere.

# ## 1. Create a New Operator
#
# You can skip this section if you already have a scikit-learn compatible
# estimator or transformer class with methods `__init__`, `fit`, and
# `predict` or `transform`. Any other compatibility with scikit-learn,
# such as `get_params` or `set_params`, is optional, and so is extending
# from `sklearn.base.BaseEstimator`.
#
# This section illustrates how to implement this class with the help of
# an example. The running example in this document is a simple custom
# operator that just wraps the `LogisticRegression` estimator from
# scikit-learn. Of course, you can write a similar class to wrap your own
# operators, which do not need to come from scikit-learn. The following
# code defines a class `_MyLRImpl`.

# In[1]:

import sklearn.linear_model

class _MyLRImpl:
    def __init__(self, **hyperparams):
        self._wrapped_model = sklearn.linear_model.LogisticRegression(
            **hyperparams)

    def fit(self, X, y):
        self._wrapped_model.fit(X, y)
        return self

    def predict(self, X, **kwargs):
        return self._wrapped_model.predict(X, **kwargs)

# This code first imports the relevant scikit-learn package. Then, it declares
# a new class for wrapping it. Currently, Lale only supports Python, but
# eventually, it will also support other programming languages. Therefore, the
# Lale approach for wrapping new operators carefully avoids depending too much
# on the Python language or any particular Python library. Hence, the
# `_MyLRImpl` class does not need to inherit from anything, but it does need
# to follow certain conventions:
#
# * It has a constructor, `__init__`, whose arguments are the
#   hyperparameters.
#
# * It has a training method, `fit`, with an argument `X` containing the
#   training examples and, in the case of supervised models, an argument `y`
#   containing labels. The `fit` method trains the wrapped scikit-learn
#   `LogisticRegression` operator, which the constructor created, and returns
#   the wrapper object.
#
# * It has a prediction method, `predict` for an estimator or `transform` for
#   a transformer. The method has an argument `X` containing the test examples
#   and returns the labels for `predict` or the transformed data for
#   `transform`.
#
# These conventions are designed to be similar to those of scikit-learn.
# However, they avoid a code dependency upon scikit-learn.
#
# Note that in a simple example like this, the underlying
# `sklearn.linear_model.LogisticRegression` class could be used directly,
# without needing the `_MyLRImpl` wrapper. However, creating such a wrapper
# is useful for more complicated examples.
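# Because `_MyLRImpl` follows the scikit-learn conventions, it can already be
# exercised stand-alone, before any Lale registration. The following is a
# minimal sanity check; the toy dataset is made up purely for illustration.

# In[ ]:

# Stand-alone sanity check of _MyLRImpl on made-up toy data.
toy_X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
toy_y = [0, 0, 1, 1]
impl = _MyLRImpl(solver='liblinear')  # hyperparams flow to LogisticRegression
print(impl.fit(toy_X, toy_y).predict(toy_X))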
# ## 2. Register a New Lale Operator
#
# We can now register `_MyLRImpl` as a new Lale operator `MyLR`.

# In[2]:

import lale.operators

MyLR = lale.operators.make_operator(_MyLRImpl)

# The call to `make_operator` automatically creates a skeleton JSON schema for
# the hyperparameters and the operator methods and attaches it to `MyLR`.

# In[3]:

from lale.pretty_print import ipython_display
ipython_display(MyLR._schemas)

# ## 3. Customize Hyperparameter Schema
#
# Lale requires schemas both for error-checking and for generating search
# spaces for hyperparameter optimization.
# The schemas of a Lale operator specify the space of valid values for
# hyperparameters, for the arguments to `fit` and `predict` or `transform`,
# and for the output of `predict` or `transform`. To keep the schemas
# independent of the Python programming language, they are expressed as
# [JSON Schema](https://json-schema.org/understanding-json-schema/reference/).
# JSON Schema is currently a draft standard that is already widely
# adopted and implemented, for instance, as part of specifying
# [Swagger APIs](https://www.openapis.org/).
#
# The schema of a Lale operator can be incrementally customized using the
# `customize_schema` method, which returns a copy of the operator with the
# customized schema. The `customize_schema` method also validates the new
# schema for early error reporting.
#
# Instead of manually writing the schemas, which can be error-prone, we
# provide a dedicated API to help with authoring operator schemas.
#
# The running example chooses hyperparameters of scikit-learn
# LogisticRegression that illustrate all the interesting cases. More complete
# and elaborate examples can be found in the Lale standard library. The
# following specifies each hyperparameter one at a time, omitting
# cross-cutting constraints.

# In[4]:

from lale.schemas import Null, Enum, Int, Float, Object, Array, Not, AnyOf

MyLR = MyLR.customize_schema(
    relevantToOptimizer=['solver', 'penalty', 'C'],
    solver=Enum(desc='Algorithm for optimization problem.',
                values=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                default='liblinear'),
    penalty=Enum(desc='Norm used in the penalization.',
                 values=['l1', 'l2'],
                 default='l2'),
    C=Float(desc='Inverse regularization strength. '
                 'Smaller values specify stronger regularization.',
            minimum=0.0, exclusiveMinimum=True,
            minimumForOptimizer=0.03125, maximumForOptimizer=32768,
            distribution='loguniform',
            default=1.0))

# Here, `solver` and `penalty` are categorical hyperparameters and `C` is a
# continuous hyperparameter. For all three hyperparameters, the schema
# includes a description, used for interactive documentation, and a
# default value, used when no explicit value is specified. The categorical
# hyperparameters are then specified as enumerations of their legal values.
# In contrast, the continuous hyperparameter is a number, and the schema
# includes additional information such as its distribution, minimum, and
# maximum. In the example, `C` has `'minimum': 0.0`, indicating that only
# positive values are valid. Furthermore, `C` has a
# `'minimumForOptimizer': 0.03125` and `'maximumForOptimizer': 32768`,
# guiding the optimizer to limit its search space.
#
# ### Constraints
#
# Besides specifying hyperparameters one at a time, users may also want to
# specify cross-cutting constraints to further restrict the hyperparameter
# schema. This part is an advanced use case and can be skipped by novice
# users.

# In[5]:

MyLR = MyLR.customize_schema(
    constraint=AnyOf([
        Object(solver=Not(Enum(['newton-cg', 'sag', 'lbfgs']))),
        Object(penalty=Enum(['l2']))]))

# In JSON schema, `allOf` is a logical "and", `anyOf` is a logical "or", and
# `not` is a logical negation. Thus, the `anyOf` part of the example can be
# read as
#
# ```python
# assert not (solver in ['newton-cg', 'sag', 'lbfgs']) or penalty == 'l2'
# ```
#
# By standard Boolean rules, this is equivalent to a logical implication:
#
# ```python
# if solver in ['newton-cg', 'sag', 'lbfgs']:
#     assert penalty == 'l2'
# ```
#
# The complete hyperparameter schema simply combines the ranges with the
# constraints:

# In[6]:

ipython_display(MyLR.hyperparam_schema())
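# To see the constraint in action, the following sketch deliberately violates
# it; assuming hyperparameters are validated at construction time (as
# demonstrated in Section 5 below), this combination is rejected.

# In[ ]:

import jsonschema, sys
try:
    MyLR(solver='sag', penalty='l1')  # violates the cross-cutting constraint
except jsonschema.ValidationError as e:
    print(e.message, file=sys.stderr)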
# ## 4. Customize Fit and Predict Schemas
#
# The next step is to specify the expected input and output types of the
# methods `fit`, and `predict` or `transform`.
#
# The `fit` method of `MyLR` takes two arguments, `X` and `y`. The `X`
# argument is an array of arrays of numbers. The outer array is over samples
# (rows) of a dataset. The inner array is over features (columns) of a
# sample. The `y` argument is an array of non-negative numbers. Each element
# of `y` is a label for the corresponding sample in `X`.

# In[7]:

MyLR = MyLR.customize_schema(
    input_fit=Object(
        required=['X', 'y'],
        additionalProperties=False,
        X=Array(items=Array(items=Float())),
        y=Array(items=Float())))

# The schema for the arguments of the `predict` method is similar, just
# omitting `y`:

# In[8]:

MyLR = MyLR.customize_schema(
    input_predict=Object(
        required=['X'],
        X=Array(items=Array(items=Float())),
        additionalProperties=False))

# The output schema indicates that the `predict` method returns an array of
# labels with the same schema as `y`:

# In[9]:

MyLR = MyLR.customize_schema(output_predict=Array(items=Float()))

# ## Transparent Method/Property Forwarding
#
# Schemas can request transparent forwarding of specified methods and
# properties to the implementation.
# For example, to add support for calling `intercept_` directly on a `MyLR`
# object:

# In[ ]:

MyLR = MyLR.customize_schema(
    forwards=["intercept_"])

# This is supported only for methods and properties whose names do not
# conflict with names that Lale uses in its operator wrapper classes.
# Methods and properties can still always be called via `MyLR.impl.method()`.
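# A quick illustrative check of the forwarding declared above, on a tiny
# made-up dataset; this assumes the wrapped `LogisticRegression` exposes
# `intercept_` after training, which the forwarding declaration surfaces on
# the Lale operator.

# In[ ]:

# Train on toy data, then read the learned intercept through the wrapper.
fwd_trained = MyLR().fit([[0.0], [1.0], [2.0], [3.0]], [0.0, 0.0, 1.0, 1.0])
print(fwd_trained.intercept_)  # forwarded, per the schema above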
# ## Tags
#
# Finally, we can add tags for discovery and documentation.

# In[10]:

MyLR = MyLR.customize_schema(
    tags={'pre': ['~categoricals'],
          'op': ['estimator', 'classifier', 'interpretable'],
          'post': ['probabilities']})

# We now have complete JSON schemas for our `MyLR` operator.

# In[11]:

ipython_display(MyLR._schemas)

# ## 5. Testing and Using the New Operator
#
# Once your operator implementation and schema definitions are ready,
# you can test the operator with Lale as follows. First, you will need to
# install Lale, as described in the
# [installation](../../master/docs/installation.md) instructions.

# ### Use the New Operator
#
# Before demonstrating the new `MyLR` operator, the following code loads the
# Iris dataset, which comes out-of-the-box with scikit-learn.

# In[12]:

import sklearn.datasets
import sklearn.utils

iris = sklearn.datasets.load_iris()
X_all, y_all = sklearn.utils.shuffle(
    iris.data, iris.target, random_state=42)
holdout_size = 30
X_train, y_train = X_all[holdout_size:], y_all[holdout_size:]
X_test, y_test = X_all[:holdout_size], y_all[:holdout_size]
print('expected {}'.format(y_test))

# Now that the data is in place, the following code sets the hyperparameters,
# calls `fit` to train, and calls `predict` to make predictions. This code
# looks almost like what people would usually write with scikit-learn, except
# that it uses an enumeration `MyLR.enum.solver` that is implicitly defined
# by Lale, so users do not have to pass in error-prone strings for
# categorical hyperparameters.

# In[13]:

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

trainable = MyLR(MyLR.enum.solver.lbfgs, C=0.1)
trained = trainable.fit(X_train, y_train)
predictions = trained.predict(X_test)
print('actual {}'.format(predictions))

# To illustrate interactive documentation, the following code retrieves the
# specification of the `C` hyperparameter.

# In[14]:

MyLR.hyperparam_schema('C')

# Similarly, operator tags are reflected via Python methods on the operator:

# In[15]:

print(MyLR.has_tag('interpretable'))
print(MyLR.get_tags())

# To illustrate error-checking, the following code showcases an invalid
# hyperparameter caught by JSON schema validation.

# In[16]:

import jsonschema, sys
try:
    MyLR(solver='adam')
except jsonschema.ValidationError as e:
    print(e.message, file=sys.stderr)

# Finally, to illustrate hyperparameter optimization, the following code uses
# [hyperopt](http://hyperopt.github.io/hyperopt/). We will document the
# hyperparameter optimization use-case in more detail elsewhere. Here we only
# demonstrate that Lale with `MyLR` supports it.

# In[17]:

from lale.search.op2hp import hyperopt_search_space
from hyperopt import STATUS_OK, Trials, fmin, tpe, space_eval
from sklearn.metrics import accuracy_score
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

def objective(hyperparams):
    del hyperparams['name']
    trainable = MyLR(**hyperparams)
    trained = trainable.fit(X_train, y_train)
    predictions = trained.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    return {'loss': -accuracy, 'status': STATUS_OK}

# The following line is enabled by the hyperparameter schema.
search_space = hyperopt_search_space(MyLR)

trials = Trials()
fmin(objective, search_space, algo=tpe.suggest, max_evals=10, trials=trials)
best_hyperparams = space_eval(search_space, trials.argmin)
print('best hyperparameter combination {}'.format(best_hyperparams))
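# As an illustrative follow-up (reusing the names from the cell above), the
# best hyperparameters found by hyperopt can be used to retrain the operator
# and evaluate it on the holdout set.

# In[ ]:

best = dict(best_hyperparams)
best.pop('name', None)  # drop the operator-name key, as in `objective`
retrained = MyLR(**best).fit(X_train, y_train)
print('holdout accuracy {:.3f}'.format(
    accuracy_score(y_test, retrained.predict(X_test))))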
# This concludes the running example. To summarize, we have learned how to
# write an operator implementation class and JSON schemas; how to register
# the Lale operator; and how to use the Lale operator for manual as well as
# automated machine learning.
#
# ### Additional Wrapper Class Features
#
# Besides `X` and `y`, the `fit` method in scikit-learn sometimes has
# additional arguments. Lale also supports such additional arguments.
#
# In addition to the `__init__`, `fit`, and `predict` methods, many
# scikit-learn estimators also have a `predict_proba` method. Lale will
# support that with its own metadata schema.
#
# ## 6. Consider Contributing to the Lale Open-Source Project
#
# We encourage you to add your new operator to Lale. Take a look at the
# "Developer's Certificate of Origin", which can be found in
# [DCO1.1.txt](https://github.com/IBM/lale/blob/master/DCO1.1.txt).
# Can you agree to the terms in the DCO, as well as to the
# [license](https://github.com/IBM/lale/blob/master/LICENSE.txt)?
# If yes, check out the "For Developers" part at the end of the
# [Installation instructions](https://github.com/IBM/lale/blob/master/docs/installation.rst),
# and follow the steps to create a pull request.
#
# ## 7. Reference
#
# This section documents features of JSON Schema that Lale uses, as well as
# extensions that Lale adds to JSON Schema for information specific to the
# machine-learning domain. For a more comprehensive introduction to JSON
# Schema, refer to its
# [reference](https://json-schema.org/understanding-json-schema/reference/).
#
# The following table lists kinds of schemas in JSON Schema:
#
# | Kind of schema          | Corresponding type in Python/Lale |
# | ----------------------- | --------------------------------- |
# | `null`                  | `NoneType`, value `None`          |
# | `boolean`               | `bool`, values `True` or `False`  |
# | `string`                | `str`                             |
# | `enum`                  | See discussion below.             |
# | `number`                | `float`, e.g., `0.1`              |
# | `integer`               | `int`, e.g., `42`                 |
# | `array`                 | See discussion below.             |
# | `object`                | `dict` with string keys           |
# | `anyOf`, `allOf`, `not` | See discussion below.             |
#
# The use of `null`, `boolean`, and `string` is fairly straightforward. The
# following paragraphs discuss the other kinds of schemas one by one.
#
# ### 7.1. enum
#
# In JSON Schema, an enum can contain assorted values including strings,
# numbers, or even `null`. Lale uses enums of strings for categorical
# hyperparameters, such as `'penalty': {'enum': ['l1', 'l2']}` in the earlier
# example. In that case, Lale also automatically declares a corresponding
# Python `enum`.
# When Lale uses enums of other types, it is usually to restrict a
# hyperparameter to a single value, such as `'enum': [None]`.
#
# ### 7.2. number, integer
#
# In schemas with `type` set to `number` or `integer`, JSON schema lets users
# specify `minimum`, `maximum`, `exclusiveMinimum`, and `exclusiveMaximum`.
# Lale further extends JSON schema with `minimumForOptimizer`,
# `maximumForOptimizer`, and `distribution`. Possible values for the
# `distribution` are `'uniform'` (the default) and `'loguniform'`. In the
# case of integers, Lale quantizes the distributions accordingly.
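# The following sketch shows these keywords on an integer hyperparameter,
# using the same `lale.schemas` helpers as in Section 3; the name and bounds
# are made up purely for illustration.

# In[ ]:

max_iter_sch = Int(desc='Maximum number of iterations (illustrative).',
                   minimum=1,
                   minimumForOptimizer=10, maximumForOptimizer=1000,
                   distribution='uniform', default=100)
ipython_display(max_iter_sch.schema)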
# ### 7.3. array
#
# Lale schemas for input and output data make heavy use of the JSON Schema
# `array` type. In this case, Lale schemas are intended to capture logical
# schemas, not physical representations, similarly to how relational
# databases hide physical representations behind a well-formalized
# abstraction layer. Therefore, Lale uses arrays from JSON Schema for several
# types in Python. The most obvious one is a Python `list`. Another common
# one is a numpy
# [ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html),
# where Lale uses nested arrays to represent each of the dimensions of a
# multi-dimensional array. Lale also has support for `pandas.DataFrame` and
# `pandas.Series`, for which it again uses JSON Schema arrays.
#
# For arrays, JSON schema lets users specify `items`, `minItems`, and
# `maxItems`. Lale further extends JSON schema with `minItemsForOptimizer`
# and `maxItemsForOptimizer`. Furthermore, Lale supports a `laleType`, which
# can be `'Any'` to locally disable a subschema check, or `'tuple'` to
# support cases where the Python code requires a tuple instead of a list.
#
# ### 7.4. object
#
# For objects, JSON schema lets users specify a list `required` of properties
# that must be present, a dictionary `properties` of sub-schemas, and a flag
# `additionalProperties` to indicate whether the object can have additional
# properties beyond the keys of the `properties` dictionary. Lale further
# extends JSON schema with a `relevantToOptimizer` list of properties that
# hyperparameter optimizers should search over.
#
# For individual properties, Lale supports a `default`, which is inspired by
# and consistent with web API specification practice. It also supports a
# `forOptimizer` flag, which defaults to `True` but can be set to `False` to
# hide a particular subschema from the hyperparameter optimizer. For example,
# the number of components for PCA in scikit-learn can be specified as an
# integer or a floating point number, but an optimizer should only explore
# one of these choices. Lale also supports a Boolean flag `transient` that,
# if true, elides a hyperparameter during pretty-printing, visualization, or
# serialization to JSON.
#
# ### 7.5. allOf, anyOf, not
#
# As discussed before, in JSON schema, `allOf` is a logical "and", `anyOf` is
# a logical "or", and `not` is a logical negation. The running example from
# earlier already illustrated how to use these for implementing cross-cutting
# constraints. Another use-case that takes advantage of `anyOf` is for
# expressing union types, which arise frequently in scikit-learn. For
# example, here is the schema for `n_components` from PCA:

# In[18]:

n_components_sch = AnyOf(
    [Enum([None], desc='If not set, keep all components.'),
     Enum(['mle'], desc="Use Minka's MLE to guess the dimension."),
     Float(minimum=0.0, exclusiveMinimum=True,
           maximum=1.0, exclusiveMaximum=True,
           desc='Select the number of components such that the amount of '
                'variance that needs to be explained is greater than the '
                'specified percentage.'),
     Int(minimum=1, forOptimizer=False,
         desc='Number of components to keep.')],
    default=None)

ipython_display(n_components_sch.schema)
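# Applying the same idea to the running example: scikit-learn's
# LogisticRegression also accepts a `class_weight` hyperparameter that is
# either `None`, the string `'balanced'`, or a dict. The following sketch
# covers the first two cases, omitting the dict case for brevity.

# In[ ]:

MyLR = MyLR.customize_schema(
    class_weight=AnyOf(
        [Enum([None], desc='All classes have weight one.'),
         Enum(['balanced'], desc='Adjust weights inversely proportional to '
                                 'class frequencies in the input data.')],
        default=None))
ipython_display(MyLR.hyperparam_schema('class_weight'))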
# ### 7.6. Schema Metadata
#
# We encourage users to make their schemas more readable by also including
# common JSON schema metadata such as `$schema` and `description`. As seen in
# the examples in this document, Lale also extends JSON schema with `tags`
# and `documentation_url`. Finally, in some cases, schema-internal
# duplication can be avoided by cross-references and linking. This is
# supported by off-the-shelf features of JSON schema without requiring
# Lale-specific extensions.
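# For instance, standard JSON schema `$ref` cross-references can factor out a
# repeated sub-schema; the following sketch (with made-up names) defines a
# `row` schema once and links to it twice.

# In[ ]:

linked_schema = {
    'definitions': {
        'row': {'type': 'array', 'items': {'type': 'number'}}},
    'type': 'object',
    'properties': {
        'X': {'type': 'array', 'items': {'$ref': '#/definitions/row'}},
        'coefficients': {'$ref': '#/definitions/row'}}}
ipython_display(linked_schema)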