Since the schema of the C hyperparameter of LR specifies an exclusive minimum of zero, passing zero is not valid. Lale internally calls an off-the-shelf JSON Schema validator when an operator gets configured with concrete hyperparameter values.
from sklearn.linear_model import LogisticRegression as LR
import lale
lale.wrap_imported_operators()
from lale.settings import set_disable_data_schema_validation, set_disable_hyperparams_schema_validation
# Enable schema validation explicitly for this notebook
set_disable_data_schema_validation(False)
set_disable_hyperparams_schema_validation(False)
import jsonschema
import sys
try:
    LR(C=0.0)
except jsonschema.ValidationError as e:
    message = e.message
    print(message, file=sys.stderr)
    assert message.startswith('Invalid configuration for LR(C=0.0)')
Invalid configuration for LR(C=0.0) due to invalid value C=0.0.
Some possible fixes include:
- set C=1.0
Schema of argument C: {
    "description": "Inverse regularization strength. Smaller values specify stronger regularization.",
    "type": "number",
    "distribution": "loguniform",
    "minimum": 0.0,
    "exclusiveMinimum": true,
    "default": 1.0,
    "minimumForOptimizer": 0.03125,
    "maximumForOptimizer": 32768,
}
Invalid value: 0.0
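Under the hood this is ordinary JSON Schema checking. As a minimal sketch, the same per-hyperparameter check can be reproduced with the jsonschema package directly; the schema below is hand-copied from the error message above, and note that draft-04 treats exclusiveMinimum as a boolean flag that strengthens minimum:

```python
import jsonschema

# Draft-04 schema for the C hyperparameter, as shown in the error message.
# In draft-04, "exclusiveMinimum": true turns "minimum" into a strict bound.
schema = {"type": "number", "minimum": 0.0, "exclusiveMinimum": True}

validator = jsonschema.Draft4Validator(schema)
print(validator.is_valid(0.0))  # False: violates the exclusive minimum
print(validator.is_valid(1.0))  # True: strictly positive
```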
Besides per-hyperparameter types, there are also conditional inter-hyperparameter constraints. These are checked using the same call to an off-the-shelf JSON Schema validator.
try:
    LR(LR.enum.solver.sag, LR.enum.penalty.l1)
except jsonschema.ValidationError as e:
    message = e.message
    print(message, file=sys.stderr)
    assert message.find('support only l2 or no penalties') != -1
Invalid configuration for LR(solver='sag', penalty='l1') due to constraint the newton-cg, sag, and lbfgs solvers support only l2 or no penalties.
Some possible fixes include:
- set penalty='l2'
Schema of failing constraint: https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html#constraint-1
Invalid value: {'solver': 'sag', 'penalty': 'l1', 'dual': False, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None}
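Such a conditional constraint can itself be phrased in JSON Schema. The following is a hand-written sketch (not Lale's actual schema) of the solver/penalty rule as a logical implication, encoded with anyOf: either the solver is not one of the restricted ones, or the penalty must be l2 or none:

```python
import jsonschema

# Sketch of "newton-cg, sag, and lbfgs support only l2 or no penalties"
# as an implication: NOT(restricted solver) OR (penalty in {l2, none}).
constraint = {
    "anyOf": [
        {"not": {"properties": {"solver": {"enum": ["newton-cg", "sag", "lbfgs"]}}}},
        {"properties": {"penalty": {"enum": ["l2", "none"]}}},
    ]
}

validator = jsonschema.Draft4Validator(constraint)
print(validator.is_valid({"solver": "sag", "penalty": "l2"}))        # True
print(validator.is_valid({"solver": "sag", "penalty": "l1"}))        # False
print(validator.is_valid({"solver": "liblinear", "penalty": "l1"}))  # True
```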
There are even constraints that affect three different hyperparameters.
try:
    LR(LR.enum.penalty.l2, LR.enum.solver.sag, dual=True)
except jsonschema.ValidationError as e:
    message = str(e)
    print(message, file=sys.stderr)
    assert message.find('dual formulation is only implemented for') != -1
Invalid configuration for LR(penalty='l2', solver='sag', dual=True) due to constraint the dual formulation is only implemented for l2 penalty with the liblinear solver.
Some possible fixes include:
- set dual=False
Schema of failing constraint: https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html#constraint-2
Invalid value: {'penalty': 'l2', 'solver': 'sag', 'dual': True, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None}
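A three-way constraint follows the same pattern. Here is a hand-written sketch (again, an illustration rather than Lale's actual schema) of the dual-formulation rule: dual=True is only allowed together with penalty='l2' and solver='liblinear':

```python
import jsonschema

# Sketch of "dual is only implemented for l2 penalty with liblinear":
# NOT(dual=True) OR (penalty='l2' AND solver='liblinear').
constraint = {
    "anyOf": [
        {"not": {"properties": {"dual": {"enum": [True]}}}},
        {"properties": {"penalty": {"enum": ["l2"]},
                        "solver": {"enum": ["liblinear"]}}},
    ]
}

validator = jsonschema.Draft4Validator(constraint)
print(validator.is_valid({"dual": True, "penalty": "l2", "solver": "liblinear"}))  # True
print(validator.is_valid({"dual": True, "penalty": "l2", "solver": "sag"}))        # False
print(validator.is_valid({"dual": False, "penalty": "l2", "solver": "sag"}))       # True
```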
Lale uses JSON Schema validation not only for hyperparameters but also for data. The dataset train_X is multimodal: some columns contain text strings whereas others contain numbers.
import pandas as pd
from lale.datasets.uci.uci_datasets import fetch_drugscom
train_X, train_y, test_X, test_y = fetch_drugscom()
pd.concat([train_X.head(), train_y.head()], axis=1)
|   | drugName | condition | review | date | usefulCount | rating |
|---|---|---|---|---|---|---|
| 0 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | May 20, 2012 | 27 | 9.0 |
| 1 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | April 27, 2010 | 192 | 8.0 |
| 2 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | December 14, 2009 | 17 | 5.0 |
| 3 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | November 3, 2015 | 10 | 8.0 |
| 4 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | November 27, 2016 | 37 | 9.0 |
# Enable schema validation for data
from lale.settings import set_disable_data_schema_validation
set_disable_data_schema_validation(False)
from lale.pretty_print import ipython_display
ipython_display(lale.datasets.data_schemas.to_schema(train_X))
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "array",
"items": {
"type": "array",
"minItems": 5,
"maxItems": 5,
"items": [
{"description": "drugName", "type": "string"},
{
"description": "condition",
"anyOf": [{"type": "string"}, {"enum": [NaN]}],
},
{"description": "review", "type": "string"},
{"description": "date", "type": "string"},
{"description": "usefulCount", "type": "integer", "minimum": 0},
],
},
"minItems": 161297,
"maxItems": 161297,
}
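The idea behind deriving such a schema from a DataFrame can be sketched as mapping each pandas column dtype to a JSON Schema type. The function below is a simplified illustration, not Lale's actual to_schema implementation (which, among other things, also emits the anyOf with NaN seen above for string columns containing missing values):

```python
import pandas as pd

# Simplified sketch of deriving a JSON Schema from a DataFrame:
# each column dtype becomes a JSON Schema type in a tuple-style "items" list.
def sketch_to_schema(df: pd.DataFrame) -> dict:
    def col_schema(col):
        if pd.api.types.is_integer_dtype(col):
            return {"description": col.name, "type": "integer"}
        if pd.api.types.is_float_dtype(col):
            return {"description": col.name, "type": "number"}
        return {"description": col.name, "type": "string"}
    n_cols = len(df.columns)
    return {
        "type": "array",                       # outer array: rows
        "minItems": len(df), "maxItems": len(df),
        "items": {
            "type": "array",                   # inner array: one row
            "minItems": n_cols, "maxItems": n_cols,
            "items": [col_schema(df[c]) for c in df.columns],
        },
    }

df = pd.DataFrame({"name": ["a", "b"], "count": [1, 2]})
print(sketch_to_schema(df))
```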
Since train_X contains strings but LR expects only numbers, the call to fit reports a type error.
trainable_lr = LR(max_iter=1000)
try:
    LR.validate_schema(train_X, train_y)
except ValueError as e:
    message = str(e)
    print(message, file=sys.stderr)
    assert message.startswith('LR.fit() invalid X')
LR.fit() invalid X, the schema of the actual data is not a subschema of the expected schema of the argument.
actual_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "array",
    "items": {
        "type": "array",
        "minItems": 5,
        "maxItems": 5,
        "items": [
            {"description": "drugName", "type": "string"},
            {
                "description": "condition",
                "anyOf": [{"type": "string"}, {"enum": [NaN]}],
            },
            {"description": "review", "type": "string"},
            {"description": "date", "type": "string"},
            {"description": "usefulCount", "type": "integer", "minimum": 0},
        ],
    },
    "minItems": 161297,
    "maxItems": 161297,
}
expected_schema = {
    "description": "Features; the outer array is over samples.",
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}
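Note that the error is phrased as a subschema (subtype) check rather than a validation of individual values: the schema of the actual data must describe a subset of what the expected schema allows. A minimal sketch of this idea, restricted to primitive types only (the real check is far more general):

```python
# Minimal sketch of the subschema idea for primitive types only: data
# described as "string" cannot flow into an argument expecting "number".
def is_primitive_subschema(sub: dict, sup: dict) -> bool:
    if sup.get("type") == "number":
        # every integer is a number, so "integer" is a subtype of "number"
        return sub.get("type") in ("number", "integer")
    return sub.get("type") == sup.get("type")

print(is_primitive_subschema({"type": "integer"}, {"type": "number"}))  # True
print(is_primitive_subschema({"type": "string"}, {"type": "number"}))   # False
```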
Load a pure numerical dataset instead.
from lale.datasets import load_iris_df
(train_X, train_y), (test_X, test_y) = load_iris_df()
train_X.head()
|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.0 | 3.4 | 1.6 | 0.4 |
| 1 | 6.3 | 3.3 | 4.7 | 1.6 |
| 2 | 5.1 | 3.4 | 1.5 | 0.2 |
| 3 | 4.8 | 3.0 | 1.4 | 0.1 |
| 4 | 6.7 | 3.1 | 4.7 | 1.5 |
Training LR with the Iris dataset works fine.
trained_lr = trainable_lr.fit(train_X, train_y)
Lale encourages separating the lifecycle states, here represented by trainable_lr vs. trained_lr. The predict method should only be called on a trained model.
predicted = trained_lr.predict(test_X)
print(f'test_y {[*test_y]}')
print(f'predicted {[*predicted]}')
test_y [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
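The trainable-vs-trained separation can be sketched in plain Python; the classes below are an illustration of the pattern, not Lale's actual implementation. The key point is that fit returns a new object of a different type instead of mutating the trainable in place:

```python
# Sketch of the lifecycle-state pattern: fit on a trainable returns a
# separate trained object, so learned state cannot be silently overwritten.
class TrainableModel:
    def __init__(self, **hyperparams):
        self.hyperparams = hyperparams
    def fit(self, X, y):
        coef = sum(y) / len(y)          # stand-in for real training
        return TrainedModel(self.hyperparams, coef)

class TrainedModel:
    def __init__(self, hyperparams, coef):
        self.hyperparams = hyperparams
        self.coef = coef
    def predict(self, X):               # only the trained object can predict
        return [self.coef for _ in X]

trainable = TrainableModel(C=1.0)
trained = trainable.fit([[0], [1]], [0, 1])
print(trained.predict([[2], [3]]))      # [0.5, 0.5]
```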
On the other hand, the predict method should not be called on a trainable model.
import warnings
warnings.filterwarnings("error", category=DeprecationWarning)
try:
    predicted = trainable_lr.predict(test_X)
except DeprecationWarning as w:
    message = str(w)
    print(message, file=sys.stderr)
    assert message.startswith('The `predict` method is deprecated on a trainable')
print(f'test_y {[*test_y]}')
print(f'predicted {[*predicted]}')
test_y [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
The `predict` method is deprecated on a trainable operator, because the learned coefficients could be accidentally overwritten by retraining. Call `predict` on the trained operator returned by `fit` instead.
LogisticRegression is an estimator and therefore does not have a transform method, even when trained.
try:
    trained_lr.transform(train_X)
except AttributeError as e:
    message = str(e)
    print(message, file=sys.stderr)
    assert 'transform' in message