Since the schema of the C hyperparameter of LR specifies an exclusive minimum of zero, passing zero is not valid. Lale internally calls an off-the-shelf JSON Schema validator when an operator gets configured with concrete hyperparameter values.
from sklearn.linear_model import LogisticRegression as LR
import lale
lale.wrap_imported_operators()
import jsonschema
import sys
try:
    LR(C=0.0)
except jsonschema.ValidationError as e:
    message = e.message
    print(message, file=sys.stderr)
    assert message.startswith('Invalid configuration for LR(C=0.0)')
Invalid configuration for LR(C=0.0) due to invalid value C=0.0.
Schema of argument C: {
    'description': 'Inverse regularization strength. Smaller values specify stronger regularization.',
    'type': 'number',
    'distribution': 'loguniform',
    'minimum': 0.0,
    'exclusiveMinimum': true,
    'default': 1.0,
    'minimumForOptimizer': 0.03125,
    'maximumForOptimizer': 32768}
Value: 0.0
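The schema above uses the draft-04 boolean form of exclusiveMinimum, which turns the minimum keyword into a strict bound. As an illustration of what the numeric part of the check amounts to, here is a tiny hand-rolled version (a sketch for this one keyword pair only, not Lale's or jsonschema's implementation):

```python
def check_exclusive_minimum(value, schema):
    """Draft-04 semantics: exclusiveMinimum is a boolean that makes
    the bound in 'minimum' strict."""
    minimum = schema.get('minimum')
    if minimum is None:
        return True
    if schema.get('exclusiveMinimum', False):
        return value > minimum
    return value >= minimum

schema = {'type': 'number', 'minimum': 0.0, 'exclusiveMinimum': True}
assert not check_exclusive_minimum(0.0, schema)  # rejected, as in the error above
assert check_exclusive_minimum(0.5, schema)      # any strictly positive value is fine
```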
Besides per-hyperparameter types, there are also conditional inter-hyperparameter constraints. These are checked using the same call to an off-the-shelf JSON Schema validator.
try:
    LR(LR.solver.sag, LR.penalty.l1)
except jsonschema.ValidationError as e:
    message = e.message
    print(message, file=sys.stderr)
    assert message.find('support only l2 penalties') != -1
Invalid configuration for LR(solver='sag', penalty='l1') due to constraint the newton-cg, sag, and lbfgs solvers support only l2 penalties.
Schema of constraint 1: {
    'description': 'The newton-cg, sag, and lbfgs solvers support only l2 penalties.',
    'anyOf': [
        { 'type': 'object',
          'properties': {
            'solver': { 'not': { 'enum': ['newton-cg', 'sag', 'lbfgs']}}}},
        { 'type': 'object',
          'properties': {
            'penalty': { 'enum': ['l2']}}}]}
Value: {'solver': 'sag', 'penalty': 'l1', 'dual': False, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'ovr', 'verbose': 0, 'warm_start': False, 'n_jobs': None}
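The anyOf in the constraint schema encodes a disjunction: either the solver is not one of newton-cg, sag, or lbfgs, or the penalty is l2. A minimal sketch of evaluating that disjunction by hand, covering only the tiny subset of JSON Schema keywords this constraint uses (hypothetical helper, not Lale's validator):

```python
def satisfies(values, branch):
    """Check a hyperparameter dict against one anyOf branch that only uses
    'properties' with 'enum' or 'not'/'enum' (a tiny subset of JSON Schema)."""
    for name, rule in branch.get('properties', {}).items():
        if name not in values:
            continue
        if 'enum' in rule and values[name] not in rule['enum']:
            return False
        if 'not' in rule and values[name] in rule['not'].get('enum', []):
            return False
    return True

constraint = {
    'anyOf': [
        {'type': 'object',
         'properties': {'solver': {'not': {'enum': ['newton-cg', 'sag', 'lbfgs']}}}},
        {'type': 'object',
         'properties': {'penalty': {'enum': ['l2']}}}]}

def valid(values):
    # anyOf: at least one branch must accept the configuration
    return any(satisfies(values, b) for b in constraint['anyOf'])

assert not valid({'solver': 'sag', 'penalty': 'l1'})    # the failing combination
assert valid({'solver': 'sag', 'penalty': 'l2'})        # l2 penalty is fine for sag
assert valid({'solver': 'liblinear', 'penalty': 'l1'})  # l1 is fine for liblinear
```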
There are even constraints that affect three different hyperparameters.
try:
    LR(LR.penalty.l2, LR.solver.sag, dual=True)
except jsonschema.ValidationError as e:
    message = e.message
    print(message, file=sys.stderr)
    assert message.find('dual formulation is only implemented for') != -1
Invalid configuration for LR(penalty='l2', solver='sag', dual=True) due to constraint the dual formulation is only implemented for l2 penalty with the liblinear solver.
Schema of constraint 2: {
    'description': 'The dual formulation is only implemented for l2 penalty with the liblinear solver.',
    'anyOf': [
        { 'type': 'object',
          'properties': { 'dual': { 'enum': [false]}}},
        { 'type': 'object',
          'properties': {
            'penalty': { 'enum': ['l2']},
            'solver': { 'enum': ['liblinear']}}}]}
Value: {'penalty': 'l2', 'solver': 'sag', 'dual': True, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'ovr', 'verbose': 0, 'warm_start': False, 'n_jobs': None}
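Read as propositional logic, this three-hyperparameter constraint says: either dual is false, or penalty is 'l2' and solver is 'liblinear'. A direct Boolean encoding, for illustration only:

```python
def valid(hps):
    """Direct encoding of the anyOf above:
    dual is False, OR (penalty == 'l2' AND solver == 'liblinear')."""
    return (hps['dual'] is False) or \
           (hps['penalty'] == 'l2' and hps['solver'] == 'liblinear')

assert not valid({'penalty': 'l2', 'solver': 'sag', 'dual': True})    # rejected above
assert valid({'penalty': 'l2', 'solver': 'liblinear', 'dual': True})  # dual OK here
assert valid({'penalty': 'l1', 'solver': 'saga', 'dual': False})      # dual off: anything goes
```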
Lale uses JSON Schema validation not only for hyperparameters but also for data. The dataset train_X is multimodal: some columns contain text strings whereas others contain numbers.
import pandas as pd
from lale.datasets.uci.uci_datasets import fetch_drugscom
train_X, train_y, test_X, test_y = fetch_drugscom()
pd.concat([train_X.head(), train_y.head()], axis=1)
| | drugName | condition | review | date | usefulCount | rating |
|---|---|---|---|---|---|---|
| 0 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | May 20, 2012 | 27 | 9.0 |
| 1 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | April 27, 2010 | 192 | 8.0 |
| 2 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | December 14, 2009 | 17 | 5.0 |
| 3 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | November 3, 2015 | 10 | 8.0 |
| 4 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | November 27, 2016 | 37 | 9.0 |
from lale.pretty_print import ipython_display
ipython_display(lale.datasets.data_schemas.to_schema(train_X))
{
'type': 'array',
'items': {
'type': 'array',
'minItems': 5,
'maxItems': 5,
'items': [
{ 'description': 'drugName',
'type': 'string'},
{ 'description': 'condition',
'anyOf': [
{ 'type': 'string'},
{ 'enum': [NaN]}]},
{ 'description': 'review',
'type': 'string'},
{ 'description': 'date',
'type': 'string'},
{ 'description': 'usefulCount',
'type': 'integer',
'minimum': 0}]},
'minItems': 161297,
'maxItems': 161297}
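The schema printed above is inferred from the data itself: each column maps to a per-item schema, with types such as string or non-negative integer. A very simplified sketch of per-column schema inference (a hypothetical helper working on plain Python values, not Lale's actual to_schema logic, which inspects NumPy dtypes and NaNs):

```python
def column_schema(name, values):
    """Infer a JSON Schema fragment for one column from its values
    (simplified sketch; booleans are checked first since bool is a
    subclass of int in Python)."""
    if all(isinstance(v, bool) for v in values):
        return {'description': name, 'type': 'boolean'}
    if all(isinstance(v, int) for v in values):
        item = {'description': name, 'type': 'integer'}
        if min(values) >= 0:
            item['minimum'] = 0  # all observed values are non-negative
        return item
    if all(isinstance(v, (int, float)) for v in values):
        return {'description': name, 'type': 'number'}
    return {'description': name, 'type': 'string'}

assert column_schema('usefulCount', [27, 192, 17]) == \
    {'description': 'usefulCount', 'type': 'integer', 'minimum': 0}
assert column_schema('drugName', ['Valsartan', 'Guanfacine'])['type'] == 'string'
```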
Since train_X contains strings but LR expects only numbers, validating this data against the schema of fit reports a type error.
trainable_lr = LR()
try:
    LR.validate_schema(train_X, train_y)
except ValueError as e:
    message = str(e)
    print(message, file=sys.stderr)
    assert message.startswith('LR.fit() invalid X')
LR.fit() invalid X: Expected sub to be a subschema of super.
sub = {
    'type': 'array',
    'items': {
        'type': 'array',
        'minItems': 5,
        'maxItems': 5,
        'items': [
            { 'description': 'drugName', 'type': 'string'},
            { 'description': 'condition',
              'anyOf': [{ 'type': 'string'}, { 'enum': [NaN]}]},
            { 'description': 'review', 'type': 'string'},
            { 'description': 'date', 'type': 'string'},
            { 'description': 'usefulCount', 'type': 'integer', 'minimum': 0}]},
    'minItems': 161297,
    'maxItems': 161297}
super = {
    'description': 'Features; the outer array is over samples.',
    'type': 'array',
    'items': { 'type': 'array', 'items': { 'type': 'number'}}}
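The error comes from a subschema check: the schema inferred from the actual data must be a subschema of what fit expects, meaning every value the data schema allows is also allowed by the expected schema. A toy fragment of that idea, restricted to the type keyword only (the real check handles the full schema language):

```python
def type_subsumed(sub, sup):
    """Toy subschema check on 'type' alone: every value allowed by `sub`
    must also be allowed by `sup`."""
    if sub == sup:
        return True
    # every integer is a number, so 'integer' data fits a 'number' schema
    return sub == 'integer' and sup == 'number'

assert type_subsumed('integer', 'number')     # usefulCount would be acceptable
assert not type_subsumed('string', 'number')  # the failure reported above
```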
Load a pure numerical dataset instead.
from lale.datasets import load_iris_df
(train_X, train_y), (test_X, test_y) = load_iris_df()
train_X.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.0 | 3.4 | 1.6 | 0.4 |
| 1 | 6.3 | 3.3 | 4.7 | 1.6 |
| 2 | 5.1 | 3.4 | 1.5 | 0.2 |
| 3 | 4.8 | 3.0 | 1.4 | 0.1 |
| 4 | 6.7 | 3.1 | 4.7 | 1.5 |
Training LR with the Iris dataset works fine.
trained_lr = trainable_lr.fit(train_X, train_y)
Lale encourages separating the lifecycle states, here represented by trainable_lr vs. trained_lr. The predict method should only be called on a trained model.
predicted = trained_lr.predict(test_X)
print(f'test_y {[*test_y]}')
print(f'predicted {[*predicted]}')
test_y [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
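The two printed lists agree at every position. A small helper makes that check explicit (sklearn.metrics.accuracy_score computes the same quantity):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

test_y    = [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2,
             0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted = [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2,
             0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
assert accuracy(test_y, predicted) == 1.0  # the lists above agree everywhere
```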
On the other hand, the predict method should not be called on a trainable model.
import warnings
warnings.filterwarnings("error", category=DeprecationWarning)
try:
    predicted = trainable_lr.predict(test_X)
except DeprecationWarning as w:
    message = str(w)
    print(message, file=sys.stderr)
    assert message.startswith('The `predict` method is deprecated on a trainable')
print(f'test_y {[*test_y]}')
print(f'predicted {[*predicted]}')
test_y [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
The `predict` method is deprecated on a trainable operator, because the learned coefficients could be accidentally overwritten by retraining. Call `predict` on the trained operator returned by `fit` instead.
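The lifecycle discipline behind this warning can be sketched with two hypothetical classes (an illustration, not Lale's implementation): fit on a trainable object returns a fresh trained object rather than mutating its receiver, and predict on the trainable object merely warns.

```python
import warnings

class TrainedMean:
    """Hypothetical trained operator: holds immutable learned state."""
    def __init__(self, mean):
        self.mean = mean
    def predict(self, xs):
        return [self.mean for _ in xs]

class TrainableMean:
    """Hypothetical trainable operator: fit returns a new trained object
    instead of mutating self, mirroring Lale's lifecycle separation."""
    def fit(self, xs):
        return TrainedMean(sum(xs) / len(xs))
    def predict(self, xs):
        warnings.warn('predict called on a trainable operator; use the '
                      'trained operator returned by fit instead',
                      DeprecationWarning)
        return self.fit(xs).predict(xs)

trained = TrainableMean().fit([1.0, 2.0, 3.0])
assert trained.predict([10, 20]) == [2.0, 2.0]  # trained state cannot be overwritten
```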
LogisticRegression is an estimator and therefore does not have a transform method, even when trained.
try:
    trained_lr.transform(train_X)
except AttributeError as e:
    message = 'AttributeError'
    print(message, file=sys.stderr)
    assert message.startswith('AttributeError')
AttributeError