__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Fall 2020"
This repository contains a number of PyTorch modules designed to support our core content and provide tools for homeworks and bake-offs:
%ls torch*
torch_autoencoder.py       torch_rnn_classifier.py
torch_color_describer.py   torch_shallow_neural_classifier.py
torch_glove.py             torch_tree_nn.py
torch_model_base.py
The goal of the current notebook is to provide some guidance on how you can extend these modules to create original custom systems. Once you get used to how the code is structured, this is sure to be much faster than coding from scratch, and it still allows you a lot of freedom to design new models.
The base class for all the modules is torch_model_base.TorchModelBase. The central role of this class is to provide a very full-featured fit method. See General optimization choices for an overview of the knobs and levers it provides. The interface is generic enough to accommodate a wide range of tasks.
In what follows, we consider three kinds of extension, aiming to highlight general techniques and code patterns:

- torch_shallow_neural_classifier.py: new classifiers created by subclassing and swapping in a new computation graph
- torch_model_base.py: new estimators (here, regression models) built directly on the base class
- torch_rnn_classifier.py: a full sequence labeler derived from the RNN classifier components
If you are experienced with PyTorch already, you can probably dive right into this notebook. If not, then I recommend our PyTorch tutorial notebook to start.
import nltk
from sklearn.datasets import load_iris, load_boston
from sklearn.metrics import classification_report, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
import torch
import torch.nn as nn
from torch_model_base import TorchModelBase
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNDataset, TorchRNNClassifier, TorchRNNModel
import utils
The TorchModelBase class has a number of keyword parameters that relate to how models are optimized.
TorchModelBase().params
['batch_size', 'max_iter', 'eta', 'optimizer_class', 'l2_strength', 'gradient_accumulation_steps', 'max_grad_norm', 'validation_fraction', 'early_stopping', 'n_iter_no_change', 'warm_start', 'tol']
For descriptions of what these parameters do, please refer to the docstring for the class.
All of these parameters can be included in hyperparameter optimization runs using tools in sklearn.model_selection, as we'll see below.
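More simply, any of them can be set directly as keyword arguments when constructing a model. A quick sketch, using parameters from the list above:

mod = TorchShallowNeuralClassifier(
    batch_size=256,       # smaller batches than the default of 1028
    eta=0.01,             # a larger learning rate than the default of 0.001
    early_stopping=True)  # stop when the validation score stops improving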
To create new classifiers, one typically just needs to subclass TorchShallowNeuralClassifier and write a new build_graph method to define your computation graph. Here we illustrate with some representative examples, using the Iris plants dataset for evaluations:
def iris_split():
dataset = load_iris()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
return X_train, X_test, y_train, y_test
X_cls_train, X_cls_test, y_cls_train, y_cls_test = iris_split()
For a softmax classifier, we just need to write a simple build_graph method:
class TorchSoftmaxClassifier(TorchShallowNeuralClassifier):
def build_graph(self):
return nn.Sequential(
nn.Linear(self.input_dim, self.n_classes_))
Since the data format and optimization process are the same as for TorchShallowNeuralClassifier, we needn't do anything beyond this.
Quick illustration:
sm_mod = TorchSoftmaxClassifier()
sm_mod
TorchSoftmaxClassifier( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05, hidden_dim=50, hidden_activation=Tanh())
Note: as you can see here, this model will still accept the keyword arguments hidden_dim and hidden_activation, which will be ignored since the graph doesn't use them. I'll leave this minor inconsistency aside.
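That said, if the stray parameters bother you, one fix (a minimal sketch, using the same self.params bookkeeping that TorchDeeperNeuralClassifier relies on below) is to remove them in a custom __init__:

class TorchSoftmaxClassifier(TorchShallowNeuralClassifier):
    def __init__(self, **base_kwargs):
        super().__init__(**base_kwargs)
        # Drop the hyperparameters the graph doesn't use, so they no
        # longer show up in the repr or in model_selection searches:
        self.params.remove("hidden_dim")
        self.params.remove("hidden_activation")

    def build_graph(self):
        return nn.Sequential(
            nn.Linear(self.input_dim, self.n_classes_))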
_ = sm_mod.fit(X_cls_train, y_cls_train)
Finished epoch 1000 of 1000; error is 0.4739058315753937
sm_preds = sm_mod.predict(X_cls_test)
print(classification_report(y_cls_test, sm_preds))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.92      0.73      0.81        15
           2       0.79      0.94      0.86        16

    accuracy                           0.90        50
   macro avg       0.90      0.89      0.89        50
weighted avg       0.91      0.90      0.90        50
TorchModelBase is able to "duck type" standard sklearn estimators, so we can use the functionality from sklearn.model_selection. For example, here we use sklearn.model_selection.cross_validate:
cross_validate(sm_mod, X_cls_train, y_cls_train, cv=5)
Finished epoch 1000 of 1000; error is 0.58722406625747686
{'fit_time': array([1.90538383, 1.82407284, 1.84190989, 1.83592701, 1.84237123]), 'score_time': array([0.00169611, 0.0011301 , 0.00174618, 0.00141382, 0.0018909 ]), 'test_score': array([0.68660969, 0.84242424, 0.84615385, 0.51515152, 0.76911977])}
TorchShallowNeuralClassifier is "shallow" in that it has just one hidden layer of representation. Adding a second is very straightforward. Again, all we really have to do is write a new build_graph method, but the implementation below also includes a new __init__ method to allow the user to separately control the sizes of the two hidden layers:
class TorchDeeperNeuralClassifier(TorchShallowNeuralClassifier):
def __init__(self, hidden_dim1=50, hidden_dim2=50, **base_kwargs):
super().__init__(**base_kwargs)
self.hidden_dim1 = hidden_dim1
self.hidden_dim2 = hidden_dim2
# Good to remove this to avoid confusion:
self.params.remove("hidden_dim")
# Add the new parameters to support model_selection using them:
self.params += ["hidden_dim1", "hidden_dim2"]
def build_graph(self):
return nn.Sequential(
nn.Linear(self.input_dim, self.hidden_dim1),
self.hidden_activation,
nn.Linear(self.hidden_dim1, self.hidden_dim2),
self.hidden_activation,
nn.Linear(self.hidden_dim2, self.n_classes_))
deep_mod = TorchDeeperNeuralClassifier()
deep_mod
TorchDeeperNeuralClassifier( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05, hidden_activation=Tanh(), hidden_dim1=50, hidden_dim2=50)
_ = deep_mod.fit(X_cls_train, y_cls_train)
Finished epoch 1000 of 1000; error is 0.023747699335217476
deep_preds = deep_mod.predict(X_cls_test)
print(classification_report(y_cls_test, deep_preds))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.94      1.00      0.97        15
           2       1.00      0.94      0.97        16

    accuracy                           0.98        50
   macro avg       0.98      0.98      0.98        50
weighted avg       0.98      0.98      0.98        50
To try to find optimal values for the hidden layer dimensionalities, we could do some hyperparameter tuning:
xval = GridSearchCV(
TorchDeeperNeuralClassifier(),
param_grid={
'hidden_dim1': [5, 10],
'hidden_dim2': [5, 10]})
best_mod = xval.fit(X_cls_train, y_cls_train)
Finished epoch 1000 of 1000; error is 0.060364335775375366
xval.best_score_
0.9672889488678962
best_mod
GridSearchCV(estimator=TorchDeeperNeuralClassifier( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05, hidden_activation=Tanh(), hidden_dim1=50, hidden_dim2=50), param_grid={'hidden_dim1': [5, 10], 'hidden_dim2': [5, 10]})
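With the search complete, the usual sklearn GridSearchCV attributes are available. For instance (a sketch assuming the default refit=True behavior, which refits the best model on all the training data):

print(xval.best_params_)
best_preds = best_mod.predict(X_cls_test)
print(classification_report(y_cls_test, best_preds))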
It is also easy to write regression models. For these, we will subclass TorchModelBase directly, since some fundamental things are different from the classifiers above.
For illustrations, we'll use a random split of the Boston house prices dataset:
def boston_split():
dataset = load_boston()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
return X_train, X_test, y_train, y_test
X_reg_train, X_reg_test, y_reg_train, y_reg_test = boston_split()
For linear regression, we create an nn.Module
subclass:
class TorchLinearRegressionModel(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.input_dim = input_dim
self.w = nn.Parameter(torch.zeros(self.input_dim))
self.b = nn.Parameter(torch.zeros(1))
def forward(self, X):
return X.matmul(self.w) + self.b
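As an aside, this hand-rolled parameterization is equivalent (up to initialization) to a single nn.Linear layer with one output unit. A sketch of that variant, with a hypothetical name:

class TorchLinearRegressionModelAlt(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # One output unit; `nn.Linear` bundles the weights and bias:
        self.layer = nn.Linear(input_dim, 1)

    def forward(self, X):
        # Squeeze away the final dimension so predictions have the
        # same shape as `y`, as `nn.MSELoss` expects:
        return self.layer(X).squeeze(-1)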
The estimator itself, a subclass of TorchModelBase, needs the following methods:

- build_graph: to use TorchLinearRegressionModel from above.
- build_dataset: for processing the data.
- predict: for making predictions.
- score: technically optional, but required for sklearn.model_selection usage.

class TorchLinearRegression(TorchModelBase):
def __init__(self, **base_kwargs):
super().__init__(**base_kwargs)
self.loss = nn.MSELoss(reduction="mean")
def build_graph(self):
return TorchLinearRegressionModel(self.input_dim)
def build_dataset(self, X, y=None):
"""
This function will be used in training (when there is a `y`)
and in prediction (no `y`). For both cases, we rely on a
`TensorDataset`.
"""
X = torch.FloatTensor(X)
self.input_dim = X.shape[1]
if y is None:
dataset = torch.utils.data.TensorDataset(X)
else:
y = torch.FloatTensor(y)
dataset = torch.utils.data.TensorDataset(X, y)
return dataset
def predict(self, X, device=None):
"""
The `_predict` function of the base class handles all the
details around data formatting. In this case, the
raw output of `self.model`, as given by
`TorchLinearRegressionModel.forward` is all we need.
"""
return self._predict(X, device=device).cpu().numpy()
def score(self, X, y):
"""
Follow sklearn in using `r2_score` as the default scorer.
"""
preds = self.predict(X)
return r2_score(y, preds)
lr = TorchLinearRegression()
lr
TorchLinearRegression( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05)
_ = lr.fit(X_reg_train, y_reg_train)
Finished epoch 1000 of 1000; error is 52.95167922973633
lr_preds = lr.predict(X_reg_test)
r2_score(y_reg_test, lr_preds)
0.3236728529459678
We can extend the subclass we just created to easily create deeper regression models. Here's an example showing that all we need is a deeper nn.Module (given its own name here so that it doesn't overwrite TorchLinearRegressionModel from above) and a new build_graph method in the main estimator:
class TorchDeeperLinearRegressionModel(nn.Module):
def __init__(self, input_dim, hidden_dim, hidden_activation):
super().__init__()
self.input_dim = input_dim
self.hidden_dim = hidden_dim
self.hidden_activation = hidden_activation
self.input_layer = nn.Linear(self.input_dim, self.hidden_dim)
self.w = nn.Parameter(torch.zeros(self.hidden_dim))
self.b = nn.Parameter(torch.zeros(1))
def forward(self, X):
h = self.hidden_activation(self.input_layer(X))
return h.matmul(self.w) + self.b
class TorchDeeperLinearRegression(TorchLinearRegression):
def __init__(self, hidden_dim=20, hidden_activation=nn.Tanh(), **kwargs):
super().__init__(**kwargs)
self.hidden_dim = hidden_dim
self.hidden_activation = hidden_activation
self.params += ["hidden_dim", "hidden_activation"]
def build_graph(self):
        return TorchDeeperLinearRegressionModel(
input_dim=self.input_dim,
hidden_dim=self.hidden_dim,
hidden_activation=self.hidden_activation)
deep_lr = TorchDeeperLinearRegression()
deep_lr
TorchDeeperLinearRegression( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05, hidden_dim=20, hidden_activation=Tanh())
_ = deep_lr.fit(X_reg_train, y_reg_train)
Finished epoch 1000 of 1000; error is 132.6202392578125
deep_lr_preds = deep_lr.predict(X_reg_test)
r2_score(y_reg_test, deep_lr_preds)
-0.3762662051157306
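This is quite poor. One plausible explanation (an assumption, not something we verify here) is that the raw Boston features are on very different scales, which can destabilize gradient-based training. Standardizing the features before fitting is a common remedy; a sketch:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_reg_train_std = scaler.fit_transform(X_reg_train)
X_reg_test_std = scaler.transform(X_reg_test)

# Refit on the standardized features (results will vary):
deep_lr_std = TorchDeeperLinearRegression()
_ = deep_lr_std.fit(X_reg_train_std, y_reg_train)
r2_score(y_reg_test, deep_lr_std.predict(X_reg_test_std))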
As a final illustrative example, let's make use of our existing RNN classifier components to create a model that can do full sequence labeling. PyTorch's abstractions concerning how layers interact and how loss functions work make this surprisingly easy.
For examples, we'll use the CoNLL 2002 shared task on named entity labeling in Spanish. NLTK provides an easy interface:
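If you haven't used this corpus before, it may need to be downloaded first (a one-time step, depending on your local NLTK setup):

nltk.download('conll2002')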
def sequence_dataset():
train_seq = nltk.corpus.conll2002.iob_sents('esp.train')
X = [[x[0] for x in seq] for seq in train_seq]
y = [[x[2] for x in seq] for seq in train_seq]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
vocab = sorted({w for seq in X_train for w in seq}) + ["$UNK"]
return X_train, X_test, y_train, y_test, vocab
X_seq_train, X_seq_test, y_seq_train, y_seq_test, seq_vocab = sequence_dataset()
Here are the first few tokens in the first training example:
X_seq_train[0][: 8]
['La', 'compañía', 'estatal', 'de', 'electricidad', 'de', 'Suecia', ',']
And the corresponding labels:
y_seq_train[0][: 8]
['O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
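The label distribution is heavily skewed toward 'O', which is worth keeping in mind when we score with macro-F1 below. A quick way to check (a sketch):

from collections import Counter

Counter(label for seq in y_seq_train for label in seq)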
We'll start with the nn.Module subclass we need. In torch_rnn_classifier.py, we already have a pretty generic RNN module: TorchRNNModel. For the classifier use, TorchRNNClassifierModel uses the output of TorchRNNModel to define a classifier based on the final output state. For sequence labeling, we drop TorchRNNClassifierModel and replace it with a model that has a classifier on every output state:
class TorchSequenceLabeler(nn.Module):
def __init__(self, rnn, output_dim):
super().__init__()
self.rnn = rnn
self.output_dim = output_dim
if self.rnn.bidirectional:
self.classifier_dim = self.rnn.hidden_dim * 2
else:
self.classifier_dim = self.rnn.hidden_dim
self.classifier_layer = nn.Linear(
self.classifier_dim, self.output_dim)
def forward(self, X, seq_lengths):
outputs, state = self.rnn(X, seq_lengths)
outputs, seq_length = torch.nn.utils.rnn.pad_packed_sequence(
outputs, batch_first=True)
logits = self.classifier_layer(outputs)
# During training, we need to swap the dimensions of logits
# to accommodate `nn.CrossEntropyLoss`:
if self.training:
return logits.transpose(1, 2)
else:
return logits
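The transpose in forward is worth dwelling on: for sequence targets, nn.CrossEntropyLoss expects logits of shape (batch_size, n_classes, seq_length) when the targets have shape (batch_size, seq_length). A small self-contained check, with toy dimensions:

loss = nn.CrossEntropyLoss()
batch_size, n_classes, seq_length = 2, 4, 3
logits = torch.randn(batch_size, n_classes, seq_length)
targets = torch.randint(0, n_classes, (batch_size, seq_length))
print(loss(logits, targets))  # a single scalar averaged over all tokens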
We won't normally interact with this module directly, but it's perhaps instructive to see how it works on its own:
vocab_size = 4
seq_rnn = TorchRNNModel(vocab_size, embed_dim=4, hidden_dim=5)
seq_module = TorchSequenceLabeler(seq_rnn, vocab_size)
_ = seq_module.eval()
toy_seqs = torch.LongTensor([[0,1,2], [0,2,1]])
seq_lengths = torch.LongTensor([3,3])
This should return two sequences of 4-dimensional vectors – the per-token logits:
seq_module(toy_seqs, seq_lengths)
tensor([[[ 0.3255,  0.2848,  0.3470, -0.1150],
         [ 0.2264,  0.3246,  0.3123, -0.1394],
         [ 0.1972,  0.3036,  0.3240, -0.0696]],

        [[ 0.3255,  0.2848,  0.3470, -0.1150],
         [ 0.2272,  0.2959,  0.3383, -0.0673],
         [ 0.1895,  0.3257,  0.3078, -0.1153]]], grad_fn=<AddBackward0>)
The remaining tasks concern the new estimator. We need to define the following methods:

- build_graph: to use TorchSequenceLabeler.
- build_dataset: just like what we need for a classifier, but it has to deal with examples as full sequences.
- predict_proba: like a classifier predict_proba, but it needs to remove any sequence padding and deal with full sequences.
- predict: just like a classifier predict method, but defined for sequences.
- score: also very much like a classifier score function, but designed to deal with sequences.

class TorchRNNSequenceLabeler(TorchRNNClassifier):
def build_graph(self):
rnn = TorchRNNModel(
vocab_size=len(self.vocab),
embedding=self.embedding,
use_embedding=self.use_embedding,
embed_dim=self.embed_dim,
rnn_cell_class=self.rnn_cell_class,
hidden_dim=self.hidden_dim,
bidirectional=self.bidirectional,
freeze_embedding=self.freeze_embedding)
model = TorchSequenceLabeler(
rnn=rnn,
output_dim=self.n_classes_)
self.embed_dim = rnn.embed_dim
return model
def build_dataset(self, X, y=None):
X, seq_lengths = self._prepare_sequences(X)
if y is None:
return TorchRNNDataset(X, seq_lengths)
else:
# These are the changes from a regular classifier. All
# concern the fact that our labels are sequences of labels.
self.classes_ = sorted({x for seq in y for x in seq})
self.n_classes_ = len(self.classes_)
class2index = dict(zip(self.classes_, range(self.n_classes_)))
# `y` is a list of tensors of different length. Our Dataset
# class will turn it into a padding tensor for processing.
y = [torch.tensor([class2index[label] for label in seq])
for seq in y]
return TorchRNNDataset(X, seq_lengths, y)
def predict_proba(self, X):
seq_lengths = [len(ex) for ex in X]
# The base class does the heavy lifting:
preds = self._predict(X)
# Trim to the actual sequence lengths:
preds = [p[: l] for p, l in zip(preds, seq_lengths)]
# Use `softmax`; the model doesn't do this because the loss
# function does it internally.
probs = [torch.softmax(seq, dim=1) for seq in preds]
return probs
def predict(self, X):
probs = self.predict_proba(X)
return [[self.classes_[i] for i in seq.argmax(axis=1)] for seq in probs]
def score(self, X, y):
preds = self.predict(X)
flat_preds = [x for seq in preds for x in seq]
flat_y = [x for seq in y for x in seq]
return utils.safe_macro_f1(flat_y, flat_preds)
seq_mod = TorchRNNSequenceLabeler(
seq_vocab,
early_stopping=True,
eta=0.001)
%time _ = seq_mod.fit(X_seq_train, y_seq_train)
Stopping after epoch 17. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 8.602030873298645
CPU times: user 24min 41s, sys: 3min 21s, total: 28min 3s Wall time: 10min 22s
seq_mod.score(X_seq_test, y_seq_test)
0.11311924082554141
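To see the labels themselves, we can compare predictions with the gold sequences. A quick sketch using the first test example:

pred_seq = seq_mod.predict(X_seq_test[: 1])[0]
list(zip(X_seq_test[0], y_seq_test[0], pred_seq))[: 8]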