__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Fall 2020"
This repository contains a number of PyTorch modules designed to support our core content and provide tools for homeworks and bake-offs:
%ls torch*
torch_autoencoder.py       torch_rnn_classifier.py
torch_color_describer.py   torch_shallow_neural_classifier.py
torch_glove.py             torch_tree_nn.py
torch_model_base.py
The goal of the current notebook is to provide some guidance on how you can extend these modules to create original custom systems. Once you get used to how the code is structured, this is sure to be much faster than coding from scratch, and it still allows you a lot of freedom to design new models.
The base class for all the modules is torch_model_base.TorchModelBase. The central role of this class is to provide a very full-featured fit method. See General optimization choices for an overview of the knobs and levers it provides. The interface is generic enough to accommodate a wide range of tasks.
In what follows, we consider three kinds of extension, aiming to highlight general techniques and code patterns:

- torch_shallow_neural_classifier.py: new classifiers created by subclassing and swapping in a new computation graph
- torch_model_base.py: new estimators (here, regression models) built directly on the base class
- torch_rnn_classifier.py: a full sequence labeler derived from the RNN classifier components
If you are experienced with PyTorch already, you can probably dive right into this notebook. If not, then I recommend our PyTorch tutorial notebook to start.
import nltk
from sklearn.datasets import load_iris, load_boston
from sklearn.metrics import classification_report, r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
import torch
import torch.nn as nn
from torch_model_base import TorchModelBase
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNDataset, TorchRNNClassifier, TorchRNNModel
import utils
The TorchModelBase class has a number of keyword parameters that relate to how models are optimized.
TorchModelBase().params
['batch_size', 'max_iter', 'eta', 'optimizer_class', 'l2_strength', 'gradient_accumulation_steps', 'max_grad_norm', 'validation_fraction', 'early_stopping', 'n_iter_no_change', 'warm_start', 'tol']
For descriptions of what these parameters do, please refer to the docstring for the class.
All of these parameters can be included in hyperparameter optimization runs using tools in sklearn.model_selection, as we'll see below.
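More simply, any of them can be set directly as keyword arguments when constructing a model. A quick sketch, using parameters from the list above:

mod = TorchShallowNeuralClassifier(
    batch_size=256,       # smaller batches than the default of 1028
    eta=0.01,             # a larger learning rate than the default of 0.001
    early_stopping=True)  # stop when the validation score stops improving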
To create new classifiers, one typically just needs to subclass TorchShallowNeuralClassifier and write a new build_graph method to define your computation graph. Here we illustrate with some representative examples, using the Iris plants dataset for evaluations:
def iris_split():
dataset = load_iris()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
return X_train, X_test, y_train, y_test
X_cls_train, X_cls_test, y_cls_train, y_cls_test = iris_split()
For a softmax classifier, we just need to write a simple build_graph method:
class TorchSoftmaxClassifier(TorchShallowNeuralClassifier):
def build_graph(self):
return nn.Sequential(
nn.Linear(self.input_dim, self.n_classes_))
Since the data format and optimization process are the same as for TorchShallowNeuralClassifier, we needn't do anything beyond this.
Quick illustration:
sm_mod = TorchSoftmaxClassifier()
sm_mod
TorchSoftmaxClassifier( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05, hidden_dim=50, hidden_activation=Tanh())
Note: as you can see here, this model will still accept the keyword arguments hidden_dim and hidden_activation, which will be ignored since the graph doesn't use them. I'll leave this minor inconsistency aside.
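That said, if the stray parameters bother you, one fix (a minimal sketch, using the same self.params bookkeeping that TorchDeeperNeuralClassifier relies on below) is to remove them in a custom __init__:

class TorchSoftmaxClassifier(TorchShallowNeuralClassifier):
    def __init__(self, **base_kwargs):
        super().__init__(**base_kwargs)
        # Drop the hyperparameters the graph doesn't use, so they no
        # longer show up in the repr or in model_selection searches:
        self.params.remove("hidden_dim")
        self.params.remove("hidden_activation")

    def build_graph(self):
        return nn.Sequential(
            nn.Linear(self.input_dim, self.n_classes_))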
_ = sm_mod.fit(X_cls_train, y_cls_train)
Finished epoch 1000 of 1000; error is 0.4739058315753937
sm_preds = sm_mod.predict(X_cls_test)
print(classification_report(y_cls_test, sm_preds))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.92      0.73      0.81        15
           2       0.79      0.94      0.86        16

    accuracy                           0.90        50
   macro avg       0.90      0.89      0.89        50
weighted avg       0.91      0.90      0.90        50
TorchModelBase is able to "duck type" standard sklearn estimators, so we can use the functionality from sklearn.model_selection. For example, here we use sklearn.model_selection.cross_validate:
cross_validate(sm_mod, X_cls_train, y_cls_train, cv=5)
Finished epoch 1000 of 1000; error is 0.58722406625747686
{'fit_time': array([1.90538383, 1.82407284, 1.84190989, 1.83592701, 1.84237123]), 'score_time': array([0.00169611, 0.0011301 , 0.00174618, 0.00141382, 0.0018909 ]), 'test_score': array([0.68660969, 0.84242424, 0.84615385, 0.51515152, 0.76911977])}
TorchShallowNeuralClassifier is "shallow" in that it has just one hidden layer of representation. Adding a second is very straightforward. Again, all we really have to do is write a new build_graph method, but the implementation below also includes a new __init__ method to allow the user to separately control the sizes of the two hidden layers:
class TorchDeeperNeuralClassifier(TorchShallowNeuralClassifier):
def __init__(self, hidden_dim1=50, hidden_dim2=50, **base_kwargs):
super().__init__(**base_kwargs)
self.hidden_dim1 = hidden_dim1
self.hidden_dim2 = hidden_dim2
# Good to remove this to avoid confusion:
self.params.remove("hidden_dim")
# Add the new parameters to support model_selection using them:
self.params += ["hidden_dim1", "hidden_dim2"]
def build_graph(self):
return nn.Sequential(
nn.Linear(self.input_dim, self.hidden_dim1),
self.hidden_activation,
nn.Linear(self.hidden_dim1, self.hidden_dim2),
self.hidden_activation,
nn.Linear(self.hidden_dim2, self.n_classes_))
deep_mod = TorchDeeperNeuralClassifier()
deep_mod
TorchDeeperNeuralClassifier( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05, hidden_activation=Tanh(), hidden_dim1=50, hidden_dim2=50)
_ = deep_mod.fit(X_cls_train, y_cls_train)
Finished epoch 1000 of 1000; error is 0.023747699335217476
deep_preds = deep_mod.predict(X_cls_test)
print(classification_report(y_cls_test, deep_preds))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.94      1.00      0.97        15
           2       1.00      0.94      0.97        16

    accuracy                           0.98        50
   macro avg       0.98      0.98      0.98        50
weighted avg       0.98      0.98      0.98        50
To try to find optimal values for the hidden layer dimensionalities, we could do some hyperparameter tuning:
xval = GridSearchCV(
TorchDeeperNeuralClassifier(),
param_grid={
'hidden_dim1': [5, 10],
'hidden_dim2': [5, 10]})
best_mod = xval.fit(X_cls_train, y_cls_train)
Finished epoch 1000 of 1000; error is 0.060364335775375366
xval.best_score_
0.9672889488678962
best_mod
GridSearchCV(estimator=TorchDeeperNeuralClassifier( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05, hidden_activation=Tanh(), hidden_dim1=50, hidden_dim2=50), param_grid={'hidden_dim1': [5, 10], 'hidden_dim2': [5, 10]})
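With the search complete, the usual sklearn GridSearchCV attributes are available. For instance (a sketch assuming the default refit=True behavior, which refits the best model on all the training data):

print(xval.best_params_)
best_preds = best_mod.predict(X_cls_test)
print(classification_report(y_cls_test, best_preds))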
It is also easy to write regression models. For these, we will subclass TorchModelBase directly, since some fundamental things are different from the classifiers above.
For illustrations, we'll use a random split of the Boston house prices dataset:
def boston_split():
dataset = load_boston()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
return X_train, X_test, y_train, y_test
X_reg_train, X_reg_test, y_reg_train, y_reg_test = boston_split()
For linear regression, we create an nn.Module
subclass:
class TorchLinearRegressionModel(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.input_dim = input_dim
self.w = nn.Parameter(torch.zeros(self.input_dim))
self.b = nn.Parameter(torch.zeros(1))
def forward(self, X):
return X.matmul(self.w) + self.b
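As an aside, this hand-rolled parameterization is equivalent (up to initialization) to a single nn.Linear layer with one output unit. A sketch of that variant, with a hypothetical name:

class TorchLinearRegressionModelAlt(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # One output unit; `nn.Linear` bundles the weights and bias:
        self.layer = nn.Linear(input_dim, 1)

    def forward(self, X):
        # Squeeze away the final dimension so predictions have the
        # same shape as `y`, as `nn.MSELoss` expects:
        return self.layer(X).squeeze(-1)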
The estimator itself, a subclass of TorchModelBase, needs the following methods:

- build_graph: to use TorchLinearRegressionModel from above.
- build_dataset: for processing the data.
- predict: for making predictions.
- score: technically optional, but required for sklearn.model_selection usage.

class TorchLinearRegression(TorchModelBase):
def __init__(self, **base_kwargs):
super().__init__(**base_kwargs)
self.loss = nn.MSELoss(reduction="mean")
def build_graph(self):
return TorchLinearRegressionModel(self.input_dim)
def build_dataset(self, X, y=None):
"""
This function will be used in training (when there is a `y`)
and in prediction (no `y`). For both cases, we rely on a
`TensorDataset`.
"""
X = torch.FloatTensor(X)
self.input_dim = X.shape[1]
if y is None:
dataset = torch.utils.data.TensorDataset(X)
else:
y = torch.FloatTensor(y)
dataset = torch.utils.data.TensorDataset(X, y)
return dataset
def predict(self, X, device=None):
"""
The `_predict` function of the base class handles all the
details around data formatting. In this case, the
raw output of `self.model`, as given by
`TorchLinearRegressionModel.forward` is all we need.
"""
return self._predict(X, device=device).cpu().numpy()
def score(self, X, y):
"""
Follow sklearn in using `r2_score` as the default scorer.
"""
preds = self.predict(X)
return r2_score(y, preds)
lr = TorchLinearRegression()
lr
TorchLinearRegression( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05)
_ = lr.fit(X_reg_train, y_reg_train)
Finished epoch 1000 of 1000; error is 52.95167922973633
lr_preds = lr.predict(X_reg_test)
r2_score(y_reg_test, lr_preds)
0.3236728529459678
We can extend the subclass we just created to easily create deeper regression models. Here's an example showing that all we need is a deeper nn.Module (given its own name here so that it doesn't overwrite TorchLinearRegressionModel from above) and a new build_graph method in the main estimator:
class TorchDeeperLinearRegressionModel(nn.Module):
def __init__(self, input_dim, hidden_dim, hidden_activation):
super().__init__()
self.input_dim = input_dim
self.hidden_dim = hidden_dim
self.hidden_activation = hidden_activation
self.input_layer = nn.Linear(self.input_dim, self.hidden_dim)
self.w = nn.Parameter(torch.zeros(self.hidden_dim))
self.b = nn.Parameter(torch.zeros(1))
def forward(self, X):
h = self.hidden_activation(self.input_layer(X))
return h.matmul(self.w) + self.b
class TorchDeeperLinearRegression(TorchLinearRegression):
def __init__(self, hidden_dim=20, hidden_activation=nn.Tanh(), **kwargs):
super().__init__(**kwargs)
self.hidden_dim = hidden_dim
self.hidden_activation = hidden_activation
self.params += ["hidden_dim", "hidden_activation"]
def build_graph(self):
        return TorchDeeperLinearRegressionModel(
input_dim=self.input_dim,
hidden_dim=self.hidden_dim,
hidden_activation=self.hidden_activation)
deep_lr = TorchDeeperLinearRegression()
deep_lr
TorchDeeperLinearRegression( batch_size=1028, max_iter=1000, eta=0.001, optimizer_class=<class 'torch.optim.adam.Adam'>, l2_strength=0, gradient_accumulation_steps=1, max_grad_norm=None, validation_fraction=0.1, early_stopping=False, n_iter_no_change=10, warm_start=False, tol=1e-05, hidden_dim=20, hidden_activation=Tanh())
_ = deep_lr.fit(X_reg_train, y_reg_train)
Finished epoch 1000 of 1000; error is 132.6202392578125
deep_lr_preds = deep_lr.predict(X_reg_test)
r2_score(y_reg_test, deep_lr_preds)
-0.3762662051157306
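This is quite poor. One plausible explanation (an assumption, not something we verify here) is that the raw Boston features are on very different scales, which can destabilize gradient-based training. Standardizing the features before fitting is a common remedy; a sketch:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_reg_train_std = scaler.fit_transform(X_reg_train)
X_reg_test_std = scaler.transform(X_reg_test)

# Refit on the standardized features (results will vary):
deep_lr_std = TorchDeeperLinearRegression()
_ = deep_lr_std.fit(X_reg_train_std, y_reg_train)
r2_score(y_reg_test, deep_lr_std.predict(X_reg_test_std))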
As a final illustrative example, let's make use of our existing RNN classifier components to create a model that can do full sequence labeling. PyTorch's abstractions concerning how layers interact and how loss functions work make this surprisingly easy.
For examples, we'll use the CoNLL 2002 shared task on named entity labeling in Spanish. NLTK provides an easy interface:
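If you haven't used this corpus before, it may need to be downloaded first (a one-time step, depending on your local NLTK setup):

nltk.download('conll2002')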
def sequence_dataset():
train_seq = nltk.corpus.conll2002.iob_sents('esp.train')
X = [[x[0] for x in seq] for seq in train_seq]
y = [[x[2] for x in seq] for seq in train_seq]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
vocab = sorted({w for seq in X_train for w in seq}) + ["$UNK"]
return X_train, X_test, y_train, y_test, vocab
X_seq_train, X_seq_test, y_seq_train, y_seq_test, seq_vocab = sequence_dataset()
Here are the first few tokens in the first training example:
X_seq_train[0][: 8]
['La', 'compañía', 'estatal', 'de', 'electricidad', 'de', 'Suecia', ',']
And the corresponding labels:
y_seq_train[0][: 8]
['O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
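The label distribution is heavily skewed toward 'O', which is worth keeping in mind when we score with macro-F1 below. A quick way to check (a sketch):

from collections import Counter

Counter(label for seq in y_seq_train for label in seq)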
We'll start with the nn.Module subclass we need. In torch_rnn_classifier.py, we already have a pretty generic RNN module: TorchRNNModel. For the classifier use, TorchRNNClassifierModel uses the output of TorchRNNModel to define a classifier based on the final output state. For sequence labeling, we drop TorchRNNClassifierModel and replace it with a model that has a classifier on every output state:
class TorchSequenceLabeler(nn.Module):
def __init__(self, rnn, output_dim):
super().__init__()
self.rnn = rnn
self.output_dim = output_dim
if self.rnn.bidirectional:
self.classifier_dim = self.rnn.hidden_dim * 2
else:
self.classifier_dim = self.rnn.hidden_dim
self.classifier_layer = nn.Linear(
self.classifier_dim, self.output_dim)
def forward(self, X, seq_lengths):
outputs, state = self.rnn(X, seq_lengths)
outputs, seq_length = torch.nn.utils.rnn.pad_packed_sequence(
outputs, batch_first=True)
logits = self.classifier_layer(outputs)
# During training, we need to swap the dimensions of logits
# to accommodate `nn.CrossEntropyLoss`:
if self.training:
return logits.transpose(1, 2)
else:
return logits
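The transpose in forward is worth dwelling on: for sequence targets, nn.CrossEntropyLoss expects logits of shape (batch_size, n_classes, seq_length) when the targets have shape (batch_size, seq_length). A small self-contained check, with toy dimensions:

loss = nn.CrossEntropyLoss()
batch_size, n_classes, seq_length = 2, 4, 3
logits = torch.randn(batch_size, n_classes, seq_length)
targets = torch.randint(0, n_classes, (batch_size, seq_length))
print(loss(logits, targets))  # a single scalar averaged over all tokens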
We won't normally interact with this module directly, but it's perhaps instructive to see how it works on its own:
vocab_size = 4
seq_rnn = TorchRNNModel(vocab_size, embed_dim=4, hidden_dim=5)
seq_module = TorchSequenceLabeler(seq_rnn, vocab_size)
_ = seq_module.eval()
toy_seqs = torch.LongTensor([[0,1,2], [0,2,1]])
seq_lengths = torch.LongTensor([3,3])
This should return two sequences of 4-dimensional vectors – the per-token logits:
seq_module(toy_seqs, seq_lengths)
tensor([[[ 0.3255,  0.2848,  0.3470, -0.1150],
         [ 0.2264,  0.3246,  0.3123, -0.1394],
         [ 0.1972,  0.3036,  0.3240, -0.0696]],

        [[ 0.3255,  0.2848,  0.3470, -0.1150],
         [ 0.2272,  0.2959,  0.3383, -0.0673],
         [ 0.1895,  0.3257,  0.3078, -0.1153]]], grad_fn=<AddBackward0>)
The remaining tasks concern the new estimator. We need to define the following methods:

- build_graph: to use TorchSequenceLabeler.
- build_dataset: just like what we need for a classifier, but it has to deal with examples as full sequences.
- predict_proba: like a classifier predict_proba, but it needs to remove any sequence padding and deal with full sequences.
- predict: just like a classifier predict method, but defined for sequences.
- score: also very much like a classifier score function, but designed to deal with sequences.

class TorchRNNSequenceLabeler(TorchRNNClassifier):
def build_graph(self):
rnn = TorchRNNModel(
vocab_size=len(self.vocab),
embedding=self.embedding,
use_embedding=self.use_embedding,
embed_dim=self.embed_dim,
rnn_cell_class=self.rnn_cell_class,
hidden_dim=self.hidden_dim,
bidirectional=self.bidirectional,
freeze_embedding=self.freeze_embedding)
model = TorchSequenceLabeler(
rnn=rnn,
output_dim=self.n_classes_)
self.embed_dim = rnn.embed_dim
return model
def build_dataset(self, X, y=None):
X, seq_lengths = self._prepare_sequences(X)
if y is None:
return TorchRNNDataset(X, seq_lengths)
else:
# These are the changes from a regular classifier. All
# concern the fact that our labels are sequences of labels.
self.classes_ = sorted({x for seq in y for x in seq})
self.n_classes_ = len(self.classes_)
class2index = dict(zip(self.classes_, range(self.n_classes_)))
# `y` is a list of tensors of different length. Our Dataset
# class will turn it into a padding tensor for processing.
y = [torch.tensor([class2index[label] for label in seq])
for seq in y]
return TorchRNNDataset(X, seq_lengths, y)
def predict_proba(self, X):
seq_lengths = [len(ex) for ex in X]
# The base class does the heavy lifting:
preds = self._predict(X)
# Trim to the actual sequence lengths:
preds = [p[: l] for p, l in zip(preds, seq_lengths)]
# Use `softmax`; the model doesn't do this because the loss
# function does it internally.
probs = [torch.softmax(seq, dim=1) for seq in preds]
return probs
def predict(self, X):
probs = self.predict_proba(X)
return [[self.classes_[i] for i in seq.argmax(axis=1)] for seq in probs]
def score(self, X, y):
preds = self.predict(X)
flat_preds = [x for seq in preds for x in seq]
flat_y = [x for seq in y for x in seq]
return utils.safe_macro_f1(flat_y, flat_preds)
seq_mod = TorchRNNSequenceLabeler(
seq_vocab,
early_stopping=True,
eta=0.001)
%time _ = seq_mod.fit(X_seq_train, y_seq_train)
Stopping after epoch 17. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 8.602030873298645
CPU times: user 24min 41s, sys: 3min 21s, total: 28min 3s Wall time: 10min 22s
seq_mod.score(X_seq_test, y_seq_test)
0.11311924082554141
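To see the labels themselves, we can compare predictions with the gold sequences. A quick sketch using the first test example:

pred_seq = seq_mod.predict(X_seq_test[: 1])[0]
list(zip(X_seq_test[0], y_seq_test[0], pred_seq))[: 8]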