Notebook

If you're opening this Notebook on colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it.

In [ ]:

#! pip install transformers datasets huggingface_hub

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up here if you haven't already!), then run the following cell and input your token:

In [ ]:

from huggingface_hub import notebook_login

notebook_login()

Then you need to install Git-LFS and set up Git if you haven't already. Uncomment the following instructions and adapt them with your name and email:

In [1]:

# !apt install git-lfs
# !git config --global user.email "you@example.com"
# !git config --global user.name "Your Name"

Make sure your version of Transformers is at least 4.16.0 since some of the functionality we use was introduced in that version:

In [1]:

import transformers

print(transformers.__version__)

4.21.0.dev0

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [ ]:

from transformers.utils import send_example_telemetry

send_example_telemetry("multiple_choice_notebook", framework="tensorflow")

Fine-tuning a model on a multiple choice task¶

In this notebook, we will see how to fine-tune one of the 🤗 Transformers models on a multiple-choice task. In a multiple-choice task, multiple answers or continuations are provided for each input, and the model must guess which is most plausible. The dataset used here is SWAG, but you can adapt the pre-processing to any other multiple-choice dataset you like, or to your own data. SWAG is a dataset about commonsense reasoning, where each example describes a situation and proposes four continuations that could follow it.

This notebook is built to run with any model checkpoint from the Model Hub as long as that model has a version with a multiple-choice head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those two parameters, then the rest of the notebook should run smoothly.

In [2]:

model_checkpoint = "bert-base-cased"
batch_size = 16

Loading the dataset¶

We will use the 🤗 Datasets library to download the data. This can be easily done with the load_dataset function.

In [3]:

from datasets import load_dataset

load_dataset will cache the dataset to avoid downloading it again the next time you run this cell.

In [4]:

datasets = load_dataset("swag", "regular")

Reusing dataset swag (/home/matt/.cache/huggingface/datasets/swag/regular/0.0.0/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c)

  0%|          | 0/3 [00:00<?, ?it/s]

The dataset object itself is a DatasetDict, which contains one key each for the training, validation and test sets.

In [5]:

datasets

Out[5]:

DatasetDict({
    train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20005
    })
})

To access an actual element, you need to select a split first, then give an index:

In [6]:

datasets["train"][0]

Out[6]:

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [7]:

from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [8]:

show_random_elements(datasets["train"])

	video-id	fold-ind	startphrase	sent1	sent2	gold-source	ending0	ending1	ending2	ending3	label
0	lsmdc0005_Chinatown-48562	957	In a moment Cross can be seen, looking toward camera. He	In a moment Cross can be seen, looking toward camera.	He	gen	closes the pantry door a little bit more.	crosses to a door to a thick dark case that closes clear.	looks up at an empty window in the club, then hangs up, embarrassed on the cold.	is carrying a trail, which wraps around his ankle.	0
1	anetv_W6y6Vmk5edg	11587	A gymnast is seen leaning across a long beam and begins performing a gymnastics routine. The girl	A gymnast is seen leaning across a long beam and begins performing a gymnastics routine.	The girl	gen	performs several kickboxing while moving her arms in and down.	does a gymnastic routine on the balance beam and herself on to the balance beam.	move through several flips and flips on the beam.	begins jumping up and down on the mat.	2
2	lsmdc3042_KARATE_KID-19795	12868	He stares off, vacantly. Someone	He stares off, vacantly.	Someone	gold	puts a three - slim guard into his mouth and kisses him.	is in slow motion, his knees still scarred by the battle.	flips across the dark to the car.	paces toward him and glances at the wreck.	3
3	lsmdc0026_The_Big_Fish-62739	18560	Someone walks into the river, up to his knees. He	Someone walks into the river, up to his knees.	He	gold	is pulled away, and his shoulders upset.	finds his hand seething, standing around the cab, raising his head.	turns back so his father can face the crowd.	moves vigorously he then turns and does a little push the bubbles into his mouth, leaving his mouth open, breathing ringed.	2
4	anetv_ZPVrC5185NM	11852	She pushes the baby back and forth on a swing. The baby	She pushes the baby back and forth on a swing.	The baby	gold	laughs and smiles as she swings.	comes back and takes a few flips.	hops backwards into the swing.	walks back inside and throws it out.	0
5	lsmdc3024_EASY_A-11617	11559	She unbuckles her seat belt and they share a tight embrace. Someone	She unbuckles her seat belt and they share a tight embrace.	Someone	gen	pulls a knapsack from a shelf by someone's coat pocket as she opens one of the boxes.	sits in the limo.	slips his arm around her waist.	blinks, then leans forward and presses his lips to her chin.	3
6	lsmdc0050_Indiana_Jones_and_the_last_crusade-70715	3232	Now he sees the pendulum has been guarding a small corridor which turns a corner to the left fifty yards ahead. Wooden wheels	Now he sees the pendulum has been guarding a small corridor which turns a corner to the left fifty yards ahead.	Wooden wheels	gold	turn - - the mechanism controlling the spinning blades.	follow through the rising foam.	cranks back in an electronic system.	press against a gold and metal gate.	0
7	anetv_fZQS02Ypca4	847	A young girl is seen sitting and speaking to the camera while using a brush to powder her face. A color pallet is held up next to her and she	A young girl is seen sitting and speaking to the camera while using a brush to powder her face.	A color pallet is held up next to her and she	gold	begins rubbing the powder all over her eyes.	uses the brush on brush drys on her hair while continuing to speak to the camera.	applies the sides and pans far away.	begins brushing her teeth and showing an image with the brush in her long hair.	0
8	lsmdc0051_Men_in_black-70855	14113	Someone barks a few orders to them. He	Someone barks a few orders to them.	He	gen	has a hair mustache.	pretends to sit back.	puts his slacks back on.	sends aiming a heft torch.	1
9	anetv_76RoR_LbIzQ	16378	Pictures of an office is shown. A woman	Pictures of an office is shown.	A woman	gold	is explaining how she is wiping the car drive.	is sitting on a long swing holding a stick.	is doing another woman's hair.	is seen playing a song on a stage.	2

Each example in the dataset has a context composed of a first sentence (sent1) and an introduction to the second sentence (sent2). Then four possible endings are given (ending0, ending1, ending2 and ending3) and the model must pick the right one (label). The following function lets us visualize a given example a bit better:

In [9]:

def show_one(example):
    print(f"Context: {example['sent1']}")
    print(f"  A - {example['sent2']} {example['ending0']}")
    print(f"  B - {example['sent2']} {example['ending1']}")
    print(f"  C - {example['sent2']} {example['ending2']}")
    print(f"  D - {example['sent2']} {example['ending3']}")
    print(f"\nGround truth: option {['A', 'B', 'C', 'D'][example['label']]}")

In [10]:

show_one(datasets["train"][0])

Context: Members of the procession walk down the street holding small horn brass instruments.
  A - A drum line passes by walking down the street playing their instruments.
  B - A drum line has heard approaching them.
  C - A drum line arrives and they're outside dancing and asleep.
  D - A drum line turns the lead singer watches the performance.

Ground truth: option A

In [11]:

show_one(datasets["train"][15])

Context: Now it's someone's turn to rain blades on his opponent.
  A - Someone pats his shoulder and spins wildly.
  B - Someone lunges forward through the window.
  C - Someone falls to the ground.
  D - Someone rolls up his fast run from the water and tosses in the sky.

Ground truth: option C

Preprocessing the data¶

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers Tokenizer which will (as the name indicates) tokenize the inputs, convert the tokens to their corresponding IDs in the pretrained vocabulary and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which will ensure:

we get a tokenizer that corresponds to the model architecture we want to use,
we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [12]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You can directly call this tokenizer on one sentence or a pair of sentences:

In [13]:

tokenizer("Hello, this is a sentence!", "And this sentence goes with it.")

Out[13]:

{'input_ids': [101, 8667, 117, 1142, 1110, 170, 5650, 106, 102, 1262, 1142, 5650, 2947, 1114, 1122, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in this tutorial if you're interested.

We can now write the function that will preprocess our samples. The tricky part is to put all the possible pairs of sentences into two big lists before passing them to the tokenizer, then un-flatten the result so that each example has four input ids, attentions masks, etc.

When calling the tokenizer, we use the argument truncation=True. This ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model.

In [14]:

ending_names = ["ending0", "ending1", "ending2", "ending3"]


def preprocess_function(examples):
    # Repeat each first sentence four times to go with the four possibilities of second sentences.
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    # Grab all second sentences possible for each context.
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names]
        for i, header in enumerate(question_headers)
    ]

    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten
    return {
        k: [v[i : i + 4] for i in range(0, len(v), 4)]
        for k, v in tokenized_examples.items()
    }

This function works with one or several examples. In the case of several examples, it returns a list of lists of lists for each key: a list of all examples (here 5), then a list of all choices (4), then a list of input IDs (of varying length, since we did not apply any padding):

In [15]:

examples = datasets["train"][:5]
features = preprocess_function(examples)
print(
    len(features["input_ids"]),
    len(features["input_ids"][0]),
    [len(x) for x in features["input_ids"][0]],
)

5 4 [30, 25, 30, 28]
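To make the flatten/un-flatten pattern easier to follow, here is a minimal sketch on a toy batch. The sentences are made up, and a whitespace split stands in for the real tokenizer, so only the grouping logic is illustrated. It uses itertools.chain instead of sum(list, []) purely as an alternative way to flatten.

```python
from itertools import chain

# Toy batch in the same columnar layout that map(..., batched=True) passes in.
# The sentences are invented for illustration.
examples = {
    "sent1": ["The cat sat.", "Rain fell."],
    "sent2": ["It", "People"],
    "ending0": ["purred.", "ran."],
    "ending1": ["flew.", "sang."],
    "ending2": ["barked.", "slept."],
    "ending3": ["melted.", "danced."],
}
ending_names = ["ending0", "ending1", "ending2", "ending3"]
num_choices = len(ending_names)

# One (context, candidate) pair per choice, flattened across all examples.
first = list(chain.from_iterable([c] * num_choices for c in examples["sent1"]))
second = list(
    chain.from_iterable(
        [f"{header} {examples[end][i]}" for end in ending_names]
        for i, header in enumerate(examples["sent2"])
    )
)

# Stand-in for the tokenizer: one fake "input_ids" list per pair.
flat_ids = [(a + " " + b).split() for a, b in zip(first, second)]

# Un-flatten: regroup every num_choices consecutive rows into one example.
grouped = [
    flat_ids[i : i + num_choices] for i in range(0, len(flat_ids), num_choices)
]

print(len(first), len(grouped), len(grouped[0]))
```

The flat lists hold 8 pairs (2 examples times 4 choices), and regrouping gives back 2 examples of 4 choices each.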

To check we didn't do anything wrong when grouping all possibilities and unflattening them, let's have a look at the decoded inputs for a given example:

In [16]:

idx = 3
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(4)]

Out[16]:

['[CLS] A drum line passes by walking down the street playing their instruments. [SEP] Members of the procession are playing ping pong and celebrating one left each in quick. [SEP]',
 '[CLS] A drum line passes by walking down the street playing their instruments. [SEP] Members of the procession wait slowly towards the cadets. [SEP]',
 '[CLS] A drum line passes by walking down the street playing their instruments. [SEP] Members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions. [SEP]',
 '[CLS] A drum line passes by walking down the street playing their instruments. [SEP] Members of the procession play and go back and forth hitting the drums while the audience claps for them. [SEP]']

We can compare it to the ground truth:

In [17]:

show_one(datasets["train"][3])

Context: A drum line passes by walking down the street playing their instruments.
  A - Members of the procession are playing ping pong and celebrating one left each in quick.
  B - Members of the procession wait slowly towards the cadets.
  C - Members of the procession makes a square call and ends by jumping down into snowy streets where fans begin to take their positions.
  D - Members of the procession play and go back and forth hitting the drums while the audience claps for them.

Ground truth: option D

This seems alright, so we can apply this function to all the examples in our dataset. All we need to do is use the map method of the datasets object we created earlier. This will apply the function to all the elements of all the splits in datasets, so our training, validation and testing data will be preprocessed in one single command.

In [18]:

encoded_datasets = datasets.map(preprocess_function, batched=True)

Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/swag/regular/0.0.0/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c/cache-1931e2c0368a1bc4.arrow
Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/swag/regular/0.0.0/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c/cache-df4adf2eee309953.arrow
Loading cached processed dataset at /home/matt/.cache/huggingface/datasets/swag/regular/0.0.0/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c/cache-41eb40ca99e099f0.arrow

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to map has changed (and the cached data should therefore not be used). For instance, it will properly detect if you change the model checkpoint in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files, and you can pass load_from_cache_file=False in the call to map to ignore the cached files and force the preprocessing to be applied again.

Note that we passed batched=True to encode the texts in batches. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to handle the texts in a batch concurrently.

Fine-tuning the model¶

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is multiple choice, we use the TFAutoModelForMultipleChoice class. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

In [19]:

from transformers import TFAutoModelForMultipleChoice

model = TFAutoModelForMultipleChoice.from_pretrained(model_checkpoint)

2022-07-21 13:43:42.021257: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.060063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.061084: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.062943: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-21 13:43:42.066328: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.067335: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.068316: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.798900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.799597: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.800270: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-21 13:43:42.800895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21750 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:21:00.0, compute capability: 8.6
2022-07-21 13:43:43.912195: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
All model checkpoint layers were used when initializing TFBertForMultipleChoice.

Some layers of TFBertForMultipleChoice were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

The warning is telling us that some weights (here the classifier layer) were not found in the checkpoint and are randomly initialized. This is absolutely normal in this case, because the head used to pretrain the model on a masked language modeling objective is replaced with a new multiple-choice head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

Next, we set some names and hyperparameters for the model. The first two variables are used so we can push the model to the Hub at the end of training. Remove the two of them if you didn't follow the installation steps at the top of the notebook, otherwise you can change the value of push_to_hub_model_id to something you would prefer.

In [20]:

model_name = model_checkpoint.split("/")[-1]
push_to_hub_model_id = f"{model_name}-finetuned-swag"

learning_rate = 5e-5
# batch_size was already set at the top of the notebook
num_train_epochs = 2
weight_decay = 0.01

Next we need to tell our Dataset how to form batches from the pre-processed inputs. We haven't done any padding yet because we will pad each batch to the maximum length inside the batch (instead of doing so with the maximum length of the whole dataset). This will be the job of the data collator. A data collator takes a list of examples and converts them to a batch (by, in our case, applying padding). Since there is no data collator in the library that works on our specific problem, we will write one, adapted from the DataCollatorWithPadding:

In [21]:

from dataclasses import dataclass
from transformers.tokenization_utils_base import (
    PreTrainedTokenizerBase,
    PaddingStrategy,
)
from typing import Optional, Union
import tensorflow as tf


@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)]
            for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="np",
        )

        # Un-flatten
        batch = {
            k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()
        }
        # Add back labels
        batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
        return batch

When called on a list of examples, it will flatten all the input IDs, attention masks, etc. into big lists that it passes to the tokenizer.pad method. This returns a dictionary of big tensors (of shape (batch_size * 4) x seq_length) that we then unflatten.

We can check that this data collator works on a list of features. We just have to make sure to remove all features that are not inputs accepted by our model:
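The padding and reshaping can be sketched with plain NumPy. This is only an illustration of the shapes involved, not the collator itself: the token IDs are hypothetical, and the pad ID of 0 is an assumption made for this sketch.

```python
import numpy as np

# Hypothetical token IDs: 2 examples x 4 choices, with varying lengths.
features = [
    {"input_ids": [[101, 7, 102], [101, 7, 8, 102], [101, 102], [101, 9, 102]]},
    {"input_ids": [[101, 5, 6, 7, 102], [101, 5, 102], [101, 102], [101, 6, 102]]},
]
batch_size = len(features)
num_choices = len(features[0]["input_ids"])
pad_id = 0  # assumed pad token ID for this sketch

# Flatten to (batch_size * num_choices) rows.
rows = [ids for feature in features for ids in feature["input_ids"]]

# Pad every row to the longest row in this batch only.
max_len = max(len(r) for r in rows)
input_ids = np.array([r + [pad_id] * (max_len - len(r)) for r in rows])
attention_mask = np.array([[1] * len(r) + [0] * (max_len - len(r)) for r in rows])

# Un-flatten back to (batch_size, num_choices, seq_length).
input_ids = input_ids.reshape(batch_size, num_choices, -1)
attention_mask = attention_mask.reshape(batch_size, num_choices, -1)

print(input_ids.shape)
```

Because the rows were flattened example by example, the reshape puts the four choices of each example back together in their original order.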

In [22]:

accepted_keys = ["input_ids", "attention_mask", "label"]
features = [
    {k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys}
    for i in range(10)
]
batch = DataCollatorForMultipleChoice(tokenizer)(features)

In [23]:

encoded_datasets["train"].features["attention_mask"].feature.feature

Out[23]:

Value(dtype='int8', id=None)

Again, all those flatten/un-flattens are sources of potential errors so let's make another sanity check on our inputs:

In [24]:

[tokenizer.decode(batch["input_ids"][8][i].numpy().tolist()) for i in range(4)]

Out[24]:

['[CLS] Someone walks over to the radio. [SEP] Someone hands her another phone. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] Someone walks over to the radio. [SEP] Someone takes the drink, then holds it. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] Someone walks over to the radio. [SEP] Someone looks off then looks at someone. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] Someone walks over to the radio. [SEP] Someone stares blearily down at the floor. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']

In [25]:

show_one(datasets["train"][8])

Context: Someone walks over to the radio.
  A - Someone hands her another phone.
  B - Someone takes the drink, then holds it.
  C - Someone looks off then looks at someone.
  D - Someone stares blearily down at the floor.

Ground truth: option D

All good! Now we can use this collator as a collation function for our dataset.

Next, we convert our datasets to tf.data.Dataset, which Keras understands natively. There are two ways to do this - we can use the slightly more low-level Dataset.to_tf_dataset() method, or we can use Model.prepare_tf_dataset(). The main difference between these two is that the Model method can inspect the model to determine which column names it can use as input, which means you don't need to specify them yourself.

In [26]:

data_collator = DataCollatorForMultipleChoice(tokenizer)

train_set = model.prepare_tf_dataset(
    encoded_datasets["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

validation_set = model.prepare_tf_dataset(
    encoded_datasets["validation"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

train_set

Out[26]:

<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(16, 4, None), dtype=tf.int64, name=None), 'token_type_ids': TensorSpec(shape=(16, 4, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, 4, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(16,), dtype=tf.int64, name=None))>

As we can see, our dataset will output a 2-tuple where the first element is a dict containing input_ids, token_type_ids and attention_mask, and the second element is the label. This is exactly what we want for our model!

Now we can compile our model. First, we specify an optimizer. Using the create_optimizer function we can get a nice AdamW optimizer with weight decay and a learning rate decay schedule set up for free - but to compute that schedule, it needs to know how long training will take.

In [27]:

from transformers import create_optimizer

total_train_steps = (len(encoded_datasets["train"]) // batch_size) * num_train_epochs

optimizer, schedule = create_optimizer(
    init_lr=learning_rate, num_warmup_steps=0, num_train_steps=total_train_steps
)
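As a quick check on the arithmetic, the sketch below recomputes the step count from the training split size shown earlier (73546 rows). The linear_decay helper is hypothetical and only assumes a schedule that decays linearly to zero without warmup; check the create_optimizer documentation for the exact schedule it builds.

```python
num_examples = 73546  # size of the SWAG training split shown earlier
batch_size = 16
num_train_epochs = 2
learning_rate = 5e-5

# Integer division drops the final partial batch.
steps_per_epoch = num_examples // batch_size
total_train_steps = steps_per_epoch * num_train_epochs


def linear_decay(step, init_lr=learning_rate, total=total_train_steps):
    """Hypothetical helper: linear decay from init_lr to 0, no warmup."""
    return init_lr * max(0.0, 1.0 - step / total)


print(steps_per_epoch, total_train_steps)
```

The per-epoch value (4596) matches the step count that Keras reports in the training progress bar further down.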

Note that most Transformers models compute loss internally, so we actually don't have to specify anything there! You can of course set your own loss function if you want, but by default our models will choose the 'obvious' loss that matches their task, such as cross-entropy in the case of language modelling. The built-in loss will also correctly handle things like masking the loss on padding tokens, or unlabelled tokens in the case of masked language modelling, so we recommend using it unless you're an advanced user!

In addition, because the outputs and loss for this model class are quite straightforward, we can use built-in Keras metrics - these are liable to misbehave in other contexts (for example, they don't know about the masking in masked language modelling) but work well here.
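For a multiple-choice head, the logits have one score per choice, so the natural loss is cross-entropy over the choice dimension and accuracy is a simple argmax comparison. The NumPy sketch below illustrates both on hypothetical logits and labels; it is not the library's internal implementation, and averaging over the batch is a choice made for this illustration.

```python
import numpy as np

# Hypothetical logits for a batch of 3 examples with 4 choices each.
logits = np.array(
    [
        [2.0, 0.5, -1.0, 0.0],
        [0.1, 0.2, 3.0, -0.5],
        [1.0, 1.5, 0.0, 0.5],
    ]
)
labels = np.array([0, 2, 3])

# Numerically stable log-softmax over the choice dimension.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Cross-entropy: negative log-probability of the correct choice,
# averaged over the batch here for illustration.
loss = -log_probs[np.arange(len(labels)), labels].mean()

# Accuracy: fraction of examples whose highest-scoring choice is the label.
predictions = logits.argmax(axis=-1)
accuracy = (predictions == labels).mean()

print(predictions, accuracy)
```

Here the first two examples are predicted correctly and the third is not, so the accuracy is 2/3.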

In some of our other examples, we use jit_compile to compile the model with XLA. In this case, we should be careful about that - because our inputs have variable sequence lengths, we may end up having to do a new XLA compilation for each possible length, because XLA compilation expects a static input shape! For small datasets, this will probably result in spending more time on XLA compilation than actually training, which isn't very helpful.

If you really want to use XLA without these problems (for example, if you're training on TPU), you can pad with padding="max_length" when calling the tokenizer. This will pad all of your samples to the same length, ensuring that a single XLA compilation will suffice for your entire dataset. Note that depending on the nature of your dataset, this may result in a lot of wasted computation on padding tokens!
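The trade-off can be made concrete with a small pure-Python sketch. The sample lengths, the batch size of 2 and the max_length of 64 are all made-up values; the point is only to count distinct batch shapes (each of which would need its own XLA compilation) against the number of padding tokens spent.

```python
# Hypothetical tokenized lengths for 8 samples, grouped into batches of 2.
lengths = [12, 30, 25, 18, 30, 9, 41, 22]
batch_size = 2
batches = [lengths[i : i + batch_size] for i in range(0, len(lengths), batch_size)]

# Dynamic padding: each batch is padded to its own longest sample.
dynamic_shapes = {(batch_size, max(b)) for b in batches}

# Fixed padding: every batch is padded to one chosen length (64 is arbitrary).
max_length = 64
fixed_shapes = {(batch_size, max_length) for _ in batches}

# Padding tokens spent under each strategy.
dynamic_padding = sum(max(b) * len(b) - sum(b) for b in batches)
fixed_padding = sum(max_length * len(b) - sum(b) for b in batches)

print(len(dynamic_shapes), len(fixed_shapes))
print(dynamic_padding, fixed_padding)
```

In this toy case dynamic padding produces 3 distinct shapes but only 65 padding tokens, while fixed padding produces a single shape at the cost of 325 padding tokens.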

In [28]:

import tensorflow as tf

model.compile(
    optimizer=optimizer,
    metrics=["accuracy"],
)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.

Now we can train our model. We can also add a callback to sync up our model with the Hub - this allows us to resume training from other machines and even test the model's inference quality midway through training! Make sure to change the username if you do. If you don't want to do this, simply remove the callbacks argument in the call to fit().

In [29]:

from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(log_dir="./mc_model_save/logs")

push_to_hub_callback = PushToHubCallback(
    output_dir="./mc_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)

callbacks = [tensorboard_callback, push_to_hub_callback]

model.fit(
    train_set,
    validation_data=validation_set,
    epochs=num_train_epochs,
    callbacks=callbacks,
)

/home/matt/PycharmProjects/notebooks/examples/mc_model_save is already a clone of https://huggingface.co/Rocketknight1/bert-base-cased-finetuned-swag. Make sure you pull the latest changes with `repo.git_pull()`.

Epoch 1/2
4596/4596 [==============================] - ETA: 0s - loss: 0.8709 - accuracy: 0.6465

Several commits (2) will be pushed upstream.

4596/4596 [==============================] - 827s 178ms/step - loss: 0.8709 - accuracy: 0.6465 - val_loss: 0.6167 - val_accuracy: 0.7590
Epoch 2/2
4596/4596 [==============================] - 820s 178ms/step - loss: 0.3868 - accuracy: 0.8568 - val_loss: 0.6014 - val_accuracy: 0.7795

Out[29]:

<keras.callbacks.History at 0x7fa5b8519780>

If you used the callback above, you can now share this model with all your friends, family or favorite pets: they can all load it with the identifier "your-username/the-name-you-picked" so for instance:

from transformers import TFAutoModelForMultipleChoice

model = TFAutoModelForMultipleChoice.from_pretrained("your-username/my-awesome-model")

Inference¶

Now that we've trained our model, let's see how we could load it and use it to answer questions in the future! First, let's load it from the Hub. This means we can resume the code from here without needing to rerun everything above every time.

In [31]:

from transformers import AutoTokenizer, TFAutoModelForMultipleChoice

# You can, of course, use your own username and model name here 
# once you've pushed your model using the code above!
checkpoint = "Rocketknight1/bert-base-cased-finetuned-swag"
model = TFAutoModelForMultipleChoice.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Downloading config.json:   0%|          | 0.00/645 [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/413M [00:00<?, ?B/s]

Some layers from the model checkpoint at Rocketknight1/bert-base-cased-finetuned-swag were not used when initializing TFBertForMultipleChoice: ['dropout_37']
- This IS expected if you are initializing TFBertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForMultipleChoice were initialized from the model checkpoint at Rocketknight1/bert-base-cased-finetuned-swag.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMultipleChoice for predictions without further training.

Downloading tokenizer_config.json:   0%|          | 0.00/347 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/653k [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Now let's see how to use this model for inference. The SWAG task we trained on is a commonsense inference benchmark, where we ask the model to indicate which of four completions of a sentence is realistic and makes sense in context. Let's use a sample input from SWAG and see how we can get predictions for it.

In [36]:

input_start = "Members of the procession walk down the street holding small horn brass instruments. A drum line"
endings = [
    'passes by walking down the street playing their instruments.',
    'has heard approaching them.',
    "arrives and they're outside dancing and asleep.",
    'turns the lead singer watches the performance.',
]
full_sentences = [f"{input_start} {ending}" for ending in endings]

Now we tokenize this input. Note that our inputs need to be reshaped a little - multiple choice models expect inputs to have the shape (num_samples, num_choices, num_tokens) - this means we will need to add a sample/batch dimension of length 1.

In [44]:

import numpy as np

tokenized = tokenizer(full_sentences, padding="longest", return_tensors="np")
tokenized = {key: np.expand_dims(array, 0) for key, array in tokenized.items()}

And now we run these inputs through our model and see what it guesses!

In [50]:

import tensorflow as tf

outputs = model(tokenized).logits
answer = np.argmax(outputs)

print(f"The answer is choice {answer}: {endings[answer]}")

The answer is choice 0: passes by walking down the street playing their instruments.
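If you also want a confidence for each choice, you can apply a softmax over the choice dimension. The sketch below uses hypothetical logits in place of model(tokenized).logits, so the numbers are illustrative only. Taking the argmax along the last axis (rather than over the whole array) also keeps working when you pass more than one question at a time.

```python
import numpy as np

# Hypothetical logits of shape (num_samples=1, num_choices=4); real values
# would come from model(tokenized).logits.
logits = np.array([[4.2, -1.3, 0.4, -0.8]])

# Softmax over the choice dimension turns logits into probabilities.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# One predicted choice index per sample.
answers = probs.argmax(axis=-1)

print(answers, probs.round(3))
```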
