This notebook contains the code samples from the video below, which is part of the Hugging Face course.
#@title
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/A5IWIxsHLUw?rel=0&controls=0&showinfo=0" frameborder="0" allowfullscreen></iframe>')
Install the Transformers and Datasets libraries to run this notebook.
! pip install datasets transformers[sentencepiece]
You will need an authentication token with your Hugging Face credentials to use the push_to_hub method. Run huggingface-cli login in a terminal, or uncomment and run the following cell:
# !huggingface-cli login
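If you are running this in a notebook, you can also log in programmatically with the notebook_login helper from the huggingface_hub library (a minimal sketch; huggingface_hub ships as a dependency of transformers):
# Alternative to the CLI: log in from inside the notebook
# from huggingface_hub import notebook_login
# notebook_login()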
import numpy as np
from datasets import load_dataset, load_metric
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
checkpoint = "bert-base-cased"
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(examples):
    # Tokenize the sentence pairs; padding is deferred to the data collator
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
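To sanity-check the preprocessing, you can peek at the first tokenized training example (a quick sketch; the tokenizer adds input_ids, token_type_ids, and attention_mask columns alongside the original ones):
# Sketch: inspect what the tokenizer produced for one example
sample = tokenized_datasets["train"][0]
print(sample.keys())
print(tokenizer.decode(sample["input_ids"]))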
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
training_args = TrainingArguments(
    "finetuned-bert-mrpc",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    log_level="error",
    push_to_hub=True,
    push_to_hub_model_id="finetuned-bert-mrpc",
    # push_to_hub_organization="huggingface",
    # push_to_hub_token="my_token",
)
data_collator = DataCollatorWithPadding(tokenizer)
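DataCollatorWithPadding pads each batch to the length of its longest member rather than to one fixed size. As a quick illustration (a sketch, not part of the original notebook), you can collate a few samples by hand after dropping the string columns the collator cannot pad:
# Sketch: dynamic padding in action on a handful of samples
samples = tokenized_datasets["train"][:4]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})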
metric = load_metric("glue", "mrpc")
def compute_metrics(eval_preds):
    # Convert logits to class predictions before computing accuracy/F1
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
Reusing dataset glue from the local cache, along with the cached tokenized splits.
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification (the pretraining heads), and some weights of BertForSequenceClassification were newly initialized: ['classifier.weight', 'classifier.bias']. This is expected: the classification head is new, and the model should be fine-tuned on a downstream task before being used for predictions.
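Before launching a full training run, you can sanity-check compute_metrics on made-up predictions (a sketch with fake values):
# Sketch: run the metric function on fake logits for four examples
fake_logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
fake_labels = np.array([1, 0, 1, 1])
print(compute_metrics((fake_logits, fake_labels)))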
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
Epoch | Training Loss | Validation Loss | Accuracy | F1
---|---|---|---|---
1 | 0.573500 | 0.475627 | 0.774510 | 0.835714
2 | 0.383800 | 0.452076 | 0.821078 | 0.878939
3 | 0.233600 | 0.485343 | 0.850490 | 0.897133
TrainOutput(global_step=690, training_loss=0.39697496718254643, metrics={'train_runtime': 74.2937, 'train_samples_per_second': 148.115, 'train_steps_per_second': 9.287, 'total_flos': 563360051116800.0, 'train_loss': 0.39697496718254643, 'epoch': 3.0})
The Trainer
has a method to upload the model, the tokenizer, and the model configuration directly to a repo on the Hub. It will even auto-generate a draft of the model card from the hyperparameters and evaluation results!
trainer.push_to_hub()
'https://huggingface.co/sgugger/finetuned-bert/commit/12c9aaaadf60e2c48e9419524f3170f445a120d6'
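push_to_hub also accepts a commit message if you want something more descriptive than the default (a sketch; the message text here is made up):
# Sketch: push with a custom commit message
# trainer.push_to_hub(commit_message="Fine-tune bert-base-cased on GLUE MRPC")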
If you are using your own training loop, you can push the model and tokenizer separately (and you will have to write the model card yourself):
# model.push_to_hub("finetuned-bert-mrpc")
# tokenizer.push_to_hub("finetuned-bert-mrpc")
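Alternatively, save_pretrained can write the files locally and push them in the same call through its push_to_hub argument (a sketch; assumes you are logged in):
# Sketch: save locally and push to the Hub in one call
# model.save_pretrained("finetuned-bert-mrpc", push_to_hub=True)
# tokenizer.save_pretrained("finetuned-bert-mrpc", push_to_hub=True)
Once the model is on the Hub, anyone can reload it from its identifier: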
from transformers import AutoModelForSequenceClassification
model_name = "sgugger/finetuned-bert-mrpc"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
from transformers import pipeline
classifier = pipeline("text-classification", model=model_name)
classifier("My name is Sylvain. [SEP] My name is Lysandre")
[{'label': 'LABEL_0', 'score': 0.7789641618728638}]
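The pipeline returns the generic LABEL_0/LABEL_1 names because the model configuration does not map label ids to class names yet. We can set that mapping on the config and push the updated config to the Hub: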
model.config.label2id = {"not equivalent": 0, "equivalent": 1}
model.config.id2label = {0: "not equivalent", 1: "equivalent"}
model.config.push_to_hub("finetuned-bert-mrpc")
classifier = pipeline("text-classification", model=model_name)
classifier("My name is Sylvain. [SEP] My name is Lysandre")
[{'label': 'not equivalent', 'score': 0.7789641618728638}]