# Transformers installation
! pip install transformers datasets
# To install from source instead of the latest release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git
Get up and running with 🤗 Transformers! Start using the pipeline() for rapid inference, and quickly load a pretrained model and tokenizer with an AutoClass to solve your text, vision or audio task.
All code examples presented in the documentation have a toggle on the top left for PyTorch and TensorFlow. If a block has no toggle, the same code is expected to work for both backends without any change.
## Pipeline

pipeline() is the easiest way to use a pretrained model for a given task.
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tiZFewofSLM?rel=0&controls=0&showinfo=0" frameborder="0" allowfullscreen></iframe>')
The pipeline() supports many common tasks out-of-the-box:
Text: sentiment analysis, text generation, named entity recognition (NER), question answering, fill-mask, summarization, translation, and feature extraction.
Image: image classification, image segmentation, and object detection.
Audio: audio classification and automatic speech recognition.
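Each of these tasks is created the same way. As a quick sketch (each string below is a built-in pipeline task identifier, and a suitable default model is downloaded and cached on first use):
from transformers import pipeline
# Each string is a built-in pipeline task identifier; a default model
# and tokenizer are downloaded and cached the first time it runs.
generator = pipeline("text-generation")
ner = pipeline("ner")
print(generator("Hello, I'm a language model", max_length=20)[0]["generated_text"])
# Vision and audio tasks work the same way, e.g. pipeline("image-classification"),
# which may require extra dependencies such as Pillow.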
For more details about the pipeline() and its associated tasks, refer to the pipeline documentation.
In the following example, you will use the pipeline() for sentiment analysis.
Install your preferred framework if you haven't already:
pip install torch
pip install tensorflow
Import pipeline() and specify the task you want to complete:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
The pipeline downloads and caches a default pretrained model and tokenizer for sentiment analysis. Now you can use the classifier on your target text:
classifier("We are very happy to show you the 🤗 Transformers library.")
[{"label": "POSITIVE", "score": 0.9998}]
For more than one sentence, pass a list of sentences to the pipeline(), which returns a list of dictionaries:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
The pipeline() can also iterate over an entire dataset. Start by installing the 🤗 Datasets library:
pip install datasets
Create a pipeline() with the task you want to solve and the model you want to use. Set the device parameter to 0 to place the tensors on a CUDA device:
from transformers import pipeline
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
Next, load a dataset you'd like to iterate over (see the 🤗 Datasets Quick Start for more details). For example, let's load the SUPERB dataset:
import datasets
dataset = datasets.load_dataset("superb", name="asr", split="test")
Now you can iterate over the dataset with the pipeline. KeyDataset retrieves the item in the dictionary returned by the dataset:
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm
for out in tqdm(speech_recognizer(KeyDataset(dataset, "file"))):
    print(out)
{"text": "HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE"}
The pipeline() can accommodate any model from the Model Hub, making it easy to adapt the pipeline() for other use cases. For example, if you'd like a model capable of handling French text, use the tags on the Model Hub to filter for an appropriate model. The top filtered result returns a multilingual BERT model fine-tuned for sentiment analysis. Great, let's use this model!
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
Use AutoModelForSequenceClassification and AutoTokenizer to load the pretrained model and its associated tokenizer (more on an AutoClass below):
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Then you can specify the model and tokenizer in the pipeline(), and apply the classifier on your target text:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")
[{"label": "5 stars", "score": 0.7272651791572571}]
If you can't find a model for your use case, you will need to fine-tune a pretrained model on your data. Take a look at our fine-tuning tutorial to learn how. Finally, after you've fine-tuned your pretrained model, please consider sharing it with the community on the Model Hub (see the model sharing tutorial) to democratize NLP for everyone! 🤗
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/AhChOFRegn4?rel=0&controls=0&showinfo=0" frameborder="0" allowfullscreen></iframe>')
## AutoClass

Under the hood, the AutoModelForSequenceClassification and AutoTokenizer classes work together to power the pipeline(). An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path. You only need to select the appropriate AutoClass for your task and its associated tokenizer with AutoTokenizer.
Let's return to our example and see how you can use the AutoClass to replicate the results of the pipeline().
### AutoTokenizer

A tokenizer is responsible for preprocessing text into a format that is understandable to the model. First, the tokenizer will split the text into words called tokens. There are multiple rules that govern the tokenization process, including how to split a word and at what level (learn more about tokenization in the tokenizer summary). The most important thing to remember is that you need to instantiate the tokenizer with the same model name to ensure you're using the same tokenization rules the model was pretrained with.
Load a tokenizer with AutoTokenizer:
from transformers import AutoTokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
Next, the tokenizer converts the tokens into numbers to construct a tensor as input to the model. The mapping from tokens to these numbers is known as the model's vocabulary.
Pass your text to the tokenizer:
encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
print(encoding)
{"input_ids": [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The tokenizer will return a dictionary containing:
input_ids: the numerical representations of your tokens.
attention_mask: a mask indicating which tokens should be attended to.
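To see the vocabulary mapping at work, you can run the two steps separately; a short sketch using the tokenizer loaded above:
# Split the text into tokens, then look each token up in the vocabulary.
tokens = tokenizer.tokenize("We are very happy.")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# decode() reverses the mapping (modulo special tokens).
print(tokenizer.decode(ids))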
Just like the pipeline(), the tokenizer will accept a list of inputs. In addition, the tokenizer can also pad and truncate the text to return a batch with uniform length:
pt_batch = tokenizer(
["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
padding=True,
truncation=True,
max_length=512,
return_tensors="pt",
)
tf_batch = tokenizer(
["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
padding=True,
truncation=True,
max_length=512,
return_tensors="tf",
)
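You can inspect the padded batch to confirm the uniform length; a quick sketch with the PyTorch batch from above:
# The shorter sentence is padded so both rows of the tensor have equal length.
print(pt_batch["input_ids"].shape)
print(pt_batch["input_ids"])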
Read the preprocessing tutorial for more details about tokenization.
### AutoModel

🤗 Transformers provides a simple and unified way to load pretrained instances. This means you can load an AutoModel like you would load an AutoTokenizer. The only difference is selecting the correct AutoModel for the task. Since you are doing text (or sequence) classification, load AutoModelForSequenceClassification. The TensorFlow equivalent is simply TFAutoModelForSequenceClassification:
from transformers import AutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
from transformers import TFAutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
See the task summary for which AutoModel class to use for which task.
Now you can pass your preprocessed batch of inputs directly to the model. If you are using a PyTorch model, unpack the dictionary by adding **. For TensorFlow models, pass the dictionary directly:
pt_outputs = pt_model(**pt_batch)
tf_outputs = tf_model(tf_batch)
The model outputs the final activations in the logits attribute. Apply the softmax function to the logits to retrieve the probabilities:
from torch import nn
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
import tensorflow as tf
tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
print(tf_predictions)
tf.Tensor(
[[2.2043e-04 9.9978e-01]
 [5.3086e-01 4.6914e-01]], shape=(2, 2), dtype=float32)
All 🤗 Transformers models (PyTorch or TensorFlow) output the tensors before the final activation function (like softmax) because the final activation function is often fused with the loss.
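To turn these probabilities into human-readable labels, look up the predicted class indices in the model's id2label mapping; a short sketch using the PyTorch predictions from above:
import torch
# argmax picks the most probable class per input; id2label maps each class
# index to the label name stored in the model config.
predicted_ids = torch.argmax(pt_predictions, dim=-1)
print([pt_model.config.id2label[i.item()] for i in predicted_ids])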
Models are a standard torch.nn.Module or a tf.keras.Model so you can use them in your usual training loop. However, to make things easier, 🤗 Transformers provides a Trainer class for PyTorch that adds functionality for distributed training, mixed precision, and more. For TensorFlow, you can use the fit method from Keras. Refer to the training tutorial for more details.
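As an illustration, here is a minimal Trainer sketch; the tiny in-memory dataset and the distilbert-base-uncased checkpoint are placeholders chosen only to keep the example self-contained:
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
ft_model_name = "distilbert-base-uncased"  # placeholder checkpoint for this sketch
ft_tokenizer = AutoTokenizer.from_pretrained(ft_model_name)
ft_model = AutoModelForSequenceClassification.from_pretrained(ft_model_name, num_labels=2)
# A stand-in for your own training data, tokenized up front.
train_dataset = Dataset.from_dict(
    {"text": ["I loved it!", "Terrible experience."], "label": [1, 0]}
).map(lambda batch: ft_tokenizer(batch["text"], truncation=True, padding=True), batched=True)
trainer = Trainer(
    model=ft_model,
    args=TrainingArguments(output_dir="./results", num_train_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()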
🤗 Transformers model outputs are special dataclasses so their attributes are autocompleted in an IDE. The model outputs also behave like a tuple or a dictionary (e.g., you can index with an integer, a slice or a string), in which case the attributes that are None are ignored.
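For example, these three ways of reading the logits from the PyTorch output above are equivalent (a quick sketch):
# Attribute, string-key, and positional access all reach the same tensor;
# None-valued attributes are skipped when indexing by position.
logits = pt_outputs.logits
assert (pt_outputs["logits"] == logits).all()
assert (pt_outputs[0] == logits).all()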
## Save a model

Once your model is fine-tuned, you can save it with its tokenizer using PreTrainedModel.save_pretrained():
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)
tf_save_directory = "./tf_save_pretrained"
tokenizer.save_pretrained(tf_save_directory)
tf_model.save_pretrained(tf_save_directory)
When you are ready to use the model again, reload it with PreTrainedModel.from_pretrained():
pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")
tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained")
One particularly cool 🤗 Transformers feature is the ability to save a model and reload it as either a PyTorch or TensorFlow model. The from_pt or from_tf parameter can convert the model from one framework to the other:
from transformers import AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)
from transformers import TFAutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True)