This notebook contains the code samples from the video below, which is part of the Hugging Face course.
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ROxrFOEbsQE?rel=0&controls=0&showinfo=0" frameborder="0" allowfullscreen></iframe>')
Install the Transformers and Datasets libraries to run this notebook.
! pip install datasets transformers[sentencepiece]
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this.",
]
tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(token_list) for token_list in tokens]
print(ids[0])
print(ids[1])
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
[1045, 5223, 2023, 1012]
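A quick length check (added here; this cell is not in the original notebook) makes the problem explicit: the two ID lists have different lengths, so they cannot be stacked into a single rectangular tensor.
# Added for illustration: the sequences are 14 and 4 tokens long, which is
# what prevents batching them into one tensor in the next cell.
print(len(ids[0]), len(ids[1]))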
import tensorflow as tf
ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
[1045, 5223, 2023, 1012]]
input_ids = tf.constant(ids)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-5c1e8b893878> in <module>
      4                [1045, 5223, 2023, 1012]]
      5
----> 6 input_ids = tf.constant(ids)

~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    263   """
    264   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 265                         allow_broadcast=True)
    266
    267

~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    274     with trace.Trace("tf.constant"):
    275       return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 276   return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    277
    278   g = ops.get_default_graph()

~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    299 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    300   """Implementation of eager constant."""
--> 301   t = convert_to_eager_tensor(value, ctx, dtype)
    302   if shape is None:
    303     return t

~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
     96     dtype = dtypes.as_dtype(dtype).as_datatype_enum
     97   ctx.ensure_initialized()
---> 98   return ops.EagerTensor(value, ctx.device_name, dtype)
     99
    100

ValueError: Can't convert non-rectangular Python sequence to Tensor.
Tensors must be rectangular: since the two sequences have different lengths, the list of lists cannot be converted to a tensor directly. The fix is to pad the shorter sequence to the length of the longer one, as in the next cell.
import tensorflow as tf
ids = [[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
[1045, 5223, 2023, 1012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
input_ids = tf.constant(ids)
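This time the conversion succeeds. As a quick check (added here; not in the original notebook), the shape confirms we now have a rectangular batch:
# Both rows have 14 elements, so the result is a (2, 14) tensor.
print(input_ids.shape)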
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token_id
0
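The padding above was written out by hand. As a minimal sketch (not part of the original notebook; pad_ids and raw_ids are hypothetical names introduced here), the same result can be derived programmatically from tokenizer.pad_token_id:
def pad_ids(sequences, pad_id):
    # Pad every sequence with pad_id up to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

raw_ids = [
    [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
    [1045, 5223, 2023, 1012],
]
input_ids = tf.constant(pad_ids(raw_ids, tokenizer.pad_token_id))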
from transformers import TFAutoModelForSequenceClassification
ids1 = tf.constant(
[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]]
)
ids2 = tf.constant([[1045, 5223, 2023, 1012]])
all_ids = tf.constant(
[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
[1045, 5223, 2023, 1012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
print(model(ids1).logits)
print(model(ids2).logits)
print(model(all_ids).logits)
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
tf.Tensor([[-2.7276204  2.8789372]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 3.9497483 -3.1357408]], shape=(1, 2), dtype=float32)
tf.Tensor(
[[-2.7276206  2.878937 ]
 [ 1.5444432 -1.3998369]], shape=(2, 2), dtype=float32)
Notice that the logits for the second sentence change when it is padded and batched ([1.54, -1.40] instead of [3.95, -3.14]): the attention layers attend to the padding tokens, which alters the sequence's representation. To get identical results, the model needs an attention mask telling it to ignore the padding.
all_ids = tf.constant(
[[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
[1045, 5223, 2023, 1012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
)
attention_mask = tf.constant(
[[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
output1 = model(ids1)
output2 = model(ids2)
print(output1.logits)
print(output2.logits)
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
tf.Tensor([[-2.7276204  2.8789372]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 3.9497483 -3.1357408]], shape=(1, 2), dtype=float32)
output = model(all_ids, attention_mask=attention_mask)
print(output.logits)
tf.Tensor(
[[-2.7276206  2.878937 ]
 [ 3.9497476 -3.1357408]], shape=(2, 2), dtype=float32)
With the attention mask, the padded second sequence now yields the same logits as when it was run on its own (up to tiny floating-point differences), because the model ignores the padding tokens.
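As a sanity check (an addition; this cell is not in the original notebook), we can verify that match numerically:
import numpy as np

# With the attention mask, the padded row matches the unpadded run
# up to floating-point noise.
print(np.allclose(output2.logits.numpy(), output.logits.numpy()[1:], atol=1e-5))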
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this.",
]
print(tokenizer(sentences, padding=True))
{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 1045, 5223, 2023, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
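In practice (a follow-up sketch, not in the original notebook), you can ask the tokenizer for TensorFlow tensors directly and feed the whole batch, attention mask included, to the model:
batch = tokenizer(sentences, padding=True, return_tensors="tf")
# batch contains both input_ids and attention_mask, so padding tokens
# are ignored by the model.
outputs = model(**batch)
print(outputs.logits)
Note that the tokenizer also adds the special [CLS] and [SEP] tokens (IDs 101 and 102 in the output above), which the manual ID lists earlier omitted, so these logits will differ slightly from the earlier runs.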