Install the 🤗 Datasets and 🤗 Transformers libraries to run this notebook.
!pip install datasets transformers[sentencepiece]
!apt install git-lfs
You will need to configure Git; adapt your email and name in the following cell.
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"
You will also need to be logged in to the Hugging Face Hub. Run the following cell and enter your credentials.
from huggingface_hub import notebook_login
notebook_login()
from datasets import load_dataset, load_metric
raw_datasets = load_dataset("kde4", lang1="en", lang2="fr")
raw_datasets
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets
split_datasets["validation"] = split_datasets.pop("test")
split_datasets["train"][1]["translation"]
from transformers import pipeline
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")
split_datasets["train"][172]["translation"]
translator(
"Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)
from transformers import AutoTokenizer
model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="tf")
en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]
inputs = tokenizer(en_sentence)
with tokenizer.as_target_tokenizer():
targets = tokenizer(fr_sentence)
wrong_targets = tokenizer(fr_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(targets["input_ids"]))
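As an aside, if your installed version of 🤗 Transformers is v4.22 or newer, you can skip the as_target_tokenizer() context manager and pass the target text directly through the text_target argument. A minimal sketch, where inputs_with_labels is just an illustrative name:
inputs_with_labels = tokenizer(en_sentence, text_target=fr_sentence)
print(tokenizer.convert_ids_to_tokens(inputs_with_labels["labels"]))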
max_input_length = 128
max_target_length = 128
def preprocess_function(examples):
inputs = [ex["en"] for ex in examples["translation"]]
targets = [ex["fr"] for ex in examples["translation"]]
model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
# Set up the tokenizer for the targets
with tokenizer.as_target_tokenizer():
labels = tokenizer(targets, max_length=max_target_length, truncation=True)
model_inputs["labels"] = labels["input_ids"]
return model_inputs
tokenized_datasets = split_datasets.map(
preprocess_function,
batched=True,
remove_columns=split_datasets["train"].column_names,
)
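To sanity-check the preprocessing, you can decode the inputs and labels of one processed example back to text (an optional check; sample is just a temporary name used here):
sample = tokenized_datasets["train"][1]
print(tokenizer.decode(sample["input_ids"]))
print(tokenizer.decode(sample["labels"]))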
from transformers import TFAutoModelForSeq2SeqLM
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()
batch["labels"]
batch["decoder_input_ids"]
for i in range(1, 3):
print(tokenized_datasets["train"][i]["labels"])
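If you're curious, you can also convert the decoder inputs of the batch back to tokens to see that they are the labels shifted one position to the right, starting from the decoder start token (the pad token for Marian models). This inspection is optional:
print(tokenizer.convert_ids_to_tokens(batch["decoder_input_ids"].numpy()[0].tolist()))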
tf_train_dataset = model.prepare_tf_dataset(
tokenized_datasets["train"],
collate_fn=data_collator,
shuffle=True,
batch_size=32,
)
tf_eval_dataset = model.prepare_tf_dataset(
tokenized_datasets["validation"],
collate_fn=data_collator,
shuffle=False,
batch_size=16,
)
!pip install sacrebleu
from datasets import load_metric
metric = load_metric("sacrebleu")
predictions = [
"This plugin lets you translate web pages between several languages automatically."
]
references = [
[
"This plugin allows you to automatically translate web pages between several languages."
]
]
metric.compute(predictions=predictions, references=references)
predictions = ["This This This This"]
references = [
[
"This plugin allows you to automatically translate web pages between several languages."
]
]
metric.compute(predictions=predictions, references=references)
predictions = ["This plugin"]
references = [
[
"This plugin allows you to automatically translate web pages between several languages."
]
]
metric.compute(predictions=predictions, references=references)
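The compute() output contains several diagnostic fields in addition to the BLEU score itself; in the evaluation function below, only the score field is kept, for instance:
result = metric.compute(predictions=predictions, references=references)
print(round(result["score"], 2))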
import numpy as np
import tensorflow as tf
from tqdm import tqdm
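# Pad every batch to a multiple of 128: XLA recompiles generate() for each new input shape,
# so limiting the number of distinct shapes keeps inference fast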
generation_data_collator = DataCollatorForSeq2Seq(
tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128
)
tf_generate_dataset = model.prepare_tf_dataset(
tokenized_datasets["validation"],
collate_fn=generation_data_collator,
shuffle=False,
batch_size=8,
)
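# Compile the generation loop with XLA (jit_compile=True) to speed up batched inference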
@tf.function(jit_compile=True)
def generate_with_xla(batch):
return model.generate(
input_ids=batch["input_ids"],
attention_mask=batch["attention_mask"],
max_new_tokens=128,
)
def compute_metrics():
    all_preds = []
    all_labels = []

    for batch, labels in tqdm(tf_generate_dataset):
        predictions = generate_with_xla(batch)
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        labels = labels.numpy()
        # Replace -100 in the labels since they can't be decoded
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        decoded_preds = [pred.strip() for pred in decoded_preds]
        decoded_labels = [[label.strip()] for label in decoded_labels]
        all_preds.extend(decoded_preds)
        all_labels.extend(decoded_labels)

    result = metric.compute(predictions=all_preds, references=all_labels)
    return {"bleu": result["score"]}
from huggingface_hub import notebook_login
notebook_login()
print(compute_metrics())
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf
# The number of training steps is the number of samples in the dataset, divided by the batch size, then multiplied
# by the total number of epochs. Note that tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face dataset, so its len() is already num_samples // batch_size.
num_epochs = 3
num_train_steps = len(tf_train_dataset) * num_epochs
optimizer, schedule = create_optimizer(
init_lr=5e-5,
num_warmup_steps=0,
num_train_steps=num_train_steps,
weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)
# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
output_dir="marian-finetuned-kde4-en-to-fr", tokenizer=tokenizer
)
model.fit(
tf_train_dataset,
validation_data=tf_eval_dataset,
callbacks=[callback],
epochs=num_epochs,
)
print(compute_metrics())
from transformers import pipeline
# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")
translator(
"Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)