This notebook is an adaptation of the Hugging Face (HF) notebook question_answering.ipynb and of the HF script run_qa.py for fine-tuning a (transformer) Masked Language Model (MLM) like BERT on the QA task with the Portuguese SQuAD 1.1 dataset.
In order to speed up the fine-tuning of the model on a single GPU, we use the DeepSpeed library with the configuration provided by HF in the notebook transformers + deepspeed CLI.
Note: the paragraph about Causal Language Modeling (CLM) is not included in this notebook, and all unnecessary code about Masked Language Modeling (MLM) has been deleted from the original notebook.
If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers, 🤗 Datasets and DeepSpeed. Uncomment the following cells and run them.
Pytorch
# !pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
DeepSpeed
# !pip install git+https://github.com/microsoft/deepspeed
Datasets, Tokenizers, Transformers
# git clone https://github.com/huggingface/transformers
# cd transformers
# # examples change a lot so let's pick a sha that we know this notebook will work with
# # comment out/remove the next line if you want the master
# # git checkout d2753dcbec712350
# pip install -e .
# pip install -r examples/pytorch/translation/requirements.txt
In this notebook's folder, you need to create symbolic links to 3 files in the transformers folder you just installed.
#!ln -s ~/transformers/examples/pytorch/question-answering/run_qa.py
#!ln -s ~/transformers/examples/pytorch/question-answering/trainer_qa.py
#!ln -s ~/transformers/examples/pytorch/question-answering/utils_qa.py
Let's check our installation.
import sys; print('python:',sys.version)
import pathlib
from pathlib import Path
import torch; print('Pytorch:',torch.__version__)
import transformers; print('transformers:',transformers.__version__)
import tokenizers; print('tokenizers:',tokenizers.__version__)
import datasets; print('datasets:',datasets.__version__)
import deepspeed; print('deepspeed:',deepspeed.__version__)
# Versions installed:
# python: 3.8.10 (default, Jun 4 2021, 15:09:15)
# [GCC 7.5.0]
# Pytorch: 1.8.1+cu111
# transformers: 4.7.0.dev0
# tokenizers: 0.10.3
# datasets: 1.8.0
# deepspeed: 0.4.1+fa7921e
python: 3.8.10 (default, Jun  4 2021, 15:09:15)
[GCC 7.5.0]
Pytorch: 1.8.1+cu111
transformers: 4.7.0.dev0
tokenizers: 0.10.3
datasets: 1.8.0
deepspeed: 0.4.1+fa7921e
If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.
You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs here.
In this notebook, we will see how to fine-tune one of the 🤗 Transformers models on a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for these kinds of tasks and use the Trainer API to fine-tune a model on it.
Note: This notebook fine-tunes models that answer questions by extracting a substring from a context, not by generating new text.
This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the Model Hub, as long as that model has a version with a question answering head and a fast tokenizer (check on this table if this is the case). It might just need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:
# This flag is the difference between SQuAD v1 and v2 (if you're using another dataset, it indicates whether impossible
# answers are allowed or not).
squad_v2 = False
batch_size = 16
model_name_or_path = "neuralmind/bert-large-portuguese-cased"
For our example here, we'll use the Portuguese SQuAD 1.1 dataset, which is a translation of the English SQuAD dataset. The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or CSV file (see the Datasets documentation on how to load them), it might need some adjustments to the names of the columns used.
dataset_name = "squad11pt"
# %%time
# if dataset_name == "squad11pt":
#     # create dataset folder
#     root = Path.cwd()
#     path_to_dataset = root.parent/'data'/dataset_name
#     path_to_dataset.mkdir(parents=True, exist_ok=True)
#     # Get dataset SQUAD in Portuguese
#     %cd {path_to_dataset}
#     !wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn" -O squad-pt.tar.gz && rm -rf /tmp/cookies.txt
#     # unzip
#     !tar -xvf squad-pt.tar.gz
#     # Get the train and validation json files in the HF script format
#     # inspiration: file squad.py at https://github.com/huggingface/datasets/tree/master/datasets/squad
#     import json
#     files = ['squad-train-v1.1.json', 'squad-dev-v1.1.json']
#     for file in files:
#         # Opening the JSON file returns the JSON object as a dictionary
#         f = open(file, encoding="utf-8")
#         data = json.load(f)
#         # Iterating through the json list
#         entry_list = list()
#         id_list = list()
#         for row in data['data']:
#             title = row['title']
#             for paragraph in row['paragraphs']:
#                 context = paragraph['context']
#                 for qa in paragraph['qas']:
#                     entry = {}
#                     qa_id = qa['id']
#                     question = qa['question']
#                     answers = qa['answers']
#                     entry['id'] = qa_id
#                     entry['title'] = title.strip()
#                     entry['context'] = context.strip()
#                     entry['question'] = question.strip()
#                     answer_starts = [answer["answer_start"] for answer in answers]
#                     answer_texts = [answer["text"].strip() for answer in answers]
#                     entry['answers'] = {}
#                     entry['answers']['answer_start'] = answer_starts
#                     entry['answers']['text'] = answer_texts
#                     entry_list.append(entry)
#         reverse_entry_list = entry_list[::-1]
#         # for entries with the same id, keep only the last one (texts corrected by the group Deep Learning Brasil)
#         unique_ids_list = list()
#         unique_entry_list = list()
#         for entry in reverse_entry_list:
#             qa_id = entry['id']
#             if qa_id not in unique_ids_list:
#                 unique_ids_list.append(qa_id)
#                 unique_entry_list.append(entry)
#         # Closing file
#         f.close()
#         new_dict = {}
#         new_dict['data'] = unique_entry_list
#         file_name = 'pt_' + str(file)
#         with open(file_name, 'w') as json_file:
#             json.dump(new_dict, json_file)
# %cd {root}
We will use the 🤗 Datasets library to download the data and get the metric we need for evaluation (to compare our model to the benchmark). This can be easily done with the functions load_dataset and load_metric.
from datasets import load_dataset, load_metric

if dataset_name == "squad11pt":
    # dataset folder
    root = Path.cwd()
    path_to_dataset = root.parent/'data'/dataset_name
    # paths to files
    train_file = str(path_to_dataset/'pt_squad-train-v1.1.json')
    validation_file = str(path_to_dataset/'pt_squad-dev-v1.1.json')
    datasets = load_dataset('json',
                            data_files={'train': train_file,
                                        'validation': validation_file},
                            field='data')
The datasets object itself is a DatasetDict, which contains one key per split (here, train and validation).
datasets
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87510
    })
    validation: Dataset({
        features: ['answers', 'context', 'id', 'question', 'title'],
        num_rows: 10570
    })
})
To access an actual element, you need to select a split first, then give an index:
datasets["train"][0]
{'id': '5735d259012e2f140011a0a1', 'title': 'Kathmandu', 'context': 'A Cidade Metropolitana de Catmandu (KMC), a fim de promover as relações internacionais, criou uma Secretaria de Relações Internacionais (IRC). O primeiro relacionamento internacional da KMC foi estabelecido em 1975 com a cidade de Eugene, Oregon, Estados Unidos. Essa atividade foi aprimorada ainda mais com o estabelecimento de relações formais com outras 8 cidades: Cidade de Motsumoto, Japão, Rochester, EUA, Yangon (antiga Rangum) de Mianmar, Xian da República Popular da China, Minsk da Bielorrússia e Pyongyang de República Democrática da Coréia. O esforço constante da KMC é aprimorar sua interação com os países da SAARC, outras agências internacionais e muitas outras grandes cidades do mundo para alcançar melhores programas de gestão urbana e desenvolvimento para Katmandu.', 'question': 'De que KMC é um inicialismo?', 'answers': {'answer_start': [2], 'text': ['Cidade Metropolitana de Catmandu']}}
We can see the answers are indicated by their start position in the text (here at character 2) and their full text, which is a substring of the context as we mentioned above.
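This consistency between answer_start and the context can be verified in plain Python. The snippet below is a minimal sketch that reuses the KMC example above (shortened here for readability):

```python
# A shortened version of the record shown above (for illustration only).
example = {
    "context": "A Cidade Metropolitana de Catmandu (KMC), a fim de promover as relações internacionais, criou uma Secretaria de Relações Internacionais (IRC).",
    "answers": {"answer_start": [2], "text": ["Cidade Metropolitana de Catmandu"]},
}

# Each answer text must be exactly the substring of the context
# starting at its answer_start position.
for start, text in zip(example["answers"]["answer_start"], example["answers"]["text"]):
    assert example["context"][start:start + len(text)] == text
print("all answer spans are consistent with their contexts")
```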
Let's set up the DeepSpeed configuration.
To use a linear LR decay after warmup as scheduler (equivalent to the default one of the HF Trainer), we changed WarmupLR to WarmupDecayLR in the DeepSpeed configuration file, and kept a copy of the initial scheduler code here:
# the lr stays constant after the warmup (this is not equivalent to the default scheduler of HF, which is linear)
"scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto"
    }
},
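To make the difference between the two schedulers concrete, here is a minimal sketch in plain Python (not DeepSpeed code); the step counts and learning rates are illustrative only:

```python
def warmup_lr(step, max_lr, warmup_steps):
    """WarmupLR: linear warmup, then the LR stays constant."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr

def warmup_decay_lr(step, max_lr, warmup_steps, total_steps):
    """WarmupDecayLR: linear warmup, then linear decay to 0
    (the behavior of the HF Trainer's default 'linear' scheduler)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(warmup_lr(1000, 5e-5, 100))              # constant after warmup: 5e-05
print(warmup_decay_lr(1000, 5e-5, 100, 1000))  # decayed to 0 at the last step: 0.0
```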
%%bash
cat <<'EOT' > ds_config_zero2.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "last_batch_iteration": -1,
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT
Compared to ZeRO-2, the ZeRO-3 configuration makes it possible to train larger models, but at the cost of a longer training time. For this reason, we will not be using the ZeRO-3 configuration.
%%bash
cat <<'EOT' > ds_config_zero3.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "last_batch_iteration": -1,
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT
Let's set up all the training arguments needed by the script run_qa.py.
num_gpus = 1  # run the script on only one gpu
gpu = 0  # select the gpu

# setup the training arguments
do_train = True  # False
do_eval = True

if dataset_name == "squad11pt":
    # dataset folder
    root = Path.cwd()
    path_to_dataset = root.parent/'data'/dataset_name
    # paths to files
    train_file = str(path_to_dataset/'pt_squad-train-v1.1.json')
    validation_file = str(path_to_dataset/'pt_squad-dev-v1.1.json')

# if you want to test the trainer, set up the following variables
max_train_samples = 200  # None
max_eval_samples = 50  # None

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
max_seq_length = 384
# Whether to pad all samples to `max_seq_length`.
# If False, will pad the samples dynamically when batching to the maximum length in the batch
# (which can be faster on GPU but will be slower on TPU).
pad_to_max_length = True
# If true, some of the examples do not have an answer.
version_2_with_negative = False
# When splitting up a long document into chunks, how much stride to take between chunks.
doc_stride = 128
# The total number of n-best predictions to generate when looking for an answer.
n_best_size = 20
# The maximum length of an answer that can be generated. This is needed because the start
# and end predictions are not conditioned on one another.
max_answer_length = 30
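Two of these parameters deserve a small illustration. The snippets below are minimal sketches in plain Python (not the actual run_qa.py implementation, and with made-up token ids and logits):

```python
# 1) doc_stride: a long tokenized context is split into overlapping chunks of
#    max_len tokens; consecutive chunks overlap by `stride` tokens.
def chunk(tokens, max_len, stride):
    step = max_len - stride
    return [tokens[i:i + max_len] for i in range(0, max(1, len(tokens) - stride), step)]

chunks = chunk(list(range(1000)), max_len=384, stride=128)
print(len(chunks), len(chunks[0]))  # 4 384

# 2) max_answer_length: since start and end logits are predicted independently,
#    we score every (start, end) pair and keep the best valid span, i.e. one
#    with end >= start and a length of at most max_answer_length tokens.
def best_span(start_logits, end_logits, max_answer_length):
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_length, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best[:2]

print(best_span([0.1, 2.0, 0.3], [0.2, 0.1, 1.5], max_answer_length=30))  # (1, 2)
```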
If you keep the value 1e-8 for adam_epsilon in fp16 mode, it underflows to zero; the first non-zero value in this mode is 1e-7. After some testing, we found that adam_epsilon = 1e-4 gives the best results.
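This underflow can be checked directly, using NumPy's float16 type as a stand-in for fp16 arithmetic:

```python
import numpy as np

# fp16 cannot represent 1e-8: the smallest positive subnormal is about 5.96e-8,
# so 1e-8 rounds to zero, while 1e-7 survives (rounded to a nearby subnormal).
print(np.float16(1e-8))      # 0.0
print(np.float16(1e-7) > 0)  # True
```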
We use the ZeRO-2 mode for DeepSpeed.
# epochs, bs, GA
evaluation_strategy = "epoch" # no
BS = batch_size
gradient_accumulation_steps = 1
# optimizer (AdamW)
learning_rate = 5e-5
weight_decay = 0.01 # 0.0
adam_beta1 = 0.9
adam_beta2 = 0.999
adam_epsilon = 1e-4 # 1e-08
# epochs
num_train_epochs = 3.
# scheduler
lr_scheduler_type = 'linear'
warmup_ratio = 0.0
warmup_steps = 0
# logs
logging_strategy = "steps"
logging_first_step = True # False
logging_steps = 500 # if strategy = "steps"
eval_steps = logging_steps # logging_steps
# checkpoints
save_strategy = "epoch" # steps
save_steps = 500 # if save_strategy = "steps"
save_total_limit = 1 # None
# no cuda, seed
no_cuda = False
seed = 42
# fp16
fp16 = True # False
fp16_opt_level = 'O1'
fp16_backend = "auto"
fp16_full_eval = False
# bar
disable_tqdm = False # True
remove_unused_columns = True
#label_names (List[str], optional)
# best model
load_best_model_at_end = True # False
metric_for_best_model = "eval_f1"
greater_is_better = True
# deepspeed
zero = 2
if zero == 2:
    deepspeed_config = "ds_config_zero2.json"
elif zero == 3:
    deepspeed_config = "ds_config_zero3.json"
# folder for training outputs
outputs = model_name_or_path.replace('/','-') + '-' + dataset_name \
          + '_wd' + str(weight_decay) + '_eps' + str(adam_epsilon) \
          + '_epochs' + str(num_train_epochs) \
          + '-lr' + str(learning_rate)
path_to_outputs = root/'models_outputs'/outputs
# subfolder for model outputs
output_dir = path_to_outputs/'output_dir'
overwrite_output_dir = True  # False
# logs
logging_dir = path_to_outputs/'logging_dir'
This is needed to launch the deepspeed command in our server configuration.
import os
PATH = os.getenv('PATH')
%env PATH=/mnt/home/xxxx/anaconda3/envs/aaaa/bin:$PATH
# xxxx is the folder name where anaconda was installed
# aaaa is the name of the virtual environment in which this notebook is run
The magic command %env corresponds to export in Linux. It allows setting the values of all the arguments of the script run_qa.py.
Note: as we noticed that the script runs in this notebook without exported environment variables (it uses the local ones), we do not use this magic command.
!rm -r {output_dir}
Now, we can launch the training :-)
# copy/paste/uncomment the 2 following lines in the following cell if you want to limit the number of data (useful for testing)
# --max_train_samples $max_train_samples \
# --max_eval_samples $max_eval_samples \
%%time
# !deepspeed --num_gpus=$num_gpus run_qa.py \
!deepspeed --include localhost:$gpu run_qa.py \
--model_name_or_path $model_name_or_path \
--train_file $train_file \
--do_train $do_train \
--do_eval $do_eval \
--validation_file $validation_file \
--max_seq_length $max_seq_length \
--pad_to_max_length $pad_to_max_length \
--version_2_with_negative $version_2_with_negative \
--doc_stride $doc_stride \
--n_best_size $n_best_size \
--max_answer_length $max_answer_length \
--output_dir $output_dir \
--overwrite_output_dir $overwrite_output_dir \
--evaluation_strategy $evaluation_strategy \
--per_device_train_batch_size $batch_size \
--per_device_eval_batch_size $batch_size \
--gradient_accumulation_steps $gradient_accumulation_steps \
--learning_rate $learning_rate \
--weight_decay $weight_decay \
--adam_beta1 $adam_beta1 \
--adam_beta2 $adam_beta2 \
--adam_epsilon $adam_epsilon \
--num_train_epochs $num_train_epochs \
--warmup_ratio $warmup_ratio \
--warmup_steps $warmup_steps \
--logging_dir $logging_dir \
--logging_strategy $logging_strategy \
--logging_first_step $logging_first_step \
--logging_steps $logging_steps \
--eval_steps $eval_steps \
--save_strategy $save_strategy \
--save_steps $save_steps \
--save_total_limit $save_total_limit \
--no_cuda $no_cuda \
--seed $seed \
--fp16 $fp16 \
--fp16_opt_level $fp16_opt_level \
--fp16_backend $fp16_backend \
--fp16_full_eval $fp16_full_eval \
--disable_tqdm $disable_tqdm \
--remove_unused_columns $remove_unused_columns \
--load_best_model_at_end $load_best_model_at_end \
--metric_for_best_model $metric_for_best_model \
--greater_is_better $greater_is_better \
--deepspeed $deepspeed_config
Results
[INFO|trainer_pt_utils.py:908] 2021-06-18 05:32:34,020 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:913] 2021-06-18 05:32:34,020 >> epoch = 3.0
[INFO|trainer_pt_utils.py:913] 2021-06-18 05:32:34,020 >> eval_exact_match = 72.6774
[INFO|trainer_pt_utils.py:913] 2021-06-18 05:32:34,020 >> eval_f1 = 84.4315
[INFO|trainer_pt_utils.py:913] 2021-06-18 05:32:34,020 >> eval_samples = 10917
CPU times: user 5min 5s, sys: 51.3 s, total: 5min 56s
Wall time: 3h 20min 36s
#!pip install tensorboard
%load_ext tensorboard
# %reload_ext tensorboard
%tensorboard --logdir {logging_dir} --bind_all
To get back weights in fp32, read this: https://huggingface.co/transformers/master/main_classes/deepspeed.html#getting-the-model-weights-out
!wget https://raw.githubusercontent.com/microsoft/DeepSpeed/master/deepspeed/utils/zero_to_fp32.py
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/mnt/home/pierre/.wget-hsts'. HSTS will be disabled.
--2021-06-18 12:02:06--  https://raw.githubusercontent.com/microsoft/DeepSpeed/master/deepspeed/utils/zero_to_fp32.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6468 (6.3K) [text/plain]
Saving to: ‘zero_to_fp32.py.1’

zero_to_fp32.py.1   100%[===================>]   6.32K  --.-KB/s    in 0s

2021-06-18 12:02:06 (66.1 MB/s) - ‘zero_to_fp32.py.1’ saved [6468/6468]
# in this training, checkpoint-16734 contains the best model
%cd {output_dir}/'checkpoint-16734'
!ls -al
total 651872
drwxrwxr-x 3 pierre pierre      4096 Jun 18 11:38 .
drwxrwxr-x 3 pierre pierre      4096 Jun 18 11:38 ..
-rw-rw-r-- 1 pierre pierre       855 Jun 18 11:38 config.json
drwxrwxr-x 2 pierre pierre      4096 Jun 18 11:38 global_step16734
-rw-rw-r-- 1 pierre pierre        16 Jun 18 11:38 latest
-rw-rw-r-- 1 pierre pierre 666791233 Jun 18 11:38 pytorch_model.bin
-rw-rw-r-- 1 pierre pierre     14657 Jun 18 11:38 rng_state_0.pth
-rw-rw-r-- 1 pierre pierre       112 Jun 18 11:38 special_tokens_map.json
-rw-rw-r-- 1 pierre pierre    438465 Jun 18 11:38 tokenizer.json
-rw-rw-r-- 1 pierre pierre       506 Jun 18 11:38 tokenizer_config.json
-rw-rw-r-- 1 pierre pierre      5051 Jun 18 11:38 trainer_state.json
-rw-rw-r-- 1 pierre pierre      3951 Jun 18 11:38 training_args.bin
-rw-rw-r-- 1 pierre pierre    209528 Jun 18 11:38 vocab.txt
-rwxrw-r-- 1 pierre pierre      6468 Jun 18 11:38 zero_to_fp32.py
We can observe that pytorch_model.bin has a size of 666 MB because the model weights have been saved in fp16 format. Let's use the script zero_to_fp32.py from DeepSpeed to convert them to fp32 format.
path_to_zero_to_fp32 = root/'zero_to_fp32.py'
!python $path_to_zero_to_fp32 -h
usage: zero_to_fp32.py [-h] checkpoint_dir output_file

positional arguments:
  checkpoint_dir  path to the deepspeed checkpoint folder, e.g., path/checkpoint-1/global_step1
  output_file     path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-1/pytorch_model.bin)

optional arguments:
  -h, --help      show this help message and exit
!python $path_to_zero_to_fp32 global_step16734 pytorch_model.bin
Processing zero checkpoint 'global_step16734'
Detected checkpoint of type zero stage 2, world_size: 1
Saving fp32 state dict to pytorch_model.bin (total_numel=333348866)
!ls -al
total 1302912
drwxrwxr-x 3 pierre pierre       4096 Jun 18 11:38 .
drwxrwxr-x 3 pierre pierre       4096 Jun 18 11:38 ..
-rw-rw-r-- 1 pierre pierre        855 Jun 18 11:38 config.json
drwxrwxr-x 2 pierre pierre       4096 Jun 18 11:38 global_step16734
-rw-rw-r-- 1 pierre pierre         16 Jun 18 11:38 latest
-rw-rw-r-- 1 pierre pierre 1333453496 Jun 18 14:28 pytorch_model.bin
-rw-rw-r-- 1 pierre pierre      14657 Jun 18 11:38 rng_state_0.pth
-rw-rw-r-- 1 pierre pierre        112 Jun 18 11:38 special_tokens_map.json
-rw-rw-r-- 1 pierre pierre     438465 Jun 18 11:38 tokenizer.json
-rw-rw-r-- 1 pierre pierre        506 Jun 18 11:38 tokenizer_config.json
-rw-rw-r-- 1 pierre pierre       5051 Jun 18 11:38 trainer_state.json
-rw-rw-r-- 1 pierre pierre       3951 Jun 18 11:38 training_args.bin
-rw-rw-r-- 1 pierre pierre     209528 Jun 18 11:38 vocab.txt
-rwxrw-r-- 1 pierre pierre       6468 Jun 18 11:38 zero_to_fp32.py
That's it! The new size of pytorch_model.bin (1.3 GB) confirms that the weights are now stored in fp32 format.
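A quick back-of-the-envelope check, using the total_numel value (333,348,866 parameters) reported by zero_to_fp32.py above:

```python
# Each parameter takes 2 bytes in fp16 and 4 bytes in fp32.
num_params = 333_348_866
print(f"fp16: {num_params * 2 / 1e9:.2f} GB")  # ~0.67 GB, matching the 666 MB checkpoint
print(f"fp32: {num_params * 4 / 1e9:.2f} GB")  # ~1.33 GB, matching the converted file
```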
%cd {root}
import pathlib
from pathlib import Path
# model source
source = root/output_dir/'checkpoint-16734/'
# model destination
dest = root/'HFmodels'
fname_HF = 'bert-large-cased-squad-v1.1-portuguese'
path_to_awesome_name_you_picked = dest/fname_HF
path_to_awesome_name_you_picked.mkdir(exist_ok=True, parents=True)
# copy model to destination
!cp {source}/'config.json' {dest/fname_HF}
!cp {source}/'pytorch_model.bin' {dest/fname_HF}
# copy tokenizer to destination
!cp {source}/'tokenizer_config.json' {dest/fname_HF}
!cp {source}/'special_tokens_map.json' {dest/fname_HF}
!cp {source}/'vocab.txt' {dest/fname_HF}
Make your model work on all frameworks (source)
from transformers import BertForQuestionAnswering
pt_model = BertForQuestionAnswering.from_pretrained(str(path_to_awesome_name_you_picked))
pt_model.save_pretrained(str(path_to_awesome_name_you_picked))
import tensorflow
from transformers import TFBertForQuestionAnswering
tf_model = TFBertForQuestionAnswering.from_pretrained(str(path_to_awesome_name_you_picked), from_pt=True)
tf_model.save_pretrained(str(path_to_awesome_name_you_picked))
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForQuestionAnswering: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForQuestionAnswering from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForQuestionAnswering from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForQuestionAnswering for predictions without further training.
!ls -al {path_to_awesome_name_you_picked}
total 2605184
drwxrwxr-x 2 pierre pierre       4096 Jun 18 14:37 .
drwxrwxr-x 3 pierre pierre       4096 Jun 18 14:34 ..
-rw-rw-r-- 1 pierre pierre        918 Jun 18 14:37 config.json
-rw-rw-r-- 1 pierre pierre 1333560247 Jun 18 14:37 pytorch_model.bin
-rw-rw-r-- 1 pierre pierre        112 Jun 18 14:34 special_tokens_map.json
-rw-rw-r-- 1 pierre pierre 1333906712 Jun 18 14:37 tf_model.h5
-rw-rw-r-- 1 pierre pierre        506 Jun 18 14:34 tokenizer_config.json
-rw-rw-r-- 1 pierre pierre     209528 Jun 18 14:34 vocab.txt
Don't forget to upload your model to the 🤗 Model Hub. You can then use it online to generate results like the one shown in the first picture of this notebook!
import gradio as gr

iface = gr.Interface.load("huggingface/pierreguillou/bert-large-cased-squad-v1.1-portuguese")
iface.launch(server_name='xxxx')
# xxxx is your server name (alias); note that server_name is an argument of launch(), not of Interface.load()
Now, let's load our fine-tuned model and its tokenizer with transformers.
import pathlib
from pathlib import Path
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
model_qa = AutoModelForQuestionAnswering.from_pretrained(path_to_awesome_name_you_picked)
tokenizer_qa = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
from transformers import pipeline
nlp = pipeline("question-answering", model=model_qa, tokenizer=tokenizer_qa)
# source: https://pt.wikipedia.org/wiki/Pandemia_de_COVID-19
context = r"""A pandemia de COVID-19, também conhecida como pandemia de coronavírus, é uma pandemia em curso de COVID-19,
uma doença respiratória causada pelo coronavírus da síndrome respiratória aguda grave 2 (SARS-CoV-2).
O vírus tem origem zoonótica e o primeiro caso conhecido da doença remonta a dezembro de 2019 em Wuhan, na China.
Em 20 de janeiro de 2020, a Organização Mundial da Saúde (OMS) classificou o surto
como Emergência de Saúde Pública de Âmbito Internacional e, em 11 de março de 2020, como pandemia.
Em 18 de junho de 2021, 177 349 274 casos foram confirmados em 192 países e territórios,
com 3 840 181 mortes atribuídas à doença, tornando-se uma das pandemias mais mortais da história.
Os sintomas de COVID-19 são altamente variáveis, variando de nenhum a doenças com risco de morte.
O vírus se espalha principalmente pelo ar quando as pessoas estão perto umas das outras.
Ele deixa uma pessoa infectada quando ela respira, tosse, espirra ou fala e entra em outra pessoa pela boca, nariz ou olhos.
Ele também pode se espalhar através de superfícies contaminadas.
As pessoas permanecem contagiosas por até duas semanas e podem espalhar o vírus mesmo se forem assintomáticas.
"""
%%time
question = "Quando começou a pandemia de Covid-19 no mundo?"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'dezembro de 2019', score: 0.5087, start: 290, end: 306
CPU times: user 1min 55s, sys: 7.79 s, total: 2min 2s
Wall time: 3.52 s
%%time
question = "Qual é a data de início da pandemia Covid-19 em todo o mundo?"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'dezembro de 2019', score: 0.4988, start: 290, end: 306
CPU times: user 1min 56s, sys: 6.79 s, total: 2min 3s
Wall time: 3.5 s
%%time
question = "A Covid-19 tem algo a ver com animais?"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'O vírus tem origem zoonótica', score: 0.6001, start: 213, end: 241
CPU times: user 1min 57s, sys: 13.8 s, total: 2min 11s
Wall time: 3.76 s
%%time
question = "Onde foi descoberta a Covid-19?"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'Wuhan, na China', score: 0.9415, start: 310, end: 325
CPU times: user 1min 57s, sys: 9.3 s, total: 2min 6s
Wall time: 3.62 s
%%time
question = "Quantos casos houve?"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: '177 349 274', score: 0.828, start: 536, end: 547
CPU times: user 1min 54s, sys: 11.6 s, total: 2min 6s
Wall time: 3.62 s
%%time
question = "Quantos mortes?"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: '3 840 181', score: 0.906, start: 606, end: 615
CPU times: user 1min 58s, sys: 13.3 s, total: 2min 11s
Wall time: 3.77 s
%%time
question = "Quantos paises tiveram casos?"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: '192', score: 0.8958, start: 575, end: 578
CPU times: user 1min 54s, sys: 10 s, total: 2min 4s
Wall time: 3.56 s
%%time
question = "Quais são sintomas de COVID-19"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'nenhum a doenças com risco de morte', score: 0.298, start: 761, end: 796
CPU times: user 1min 56s, sys: 11.5 s, total: 2min 8s
Wall time: 3.66 s
%%time
question = "Como se espalha o vírus?"
result = nlp(question=question, context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'principalmente pelo ar quando as pessoas estão perto umas das outras', score: 0.3173, start: 818, end: 886
CPU times: user 1min 52s, sys: 8.4 s, total: 2min 1s
Wall time: 3.46 s