Fine-tuning BERT (base or large) on a question-answering task using the transformers (Hugging Face) and DeepSpeed (Microsoft) libraries

Context

This notebook is an adaptation of the Hugging Face (HF) notebook question_answering.ipynb and script run_qa.py, for fine-tuning a transformer Masked Language Model (MLM) such as BERT on the QA task with the Portuguese SQuAD 1.1 dataset.

To speed up fine-tuning of the model on a single GPU, the DeepSpeed library is used with the configuration provided by HF in the notebook transformers + deepspeed CLI.

Note: the paragraph about Causal Language Modeling (CLM) is not included in this notebook, and all unnecessary Masked Language Modeling (MLM) code has been deleted from the original notebook.

Installation

If you're opening this notebook on Colab, you will probably need to install đŸ€— Transformers, đŸ€— Datasets and DeepSpeed. Uncomment the following cells and run them.

Pytorch

In [1]:
# !pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

DeepSpeed

In [2]:
# !pip install git+https://github.com/microsoft/deepspeed

Datasets, Tokenizers, Transformers

In [3]:
# git clone https://github.com/huggingface/transformers
# cd transformers
# # examples change a lot so let's pick a sha that we know this notebook will work with
# # comment out/remove the next line if you want the master
# # git checkout  d2753dcbec712350
# pip install -e .
# pip install -r examples/pytorch/translation/requirements.txt

In this notebook's folder, you need to create symbolic links to 3 files in the transformers folder you just installed.

In [5]:
#!ln -s ~/transformers/examples/pytorch/question-answering/run_qa.py
#!ln -s ~/transformers/examples/pytorch/question-answering/trainer_qa.py
#!ln -s ~/transformers/examples/pytorch/question-answering/utils_qa.py

Let's check our installation.

In [6]:
import sys; print('python:',sys.version)
import pathlib
from pathlib import Path

import torch; print('Pytorch:',torch.__version__)

import transformers; print('transformers:',transformers.__version__)
import tokenizers; print('tokenizers:',tokenizers.__version__)
import datasets; print('datasets:',datasets.__version__)

import deepspeed; print('deepspeed:',deepspeed.__version__)

# Versions installed:
# python: 3.8.10 (default, Jun  4 2021, 15:09:15) 
# [GCC 7.5.0]
# Pytorch: 1.8.1+cu111
# transformers: 4.7.0.dev0
# tokenizers: 0.10.3
# datasets: 1.8.0
# deepspeed: 0.4.1+fa7921e
python: 3.8.10 (default, Jun  4 2021, 15:09:15) 
[GCC 7.5.0]
Pytorch: 1.8.1+cu111
transformers: 4.7.0.dev0
tokenizers: 0.10.3
datasets: 1.8.0
deepspeed: 0.4.1+fa7921e

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs here.

In this notebook, we will see how to fine-tune one of the đŸ€— Transformers model to a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for these kinds of tasks and use the Trainer API to fine-tune a model on it.

Widget inference representing the QA task

Note: This notebook fine-tunes models that answer questions by extracting a substring of a context, not by generating new text.

This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the Model Hub as long as that model has a version with a question answering head and a fast tokenizer (check on this table if this is the case). It might just need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [7]:
# This flag distinguishes SQuAD v1 from SQuAD v2 (if you're using another dataset, it indicates whether impossible
# answers are allowed or not).
squad_v2 = False
batch_size = 16

BERT model

In [8]:
model_name_or_path = "neuralmind/bert-large-portuguese-cased"

Dataset

Loading the dataset

For our example here, we'll use the Portuguese SQuAD 1.1 dataset, which is a translation of the English SQuAD dataset. The notebook should work with any question answering dataset provided by the đŸ€— Datasets library. If you're using your own dataset defined from a JSON or CSV file (see the Datasets documentation on how to load them), it might need some adjustments in the names of the columns used.

In [9]:
dataset_name = "squad11pt"
In [10]:
# %%time
# if dataset_name == "squad11pt":
    
#     # create dataset folder 
#     root = Path.cwd()
#     path_to_dataset = root.parent/'data'/dataset_name
#     path_to_dataset.mkdir(parents=True, exist_ok=True) 

#     # Get dataset SQUAD in Portuguese
#     %cd {path_to_dataset}
#     !wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Q0IaIlv2h2BC468MwUFmUST0EyN7gNkn" -O squad-pt.tar.gz && rm -rf /tmp/cookies.txt

#     # unzip 
#     !tar -xvf squad-pt.tar.gz

#     # Get the train and validation json file in the HF script format 
#     # inspiration: file squad.py at https://github.com/huggingface/datasets/tree/master/datasets/squad

#     import json 
#     files = ['squad-train-v1.1.json','squad-dev-v1.1.json']

#     for file in files:

#         # Opening JSON file & returns JSON object as a dictionary 
#         f = open(file, encoding="utf-8") 
#         data = json.load(f) 

#         # Iterating through the json list 
#         entry_list = list()
#         id_list = list()

#         for row in data['data']: 
#             title = row['title']

#             for paragraph in row['paragraphs']:
#                 context = paragraph['context']

#                 for qa in paragraph['qas']:
#                     entry = {}

#                     qa_id = qa['id']
#                     question = qa['question']
#                     answers = qa['answers']

#                     entry['id'] = qa_id
#                     entry['title'] = title.strip()
#                     entry['context'] = context.strip()
#                     entry['question'] = question.strip()

#                     answer_starts = [answer["answer_start"] for answer in answers]
#                     answer_texts = [answer["text"].strip() for answer in answers]
#                     entry['answers'] = {}
#                     entry['answers']['answer_start'] = answer_starts
#                     entry['answers']['text'] = answer_texts

#                     entry_list.append(entry)

#         reverse_entry_list = entry_list[::-1]

#         # for entries with same id, keep only last one (corrected texts by the group Deep Learning Brasil)
#         unique_ids_list = list()
#         unique_entry_list = list()
#         for entry in reverse_entry_list:
#             qa_id = entry['id']
#             if qa_id not in unique_ids_list:
#                 unique_ids_list.append(qa_id)
#                 unique_entry_list.append(entry)

#         # Closing file 
#         f.close() 

#         new_dict = {}
#         new_dict['data'] = unique_entry_list

#         file_name = 'pt_' + str(file)
#         with open(file_name, 'w') as json_file:
#             json.dump(new_dict, json_file)
            
# %cd {root}
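The deduplication step in the cell above (keeping only the last entry for each id, i.e. the texts corrected by the group Deep Learning Brasil) can be illustrated on toy data; `keep_last_by_id` is a hypothetical helper name for the logic, not part of the notebook's code:

```python
def keep_last_by_id(entries):
    """Keep only the last occurrence of each 'id' (later entries override earlier ones)."""
    seen_ids = set()
    unique = []
    # iterate in reverse so the last occurrence of each id is the one kept
    for entry in reversed(entries):
        if entry['id'] not in seen_ids:
            seen_ids.add(entry['id'])
            unique.append(entry)
    return unique

entries = [
    {'id': 'q1', 'question': 'original text'},
    {'id': 'q2', 'question': 'other question'},
    {'id': 'q1', 'question': 'corrected text'},  # same id: overrides the first entry
]
print(keep_last_by_id(entries))  # 2 entries; q1 keeps 'corrected text'
```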

Check the dataset

We will use the đŸ€— Datasets library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions load_dataset and load_metric.

In [ ]:
from datasets import load_dataset, load_metric

if dataset_name == "squad11pt":
    
    # dataset folder 
    root = Path.cwd()
    path_to_dataset = root.parent/'data'/dataset_name
    
    # paths to files
    train_file = str(path_to_dataset/'pt_squad-train-v1.1.json')
    validation_file = str(path_to_dataset/'pt_squad-dev-v1.1.json')
    
    datasets = load_dataset('json', 
                            data_files={'train': train_file, \
                                        'validation': validation_file, \
                                       }, 
                            field='data')

The datasets object itself is a DatasetDict, which contains one key per split (here, train and validation).

In [12]:
datasets
Out[12]:
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87510
    })
    validation: Dataset({
        features: ['answers', 'context', 'id', 'question', 'title'],
        num_rows: 10570
    })
})

To access an actual element, you need to select a split first, then give an index:

In [13]:
datasets["train"][0]
Out[13]:
{'id': '5735d259012e2f140011a0a1',
 'title': 'Kathmandu',
 'context': 'A Cidade Metropolitana de Catmandu (KMC), a fim de promover as relaçÔes internacionais, criou uma Secretaria de RelaçÔes Internacionais (IRC). O primeiro relacionamento internacional da KMC foi estabelecido em 1975 com a cidade de Eugene, Oregon, Estados Unidos. Essa atividade foi aprimorada ainda mais com o estabelecimento de relaçÔes formais com outras 8 cidades: Cidade de Motsumoto, JapĂŁo, Rochester, EUA, Yangon (antiga Rangum) de Mianmar, Xian da RepĂșblica Popular da China, Minsk da BielorrĂșssia e Pyongyang de RepĂșblica DemocrĂĄtica da CorĂ©ia. O esforço constante da KMC Ă© aprimorar sua interação com os paĂ­ses da SAARC, outras agĂȘncias internacionais e muitas outras grandes cidades do mundo para alcançar melhores programas de gestĂŁo urbana e desenvolvimento para Katmandu.',
 'question': 'De que KMC Ă© um inicialismo?',
 'answers': {'answer_start': [2],
  'text': ['Cidade Metropolitana de Catmandu']}}

We can see the answers are indicated by their start position in the text (here at character 2) and their full text, which is a substring of the context as we mentioned above.
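Since the answer text must be an exact substring of the context starting at answer_start, we can verify this alignment directly (a minimal sketch on a toy entry shaped like the SQuAD example above):

```python
example = {
    'context': 'A Cidade Metropolitana de Catmandu (KMC) criou uma Secretaria ...',
    'answers': {'answer_start': [2], 'text': ['Cidade Metropolitana de Catmandu']},
}

start = example['answers']['answer_start'][0]
text = example['answers']['text'][0]

# the gold answer is literally context[start : start + len(text)]
assert example['context'][start:start + len(text)] == text
```

This invariant is what allows the post-processing step of run_qa.py to map predicted token spans back to character positions in the original context.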

DeepSpeed

Let's setup the DeepSpeed configuration.

To use linear LR decay after warmup as the scheduler (equivalent to the default one in the HF Trainer), we changed WarmupLR to WarmupDecayLR in the DeepSpeed configuration file, and kept a copy of the initial scheduler code here:

# the lr stays constant after the warmup (this is not equivalent to the default HF scheduler, which is linear)
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
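The difference between the two schedules can be sketched numerically (hypothetical step counts; to my understanding, WarmupLR holds max_lr after warmup, while WarmupDecayLR decays linearly to 0 over total_num_steps, like the HF Trainer's 'linear' scheduler):

```python
def warmup_lr(step, max_lr, warmup_steps):
    # WarmupLR: linear warmup, then constant at max_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr

def warmup_decay_lr(step, max_lr, warmup_steps, total_steps):
    # WarmupDecayLR: linear warmup, then linear decay to 0 at total_steps
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)

max_lr, warmup, total = 5e-5, 100, 1000
print(warmup_lr(500, max_lr, warmup))                # still at max_lr mid-training
print(warmup_decay_lr(500, max_lr, warmup, total))   # already partway down
print(warmup_decay_lr(1000, max_lr, warmup, total))  # reaches 0 at the last step
```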

ZeRO-2

In [14]:
%%bash

cat <<'EOT' > ds_config_zero2.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
         "type": "WarmupDecayLR",
         "params": {
             "last_batch_iteration": -1,
             "total_num_steps": "auto",
             "warmup_min_lr": "auto",
             "warmup_max_lr": "auto",
             "warmup_num_steps": "auto"
         }
     },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT
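When the batch-size fields above are set to "auto", the HF Trainer integration fills them in from its own arguments; the resolved values are linked by a simple invariant that DeepSpeed enforces (shown here with the single-GPU settings of this notebook):

```python
# DeepSpeed invariant:
# train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
train_micro_batch_size_per_gpu = 16  # = per_device_train_batch_size in this notebook
gradient_accumulation_steps = 1
world_size = 1                       # one GPU

train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * world_size)
print(train_batch_size)  # 16: the effective batch size per optimizer step
```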

ZeRO-3

Compared to ZeRO-2, the ZeRO-3 configuration allows training larger models, but at the cost of a longer training time. For this reason, we will not be using the ZeRO-3 configuration here.

In [15]:
%%bash

cat <<'EOT' > ds_config_zero3.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
         "type": "WarmupDecayLR",
         "params": {
             "last_batch_iteration": -1,
             "total_num_steps": "auto",
             "warmup_min_lr": "auto",
             "warmup_max_lr": "auto",
             "warmup_num_steps": "auto"
         }
     },
     
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT

Training arguments

Let's setup all the training arguments needed by the script run_qa.py.

GPU

In [16]:
num_gpus = 1 # run the script on only one gpu
gpu = 0 # select the gpu

model, dataset, sequence

In [17]:
# setup the training argument
do_train = True # False
do_eval = True 

if dataset_name == "squad11pt":
    
    # dataset folder 
    root = Path.cwd()
    path_to_dataset = root.parent/'data'/dataset_name
    
    # paths to files
    train_file = str(path_to_dataset/'pt_squad-train-v1.1.json')
    validation_file = str(path_to_dataset/'pt_squad-dev-v1.1.json')
                          
# if you want to test the trainer, set up the following variables
max_train_samples = 200 # None
max_eval_samples = 50 # None

# The maximum total input sequence length after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
max_seq_length = 384
                          
# Whether to pad all samples to `max_seq_length`.
# If False, will pad the samples dynamically when batching to the maximum length in the batch
# (which can be faster on GPU but will be slower on TPU).
pad_to_max_length = True
        
# If true, some of the examples do not have an answer.
version_2_with_negative = False

# When splitting up a long document into chunks, how much stride to take between chunks.
doc_stride = 128

# The total number of n-best predictions to generate when looking for an answer.
n_best_size = 20
                          
# The maximum length of an answer that can be generated. This is needed because the start
# and end predictions are not conditioned on one another.                          
max_answer_length = 30
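The interaction between max_seq_length and doc_stride can be sketched in plain Python. This is a simplified stand-in for what the fast tokenizer does with overflowing tokens (`split_with_stride` is a hypothetical helper, and the sketch ignores the question tokens and special tokens that share each sequence with the context):

```python
def split_with_stride(tokens, max_len=384, stride=128):
    """Split a long token list into chunks of max_len, consecutive chunks overlapping by `stride` tokens."""
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride  # advance so adjacent chunks share `stride` tokens
    return chunks

tokens = list(range(1000))  # a document longer than max_seq_length
chunks = split_with_stride(tokens)
print(len(chunks))                          # 4 chunks
print(chunks[0][-128:] == chunks[1][:128])  # True: overlap of doc_stride tokens
```

The overlap ensures that an answer falling near a chunk boundary appears in full in at least one chunk.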

training_args()

In fp16 mode, the default adam_epsilon value of 1e-8 underflows to zero; the first of these values that remains non-zero in this mode is 1e-7. After some testing, we found that adam_epsilon = 1e-4 gives the best results.
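This fp16 behaviour can be checked directly, using numpy's float16 as a stand-in for CUDA half precision:

```python
import numpy as np

# the HF default adam_epsilon rounds to zero in half precision
print(np.float16(1e-8))  # 0.0
# 1e-7 survives as a non-zero (subnormal) fp16 value
print(np.float16(1e-7) > 0)  # True
# the value chosen in this notebook is comfortably representable
print(np.float16(1e-4) > 0)  # True
```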

We use the ZeRO-2 mode for DeepSpeed.

In [18]:
# epochs, bs, GA
evaluation_strategy = "epoch" # no
BS = batch_size
gradient_accumulation_steps = 1

# optimizer (AdamW)
learning_rate = 5e-5
weight_decay = 0.01 # 0.0
adam_beta1 = 0.9
adam_beta2 = 0.999
adam_epsilon = 1e-4 # 1e-08

# epochs
num_train_epochs = 3.

# scheduler
lr_scheduler_type = 'linear'
warmup_ratio = 0.0
warmup_steps = 0

# logs
logging_strategy = "steps"
logging_first_step = True # False
logging_steps = 500     # if strategy = "steps"
eval_steps = logging_steps # logging_steps

# checkpoints
save_strategy = "epoch" # steps
save_steps = 500 # if save_strategy = "steps"
save_total_limit = 1 # None

# no cuda, seed
no_cuda = False
seed = 42

# fp16
fp16 = True # False
fp16_opt_level = 'O1'
fp16_backend = "auto"
fp16_full_eval = False

# bar
disable_tqdm = False # True
remove_unused_columns = True
#label_names (List[str], optional) 

# best model
load_best_model_at_end = True # False
metric_for_best_model = "eval_f1"
greater_is_better = True

# deepspeed
zero = 2

if zero == 2:
    deepspeed_config = "ds_config_zero2.json"
elif zero == 3:
    deepspeed_config = "ds_config_zero3.json"
In [19]:
# folder for training outputs
outputs = model_name_or_path.replace('/','-') + '-' + dataset_name \
+ '_wd' + str(weight_decay) + '_eps' + str(adam_epsilon) \
+ '_epochs' + str(num_train_epochs) \
+ '-lr' + str(learning_rate)
path_to_outputs = root/'models_outputs'/outputs

# subfolder for model outputs
output_dir = path_to_outputs/'output_dir' 
overwrite_output_dir = True # False

# logs
logging_dir = path_to_outputs/'logging_dir'

Training + Evaluation

Update the system path with the virtual environment path

This is needed to launch the deepspeed command in our server configuration.

In [ ]:
import os
PATH = os.getenv('PATH')
%env PATH=/mnt/home/xxxx/anaconda3/envs/aaaa/bin:$PATH
    
# xxxx is the folder name where anaconda was installed
# aaaa is the name of the virtual environment in which this notebook is run

Setup environment variables

The magic command %env corresponds to export in Linux. It allows setting the values of all arguments of the script run_qa.py.

Note: since we noticed that the script runs fine in this notebook with plain local variables, without environment variables, we do not use this magic command.

Delete the output_dir (if exists)

In [ ]:
!rm -r {output_dir}

Now, we can launch the training :-)

In [24]:
# copy/paste/uncomment the 2 following lines into the next cell if you want to limit the number of samples (useful for testing)
# --max_train_samples $max_train_samples \
# --max_eval_samples $max_eval_samples \
In [ ]:
%%time
# !deepspeed --num_gpus=$num_gpus run_qa.py \
!deepspeed --include localhost:$gpu run_qa.py \
--model_name_or_path $model_name_or_path \
--train_file $train_file \
--do_train $do_train \
--do_eval $do_eval \
--validation_file $validation_file \
--max_seq_length $max_seq_length \
--pad_to_max_length $pad_to_max_length \
--version_2_with_negative $version_2_with_negative \
--doc_stride $doc_stride \
--n_best_size $n_best_size \
--max_answer_length $max_answer_length \
--output_dir $output_dir \
--overwrite_output_dir $overwrite_output_dir \
--evaluation_strategy $evaluation_strategy \
--per_device_train_batch_size $batch_size \
--per_device_eval_batch_size $batch_size \
--gradient_accumulation_steps $gradient_accumulation_steps \
--learning_rate $learning_rate \
--weight_decay $weight_decay \
--adam_beta1 $adam_beta1 \
--adam_beta2 $adam_beta2 \
--adam_epsilon $adam_epsilon \
--num_train_epochs $num_train_epochs \
--warmup_ratio $warmup_ratio \
--warmup_steps $warmup_steps \
--logging_dir $logging_dir \
--logging_strategy $logging_strategy \
--logging_first_step $logging_first_step \
--logging_steps $logging_steps \
--eval_steps $eval_steps \
--save_strategy $save_strategy \
--save_steps $save_steps \
--save_total_limit $save_total_limit \
--no_cuda $no_cuda \
--seed $seed \
--fp16 $fp16 \
--fp16_opt_level $fp16_opt_level \
--fp16_backend $fp16_backend \
--fp16_full_eval $fp16_full_eval \
--disable_tqdm $disable_tqdm \
--remove_unused_columns $remove_unused_columns \
--load_best_model_at_end $load_best_model_at_end \
--metric_for_best_model $metric_for_best_model \
--greater_is_better $greater_is_better \
--deepspeed $deepspeed_config

Results

[INFO|trainer_pt_utils.py:908] 2021-06-18 05:32:34,020 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:913] 2021-06-18 05:32:34,020 >>   epoch            =     3.0
[INFO|trainer_pt_utils.py:913] 2021-06-18 05:32:34,020 >>   eval_exact_match = 72.6774
[INFO|trainer_pt_utils.py:913] 2021-06-18 05:32:34,020 >>   eval_f1          = 84.4315
[INFO|trainer_pt_utils.py:913] 2021-06-18 05:32:34,020 >>   eval_samples     =   10917
CPU times: user 5min 5s, sys: 51.3 s, total: 5min 56s
Wall time: 3h 20min 36s

TensorBoard

In [4]:
#!pip install tensorboard
In [26]:
%load_ext tensorboard
# %reload_ext tensorboard
%tensorboard --logdir {logging_dir} --bind_all

Getting The Model Weights Out

In [21]:
!wget https://raw.githubusercontent.com/microsoft/DeepSpeed/master/deepspeed/utils/zero_to_fp32.py
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/mnt/home/pierre/.wget-hsts'. HSTS will be disabled.
--2021-06-18 12:02:06--  https://raw.githubusercontent.com/microsoft/DeepSpeed/master/deepspeed/utils/zero_to_fp32.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6468 (6.3K) [text/plain]
Saving to: ‘zero_to_fp32.py.1’

zero_to_fp32.py.1   100%[===================>]   6.32K  --.-KB/s    in 0s      

2021-06-18 12:02:06 (66.1 MB/s) - ‘zero_to_fp32.py.1’ saved [6468/6468]

In [ ]:
# in this training, checkpoint-16734 contains the best model
%cd {output_dir}/'checkpoint-16734'
In [32]:
!ls -al
total 651872
drwxrwxr-x 3 pierre pierre      4096 Jun 18 11:38 .
drwxrwxr-x 3 pierre pierre      4096 Jun 18 11:38 ..
-rw-rw-r-- 1 pierre pierre       855 Jun 18 11:38 config.json
drwxrwxr-x 2 pierre pierre      4096 Jun 18 11:38 global_step16734
-rw-rw-r-- 1 pierre pierre        16 Jun 18 11:38 latest
-rw-rw-r-- 1 pierre pierre 666791233 Jun 18 11:38 pytorch_model.bin
-rw-rw-r-- 1 pierre pierre     14657 Jun 18 11:38 rng_state_0.pth
-rw-rw-r-- 1 pierre pierre       112 Jun 18 11:38 special_tokens_map.json
-rw-rw-r-- 1 pierre pierre    438465 Jun 18 11:38 tokenizer.json
-rw-rw-r-- 1 pierre pierre       506 Jun 18 11:38 tokenizer_config.json
-rw-rw-r-- 1 pierre pierre      5051 Jun 18 11:38 trainer_state.json
-rw-rw-r-- 1 pierre pierre      3951 Jun 18 11:38 training_args.bin
-rw-rw-r-- 1 pierre pierre    209528 Jun 18 11:38 vocab.txt
-rwxrw-r-- 1 pierre pierre      6468 Jun 18 11:38 zero_to_fp32.py

We can observe that pytorch_model.bin has a size of 666 MB because the model weights have been saved in fp16 format. Let's use the DeepSpeed script zero_to_fp32.py to convert them to fp32 format.

In [34]:
path_to_zero_to_fp32 = root/'zero_to_fp32.py'
!python $path_to_zero_to_fp32 -h
usage: zero_to_fp32.py [-h] checkpoint_dir output_file

positional arguments:
  checkpoint_dir  path to the deepspeed checkpoint folder, e.g.,
                  path/checkpoint-1/global_step1
  output_file     path to the pytorch fp32 state_dict output file (e.g.
                  path/checkpoint-1/pytorch_model.bin)

optional arguments:
  -h, --help      show this help message and exit
In [35]:
!python $path_to_zero_to_fp32 global_step16734 pytorch_model.bin
Processing zero checkpoint 'global_step16734'
Detected checkpoint of type zero stage 2, world_size: 1
Saving fp32 state dict to pytorch_model.bin (total_numel=333348866)
In [38]:
!ls -al
total 1302912
drwxrwxr-x 3 pierre pierre       4096 Jun 18 11:38 .
drwxrwxr-x 3 pierre pierre       4096 Jun 18 11:38 ..
-rw-rw-r-- 1 pierre pierre        855 Jun 18 11:38 config.json
drwxrwxr-x 2 pierre pierre       4096 Jun 18 11:38 global_step16734
-rw-rw-r-- 1 pierre pierre         16 Jun 18 11:38 latest
-rw-rw-r-- 1 pierre pierre 1333453496 Jun 18 14:28 pytorch_model.bin
-rw-rw-r-- 1 pierre pierre      14657 Jun 18 11:38 rng_state_0.pth
-rw-rw-r-- 1 pierre pierre        112 Jun 18 11:38 special_tokens_map.json
-rw-rw-r-- 1 pierre pierre     438465 Jun 18 11:38 tokenizer.json
-rw-rw-r-- 1 pierre pierre        506 Jun 18 11:38 tokenizer_config.json
-rw-rw-r-- 1 pierre pierre       5051 Jun 18 11:38 trainer_state.json
-rw-rw-r-- 1 pierre pierre       3951 Jun 18 11:38 training_args.bin
-rw-rw-r-- 1 pierre pierre     209528 Jun 18 11:38 vocab.txt
-rwxrw-r-- 1 pierre pierre       6468 Jun 18 11:38 zero_to_fp32.py

That's it! The new size of pytorch_model.bin (1.3 GB, double the fp16 size) shows that the weights are now stored in fp32 format.
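The file sizes are consistent with the parameter count reported by zero_to_fp32.py (total_numel=333348866): each parameter takes 2 bytes in fp16 and 4 bytes in fp32, matching the ~666 MB and ~1.33 GB files up to serialization overhead:

```python
num_params = 333_348_866  # total_numel reported by zero_to_fp32.py

fp16_bytes = num_params * 2  # half precision: 2 bytes per weight
fp32_bytes = num_params * 4  # single precision: 4 bytes per weight

print(f"fp16: {fp16_bytes / 1e9:.2f} GB")  # 0.67 GB
print(f"fp32: {fp32_bytes / 1e9:.2f} GB")  # 1.33 GB
```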

In [ ]:
%cd {root}

Save the model to HF format

In [40]:
import pathlib
from pathlib import Path
In [41]:
# model source
source = root/output_dir/'checkpoint-16734/'

# model destination
dest = root/'HFmodels'
fname_HF = 'bert-large-cased-squad-v1.1-portuguese'
path_to_awesome_name_you_picked = dest/fname_HF
path_to_awesome_name_you_picked.mkdir(exist_ok=True, parents=True)
In [42]:
# copy model to destination
!cp {source}/'config.json' {dest/fname_HF}
!cp {source}/'pytorch_model.bin' {dest/fname_HF}

# copy tokenizer to destination
!cp {source}/'tokenizer_config.json' {dest/fname_HF}
!cp {source}/'special_tokens_map.json' {dest/fname_HF}
!cp {source}/'vocab.txt' {dest/fname_HF}

Make your model work on all frameworks (source)

In [46]:
from transformers import BertForQuestionAnswering
pt_model = BertForQuestionAnswering.from_pretrained(str(path_to_awesome_name_you_picked))
pt_model.save_pretrained(str(path_to_awesome_name_you_picked))
In [48]:
import tensorflow
from transformers import TFBertForQuestionAnswering

tf_model = TFBertForQuestionAnswering.from_pretrained(str(path_to_awesome_name_you_picked), from_pt=True)
tf_model.save_pretrained(str(path_to_awesome_name_you_picked))
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForQuestionAnswering: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForQuestionAnswering from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForQuestionAnswering from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForQuestionAnswering for predictions without further training.
In [49]:
!ls -al {path_to_awesome_name_you_picked}
total 2605184
drwxrwxr-x 2 pierre pierre       4096 Jun 18 14:37 .
drwxrwxr-x 3 pierre pierre       4096 Jun 18 14:34 ..
-rw-rw-r-- 1 pierre pierre        918 Jun 18 14:37 config.json
-rw-rw-r-- 1 pierre pierre 1333560247 Jun 18 14:37 pytorch_model.bin
-rw-rw-r-- 1 pierre pierre        112 Jun 18 14:34 special_tokens_map.json
-rw-rw-r-- 1 pierre pierre 1333906712 Jun 18 14:37 tf_model.h5
-rw-rw-r-- 1 pierre pierre        506 Jun 18 14:34 tokenizer_config.json
-rw-rw-r-- 1 pierre pierre     209528 Jun 18 14:34 vocab.txt

Model sharing and uploading to the HF models hub

Don't forget to upload your model to the đŸ€— Model Hub. You can then use it online to generate results like the one shown in the first picture of this notebook!

Use our QA model

Gradio

In [ ]:
import gradio as gr

iface = gr.Interface.load("huggingface/pierreguillou/bert-large-cased-squad-v1.1-portuguese",server_name='xxxx')
iface.launch()

# xxxx is your server name (alias)

Code

In [ ]:
import transformers
import pathlib
from pathlib import Path
In [51]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_qa = AutoModelForQuestionAnswering.from_pretrained(path_to_awesome_name_you_picked)
tokenizer_qa = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
In [55]:
from transformers import pipeline
nlp = pipeline("question-answering", model=model_qa, tokenizer=tokenizer_qa)
In [93]:
# source: https://pt.wikipedia.org/wiki/Pandemia_de_COVID-19
context = r"""A pandemia de COVID-19, também conhecida como pandemia de coronavírus, é uma pandemia em curso de COVID-19, 
uma doença respiratória causada pelo coronavírus da síndrome respiratória aguda grave 2 (SARS-CoV-2). 
O vírus tem origem zoonótica e o primeiro caso conhecido da doença remonta a dezembro de 2019 em Wuhan, na China. 
Em 20 de janeiro de 2020, a Organização Mundial da SaĂșde (OMS) classificou o surto 
como EmergĂȘncia de SaĂșde PĂșblica de Âmbito Internacional e, em 11 de março de 2020, como pandemia. 
Em 18 de junho de 2021, 177 349 274 casos foram confirmados em 192 paĂ­ses e territĂłrios, 
com 3 840 181 mortes atribuídas à doença, tornando-se uma das pandemias mais mortais da história.
Os sintomas de COVID-19 são altamente variåveis, variando de nenhum a doenças com risco de morte. 
O vĂ­rus se espalha principalmente pelo ar quando as pessoas estĂŁo perto umas das outras. 
Ele deixa uma pessoa infectada quando ela respira, tosse, espirra ou fala e entra em outra pessoa pela boca, nariz ou olhos.
Ele também pode se espalhar através de superfícies contaminadas. 
As pessoas permanecem contagiosas por até duas semanas e podem espalhar o vírus mesmo se forem assintomåticas.
"""
In [95]:
%%time
question = "Quando começou a pandemia de Covid-19 no mundo?"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'dezembro de 2019', score: 0.5087, start: 290, end: 306
CPU times: user 1min 55s, sys: 7.79 s, total: 2min 2s
Wall time: 3.52 s
In [96]:
%%time
question = "Qual Ă© a data de inĂ­cio da pandemia Covid-19 em todo o mundo?"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'dezembro de 2019', score: 0.4988, start: 290, end: 306
CPU times: user 1min 56s, sys: 6.79 s, total: 2min 3s
Wall time: 3.5 s
In [98]:
%%time
question = "A Covid-19 tem algo a ver com animais?"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'O vĂ­rus tem origem zoonĂłtica', score: 0.6001, start: 213, end: 241
CPU times: user 1min 57s, sys: 13.8 s, total: 2min 11s
Wall time: 3.76 s
In [99]:
%%time
question = "Onde foi descoberta a Covid-19?"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'Wuhan, na China', score: 0.9415, start: 310, end: 325
CPU times: user 1min 57s, sys: 9.3 s, total: 2min 6s
Wall time: 3.62 s
In [100]:
%%time
question = "Quantos casos houve?"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: '177 349 274', score: 0.828, start: 536, end: 547
CPU times: user 1min 54s, sys: 11.6 s, total: 2min 6s
Wall time: 3.62 s
In [101]:
%%time
question = "Quantos mortes?"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: '3 840 181', score: 0.906, start: 606, end: 615
CPU times: user 1min 58s, sys: 13.3 s, total: 2min 11s
Wall time: 3.77 s
In [102]:
%%time
question = "Quantos paises tiveram casos?"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: '192', score: 0.8958, start: 575, end: 578
CPU times: user 1min 54s, sys: 10 s, total: 2min 4s
Wall time: 3.56 s
In [103]:
%%time
question = "Quais sĂŁo sintomas de COVID-19"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'nenhum a doenças com risco de morte', score: 0.298, start: 761, end: 796
CPU times: user 1min 56s, sys: 11.5 s, total: 2min 8s
Wall time: 3.66 s
In [104]:
%%time
question = "Como se espalha o vĂ­rus?"

result = nlp(question=question, context=context)

print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'principalmente pelo ar quando as pessoas estĂŁo perto umas das outras', score: 0.3173, start: 818, end: 886
CPU times: user 1min 52s, sys: 8.4 s, total: 2min 1s
Wall time: 3.46 s

END