Notebook

Fast inference for Hugging Face tasks models on GPU (for example: QA model)¶

Author: Pierre Guillou
Version & Date : v1 (10/22/2021)
Blog post: NLP nas empresas | Técnicas para acelerar modelos de Deep Learning para inferência em produção
Other notebook : fast_inference_transformers_on_CPU.ipynb

Notebook overview¶

Objective¶

The objective of this notebook is to help those who want to accelerate inference time on GPU for tasks models of Hugging Face (NER, QA, Classification...).

Method for inference¶

source: https://discuss.pytorch.org/t/model-eval-vs-with-torch-no-grad/19615/2

model.eval() will notify all your layers that you are in eval mode, that way, batchnorm or dropout layers will work in eval mode instead of training mode.
torch.no_grad() impacts the autograd engine and deactivate it. It will reduce memory usage and speed up computations but you won’t be able to backprop (which you don’t want in an eval script).

References¶

post blog from HF and Microsoft: Accelerate your NLP pipelines using Hugging Face Transformers and ONNX Runtime (19/05/2020)

Note: data from the notebook fast_inference_transformers_on_CPU.ipynb

In [1]:

mean_time_cpu, pipeline_mean_time_cpu = 889.07, 872.47
onnx_mean_time_cpu, torchscript_mean_time_cpu = 711.5, 849

System overview¶

In [2]:

import platform
platform.platform()

Out[2]:

'Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic'

CPU¶

In [3]:

from psutil import *

In [4]:

cpu_count(),cpu_stats()

Out[4]:

(2,
 scpustats(ctx_switches=655478, interrupts=338520, soft_interrupts=322078, syscalls=0))

In [5]:

!cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2299.998
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 4599.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2299.998
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 4599.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

In [6]:

!df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay          79G   47G   32G  60% /
tmpfs            64M     0   64M   0% /dev
tmpfs           6.4G     0  6.4G   0% /sys/fs/cgroup
shm             5.8G     0  5.8G   0% /dev/shm
/dev/root       2.0G  1.2G  821M  59% /sbin/docker-init
/dev/sda1        86G   51G   35G  60% /opt/bin/.nvidia
tmpfs           6.4G   28K  6.4G   1% /var/colab
tmpfs           6.4G     0  6.4G   0% /proc/acpi
tmpfs           6.4G     0  6.4G   0% /proc/scsi
tmpfs           6.4G     0  6.4G   0% /sys/firmware

In [7]:

virtual_memory()

Out[7]:

svmem(total=13622194176, available=12734242816, percent=6.5, used=603373568, free=10599632896, active=1086910464, inactive=1633361920, buffers=135700480, cached=2283487232, shared=1212416, slab=199516160)

GPU¶

Let's check the CUDA configuration of our Colab notebook (firstly, you need to associate a GPU to your notebook as Google Colab comes with both options GPU or without GPU. You can enable or disable GPU in runtime settings: go to Menu > Runtime > Change runtime, select GPU and save your notebook settings.

In [8]:

!nvidia-smi

Sat Oct 23 17:02:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In [9]:

!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

On 22/10/2021, ONNX Runtime GPU 1.9 needs CUDA 11.4 and cuDNN 8.2.4 (Linux) in Ubuntu 18.4 (check Requirements).

If this is not the current situation of this Colab notebook (see results from previous cells), you need to install another CUDA version that fits with ONNX Runtime GPU 1.9

To do that, you will find the links to CUDA 11.4 here. Click on CUDA Toolkit 11.4.2 (September 2021), select the options as following:

Then, copy/paste in the following cell the code proposed by NVIDIA and run it with an exclamation point (!) at the beginning of each line. Just wait few minutes, check the CUDA version and that's it!

In [10]:

%%time
%%capture
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
!sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/11.4.2/local_installers/cuda-repo-ubuntu1804-11-4-local_11.4.2-470.57.02-1_amd64.deb
!sudo dpkg -i cuda-repo-ubuntu1804-11-4-local_11.4.2-470.57.02-1_amd64.deb
!sudo apt-key add /var/cuda-repo-ubuntu1804-11-4-local/7fa2af80.pub
!sudo apt-get update
!sudo apt-get -y install cuda

CPU times: user 3.41 s, sys: 605 ms, total: 4.02 s
Wall time: 8min 14s

In [11]:

!nvidia-smi

Sat Oct 23 17:10:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In [12]:

!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_19:13:29_PDT_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0

Great! We have the right CUDA version for ONNX Runtime 1.9!

Installation¶

In [13]:

%%capture
!pip install transformers

In [14]:

import transformers, torch, numpy as np

print("transformers:",transformers.__version__)
print("torch:",torch.__version__)
print("numpy:",np.__version__)

transformers: 4.11.3
torch: 1.9.0+cu111
numpy: 1.19.5

In [15]:

from time import perf_counter
def timer(f,*args):   
    start = perf_counter()
    f(*args)
    return (1000 * (perf_counter() - start))

QA model¶

Model at https://huggingface.co/pierreguillou/bert-base-cased-squad-v1.1-portuguese

In [16]:

model_checkpoint = "pierreguillou/bert-base-cased-squad-v1.1-portuguese"
# model_checkpoint = "pierreguillou/bert-large-cased-squad-v1.1-portuguese"

In [17]:

context = r"""
A pandemia de COVID-19, também conhecida como pandemia de coronavírus, é uma pandemia em curso de COVID-19, 
uma doença respiratória aguda causada pelo coronavírus da síndrome respiratória aguda grave 2 (SARS-CoV-2). 
A doença foi identificada pela primeira vez em Wuhan, na província de Hubei, República Popular da China, 
em 1 de dezembro de 2019, mas o primeiro caso foi reportado em 31 de dezembro do mesmo ano. 
Acredita-se que o vírus tenha uma origem zoonótica, porque os primeiros casos confirmados 
tinham principalmente ligações ao Mercado Atacadista de Frutos do Mar de Huanan, que também vendia animais vivos. 
Em 11 de março de 2020, a Organização Mundial da Saúde declarou o surto uma pandemia. Até 8 de fevereiro de 2021, 
pelo menos 105 743 102 casos da doença foram confirmados em pelo menos 191 países e territórios, 
com cerca de 2 308 943 mortes e 58 851 440 pessoas curadas.
"""

question = "Quando começou a pandemia de Covid-19 no mundo?"

1. Check the model¶

Before evaluating its inference time, let's check that our QA model is working well.

1.1 Without pipeline¶

In [18]:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
model.eval();

Downloading:   0%|          | 0.00/494 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/862 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/205k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/413M [00:00<?, ?B/s]

In [19]:

# code source: https://huggingface.co/transformers/master/task_summary.html#extractive-question-answering

# tokenize inputs
inputs = tokenizer(question, context, add_special_tokens=True, return_tensors="pt")

# get outputs
outputs = model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Get the most likely beginning of answer with the argmax of the score
answer_start = torch.argmax(answer_start_scores)
# Get the most likely end of answer with the argmax of the score
answer_end = torch.argmax(answer_end_scores) + 1

input_ids = inputs["input_ids"].tolist()[0]
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

# print answer
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: Quando começou a pandemia de Covid-19 no mundo?
Answer: 1 de dezembro de 2019

That's the right answer!

1.2 With pipeline¶

We can use Pipeline, too.

In [20]:

from transformers import pipeline

nlp = pipeline("question-answering", model=model_checkpoint)

In [21]:

# get result
result = nlp(question, context)

# print answer
print(f"Question: {question}")
print(f"Answer: {result['answer']} (score: {round(result['score'], 4)})")

Question: Quando começou a pandemia de Covid-19 no mundo?
Answer: 1 de dezembro de 2019 (score: 0.713)

That's the right answer!

2. Inference time | PyTorch model¶

2.1 PyTorch model (without pipeline)¶

In [22]:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
model.eval();

1. Tokenize the inputs¶

In [23]:

num=100

total = 0
for i in range(num):
  start = perf_counter()
  inputs = tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
  diff = perf_counter() - start
  total += diff

mean_tokenizer = round((total/num)*1000,2)
print(f'average time: {mean_tokenizer} ms')

average time: 0.76 ms

2. Model on GPU¶

In [24]:

# put model and inputs to gpu
model = model.to('cuda')
inputs = inputs.to('cuda')

# get mean time
with torch.no_grad():
  mean_time_gpu = round(np.mean([timer(model,inputs.input_ids,inputs.token_type_ids,inputs.attention_mask) for _ in range(100)]),2)

print(f'average time: {mean_time_gpu} ms')

average time: 44.42 ms

Now, we can evaluate the time to get the answer.

In [25]:

# get outputs
with torch.no_grad(): 
  outputs = model(**inputs)

In [26]:

num = 100

total = 0
for i in range(num):
  start = perf_counter()
  answer_start_scores = outputs.start_logits
  answer_end_scores = outputs.end_logits

  # Get the most likely beginning of answer with the argmax of the score
  answer_start = torch.argmax(answer_start_scores)
  # Get the most likely end of answer with the argmax of the score
  answer_end = torch.argmax(answer_end_scores) + 1

  input_ids = inputs["input_ids"].tolist()[0]
  answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
  diff = perf_counter() - start
  total += diff

mean_time_gpu_answer = round((total/num)*1000,2)
print(f'average time: {mean_time_gpu_answer} ms')

average time: 1.52 ms

Then, we have the total time when the model is on the GPU:

In [27]:

total_gpu = round(mean_tokenizer + mean_time_gpu + mean_time_gpu_answer,2)
print(f'time: {total_gpu} ms')

time: 46.7 ms

2.2 PyTorch model (with pipeline)¶

We can use Pipeline, too.

In [28]:

from transformers import pipeline

Model on GPU¶

We have the total time when the model is on the GPU:

In [29]:

# put model and inputs to gpu
nlp = pipeline("question-answering", model=model_checkpoint, use_fast=True, device=0)

# get mean time
with torch.no_grad():
  pipeline_mean_time_gpu = round(np.mean([timer(nlp,question,context) for _ in range(100)]),2)

print(f'average time: {pipeline_mean_time_gpu} ms')

/usr/local/lib/python3.7/dist-packages/transformers/pipelines/base.py:901: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
  UserWarning,

average time: 48.31 ms

2.3 Results with PyTorch¶

Let's compage the GPU results with the ones with CPU (check link to the CPU notebook at the beginning of this notebook.

In [30]:

import pandas as pd

raw_data = {
            'Latency on CPU (ms)': [mean_time_cpu, pipeline_mean_time_cpu],
            'Latency on GPU (ms)': [mean_time_gpu, pipeline_mean_time_gpu],
            }

df = pd.DataFrame(raw_data,
                  index=pd.Index(['Without pipeline', 'With pipeline']),
                  columns=pd.Index(['Latency on CPU (ms)','Latency on GPU (ms)']))

df

Out[30]:

	Latency on CPU (ms)	Latency on GPU (ms)
Without pipeline	889.07	44.42
With pipeline	872.47	48.31

In [31]:

import numpy as np
import matplotlib.pyplot as plt

labels = ['CPU','GPU']
data = [
        [mean_time_cpu, mean_time_gpu],
        [pipeline_mean_time_cpu, pipeline_mean_time_gpu]
        ]

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

X = np.arange(2)
ax.bar(X - 0.1, data[0], color = 'r', width = 0.2, label='Without pipeline')
ax.bar(X + 0.1, data[1], color = 'g', width = 0.2, label='With pipeline')

# axes and title
x = np.arange(len(labels))  # the label locations
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel('Latency (ms)')
ax.set_title('Inference latency of PyTorch model (CPU vs GPU) in Google Colab')

leg = ax.legend();

In [32]:

import numpy as np
import matplotlib.pyplot as plt

labels = ['GPU']
data = [mean_time_gpu, pipeline_mean_time_gpu]

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
 
X = np.arange(1)
ax.bar(X - 0.1, data[0], color = 'r', width = 0.2, label='PyTorch (without pipeline)')
ax.bar(X + 0.1, data[1], color = 'g', width = 0.2, label='PyTorch (with pipeline)')

# axes and title
x = np.arange(len(labels))  # the label locations
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel('Latency (ms)')
ax.set_title('Inference latency of PyTorch model (GPU) in Google Colab')

leg = ax.legend();

Pipeline does not help improve latency neither on CPU and GPU.
Using the GPU allows inferring up to 19 times faster than with the CPU (45 ms for the PyTorch model without a pipeline).

3. Inference time | ONNX Runtime¶

ONNX Runtime helps accelerate PyTorch and TensorFlow models in production, on CPU or GPU. As an open source library built for performance and broad platform support, ONNX Runtime is used in products and services handling over 20 billion inferences each day.

3.1 Old method | convert_graph_to_onnx.py¶

You can use ONNX Runtime and Hugging Face Transformers together to improve the experience of training and deploying NLP models. Hugging Face has made it easy to inference Transformer models with ONNX Runtime with the transformers/convert_graph_to_onnx.py which generates a model that can be loaded by ONNX Runtime.

In [33]:

!python -m transformers.onnx --help

usage: Hugging Face ONNX Exporter tool [-h] -m MODEL
                                       [--feature {causal-lm,causal-lm-with-past,default,default-with-past,masked-lm,seq2seq-lm,seq2seq-lm-with-past,sequence-classification,sequence-classification-with-past,token-classification}]
                                       [--opset OPSET] [--atol ATOL]
                                       output

positional arguments:
  output                Path indicating where to store generated ONNX model.

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Model's name of path on disk to load.
  --feature {causal-lm,causal-lm-with-past,default,default-with-past,masked-lm,seq2seq-lm,seq2seq-lm-with-past,sequence-classification,sequence-classification-with-past,token-classification}
                        Export the model with some additional feature.
  --opset OPSET         ONNX opset version to export the model with (default
                        12).
  --atol ATOL           Absolute difference tolerence when validating the
                        model.

SUPPORTED_PIPELINES = [
    "feature-extraction",
    "ner",
    "sentiment-analysis",
    "fill-mask",
    "question-answering",
    "text-generation",
    "translation_en_to_fr",
    "translation_en_to_de",
    "translation_en_to_ro",
]

Get the file convert_graph_to_onnx.py

In [34]:

!wget https://raw.githubusercontent.com/huggingface/transformers/master/src/transformers/convert_graph_to_onnx.py

--2021-10-23 17:12:06--  https://raw.githubusercontent.com/huggingface/transformers/master/src/transformers/convert_graph_to_onnx.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18640 (18K) [text/plain]
Saving to: ‘convert_graph_to_onnx.py’

convert_graph_to_on 100%[===================>]  18.20K  --.-KB/s    in 0s      

2021-10-23 17:12:06 (60.0 MB/s) - ‘convert_graph_to_onnx.py’ saved [18640/18640]

3.1.1 Tokenize the inputs¶

In [35]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [36]:

num=100

total = 0
for i in range(num):
  start = perf_counter()
  # WARNING!!!!!!! return_tensors="np" and not return_tensors="pt"
  inputs = tokenizer(question, context, add_special_tokens=True, return_tensors="np")
  diff = perf_counter() - start
  total += diff

onnx_mean_tokenizer = round((total/num)*1000,2)
print(f'average time: {onnx_mean_tokenizer} ms')

average time: 0.83 ms

3.1.2 Model on GPU¶

Installation¶

In [37]:

# uninstall onnxruntime cpu
# !pip uninstall onnxruntime

In [38]:

%%capture
# onnxruntime gpu
!pip install onnx
!pip install onnxruntime-gpu

In [39]:

import onnxruntime as ort
ort.get_device()

Out[39]:

'GPU'

In [40]:

import onnxruntime
print("onnxruntime:",onnxruntime.__version__)

onnxruntime: 1.9.0

Convert the transformer model to its quantized onnx version¶

In [41]:

model_checkpoint_onnx = 'onnx_gpu/' + model_checkpoint.replace('/','-') + '.onnx'

!python convert_graph_to_onnx.py \
--pipeline question-answering \
--model {model_checkpoint} \
--tokenizer {model_checkpoint} \
--framework pt \
--opset 11 \
--check-loading \
--use-external-format \
--quantize \
{model_checkpoint_onnx}

====== Converting model to ONNX ======
ONNX opset version set to: 11
Loading pipeline (model: pierreguillou/bert-base-cased-squad-v1.1-portuguese, tokenizer: pierreguillou/bert-base-cased-squad-v1.1-portuguese)
Creating folder /content/onnx_gpu
Using framework PyTorch: 1.9.0+cu111
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch', 1: 'sequence'}
Found output output_1 with shape: {0: 'batch', 1: 'sequence'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']

====== Optimizing ONNX model ======
/usr/local/lib/python3.7/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:353: UserWarning: Deprecation warning. This ORT build has ['CUDAExecutionProvider', 'CPUExecutionProvider'] enabled. The next release (ORT 1.10) will require explicitly setting the providers parameter (as opposed to the current behavior of providers getting set/registered by default based on the build flags) when instantiating InferenceSession.For example, onnxruntime.InferenceSession(..., providers=["CUDAExecutionProvider"], ...)
"based on the build flags) when instantiating InferenceSession."
2021-10-23 17:12:53.079006789 [W:onnxruntime:, inference_session.cc:1419 Initialize] Serializing optimized model with Graph Optimization level greater than ORT_ENABLE_EXTENDED and the NchwcTransformer enabled. The generated model may contain hardware specific optimizations, and should only be used in the same environment the model was optimized in.
Optimized model has been written at /content/onnx_gpu/pierreguillou-bert-base-cased-squad-v1.onnx: ✔
/!\ Optimized model contains hardware specific operators which might not be portable. /!\
As of onnxruntime 1.4.0, models larger than 2GB will fail to quantize due to protobuf constraint.
This limitation will be removed in the next release of onnxruntime.
WARNING:root:onnxruntime.quantization.quantize is deprecated.
Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator FusedMatMul. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Warning: Unsupported operator Gelu. No schema registered for this operator.
Warning: Unsupported operator LayerNormalization. No schema registered for this operator.
Quantized model has been written at /content/onnx_gpu/pierreguillou-bert-base-cased-squad-v1-quantized.onnx: ✔

====== Check exported ONNX model(s) ======
Checking ONNX model loading from: /content/onnx_gpu/pierreguillou-bert-base-cased-squad-v1.1-portuguese.onnx ...
Model /content/onnx_gpu/pierreguillou-bert-base-cased-squad-v1.1-portuguese.onnx correctly loaded: ✔
Checking ONNX model loading from: /content/onnx_gpu/pierreguillou-bert-base-cased-squad-v1.onnx ...
Model /content/onnx_gpu/pierreguillou-bert-base-cased-squad-v1.onnx correctly loaded: ✔
Checking ONNX model loading from: /content/onnx_gpu/pierreguillou-bert-base-cased-squad-v1-quantized.onnx ...
Model /content/onnx_gpu/pierreguillou-bert-base-cased-squad-v1-quantized.onnx correctly loaded: ✔

Import the onnx quantized version of the model¶

In [42]:

import onnxruntime as ort

# copy/paste the path to the file xxx.quantized.onnx
ort_session = ort.InferenceSession("/content/" + model_checkpoint_onnx)

/usr/local/lib/python3.7/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:353: UserWarning: Deprecation warning. This ORT build has ['CUDAExecutionProvider', 'CPUExecutionProvider'] enabled. The next release (ORT 1.10) will require explicitly setting the providers parameter (as opposed to the current behavior of providers getting set/registered by default based on the build flags) when instantiating InferenceSession.For example, onnxruntime.InferenceSession(..., providers=["CUDAExecutionProvider"], ...)
  "based on the build flags) when instantiating InferenceSession."

Run the onnx model¶

In [43]:

num = 100

total = 0
for i in range(num):
  start = perf_counter()
  outputs = ort_session.run(None, dict(inputs))
  diff = perf_counter() - start
  total += diff

onnx_mean_time_gpu = round((total/num)*1000,2)
print(f'average time: {onnx_mean_time_gpu} ms')

average time: 37.54 ms

Now, we can evaluate the time to get the answer.

In [44]:

num = 100

total = 0
for i in range(num):
  start = perf_counter()
  # code source: https://huggingface.co/transformers/master/task_summary.html#extractive-question-answering

  answer_start_scores = outputs[0]
  answer_end_scores = outputs[1]

  # Get the most likely beginning of answer with the argmax of the score
  answer_start = np.argmax(answer_start_scores)
  # Get the most likely end of answer with the argmax of the score
  answer_end = np.argmax(answer_end_scores) + 1

  input_ids = inputs["input_ids"].tolist()[0]
  answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

  diff = perf_counter() - start
  total += diff
  
  # print(f"Question: {question}")
  # print(f"Answer: {answer}")

onnx_mean_time_gpu_answer = round((total/num)*1000,2)
print(f'average time: {onnx_mean_time_gpu_answer} ms')

average time: 0.12 ms

In [45]:

print(f"Question: {question}")
print(f"Answer: {answer}")

Question: Quando começou a pandemia de Covid-19 no mundo?
Answer: 1 de dezembro de 2019

Then, we have the total time when the model is on the GPU:

In [46]:

onnx_total_gpu = round(onnx_mean_tokenizer + onnx_mean_time_gpu + onnx_mean_time_gpu_answer,2)
print(f'time: {onnx_total_gpu} ms')

time: 38.49 ms

3.1.4 Results with ONNX Runtime¶

Let's compage the GPU results with the ones with CPU (check link to the CPU notebook at the beginning of this notebook.

In [47]:

import pandas as pd

raw_data = {
            'Latency on CPU (ms)': [mean_time_cpu, onnx_mean_time_cpu],
            'Latency on GPU (ms)': [mean_time_gpu, onnx_mean_time_gpu],
            }

df = pd.DataFrame(raw_data,
                  index=pd.Index(['PyTorch (without pipeline)','ONNX Runtime']),
                  columns=pd.Index(['Latency on CPU (ms)','Latency on GPU (ms)']))

df

Out[47]:

	Latency on CPU (ms)	Latency on GPU (ms)
PyTorch (without pipeline)	889.07	44.42
ONNX Runtime	711.50	37.54

In [48]:

import numpy as np
import matplotlib.pyplot as plt

labels = ['CPU','GPU']
data = [
        [mean_time_cpu, mean_time_gpu],
        [onnx_mean_time_cpu, onnx_mean_time_gpu]
        ]

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

X = np.arange(2)
ax.bar(X - 0.1, data[0], color = 'r', width = 0.2, label='PyTorch (without pipeline)')
ax.bar(X + 0.1, data[1], color = 'g', width = 0.2, label='ONNX Runtime')

# axes and title
x = np.arange(len(labels))  # the label locations
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel('Latency (ms)')
ax.set_title('Inference latency (CPU vs GPU) in Google Colab')

leg = ax.legend();

In [49]:

import numpy as np
import matplotlib.pyplot as plt

labels = ['GPU']
data = [mean_time_gpu, onnx_mean_time_gpu]

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

X = np.arange(1)
ax.bar(X - 0.1, data[0], color = 'r', width = 0.2, label='PyTorch (without pipeline)')
ax.bar(X + 0.1, data[1], color = 'g', width = 0.2, label='ONNX Runtime')

# axes and title
x = np.arange(len(labels))  # the label locations
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel('Latency (ms)')
ax.set_title('Inference latency (GPU) in Google Colab')

leg = ax.legend();

ONNX Runtime does help improve latency both on CPU and GPU, and improves results in comparison to PyTorch.
On GPU, using ONNX Runtime allows inferring up to 1.2 times faster than with PyTorch (38 ms with ONNX Runtime).

3.2 New method | transformers.onnx¶

(source: Configuration-based approach) Transformers v4.9.0 introduces a new package: transformers.onnx. This package allows converting checkpoints to an ONNX graph by leveraging configuration objects. These configuration objects come ready made for a number of model architectures, and are made to be easily extendable to other architectures.

Ready-made configurations include the following models:

ALBERT
BART
BERT
DistilBERT
GPT Neo
LayoutLM
Longformer
mBART
OpenAI GPT-2
RoBERTa
T5
XLM-RoBERTa

Run transformers.onnx (or the conversion script located at transformers/convert_graph_to_onnx.py). This script takes a few arguments such as the model to be exported and the framework you want to export from (PyTorch or TensorFlow).

WARNING: which arguments to use?

It will be exported under onnx/pierreguillou-bert-base-cased-squad-v1.1-portuguese.

In [50]:

%%time
model_checkpoint_onnx = 'onnx/' + model_checkpoint.replace('/','-')

!python -m transformers.onnx --model {model_checkpoint} {model_checkpoint_onnx}

Some weights of the model checkpoint at pierreguillou/bert-base-cased-squad-v1.1-portuguese were not used when initializing BertModel: ['qa_outputs.weight', 'qa_outputs.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at pierreguillou/bert-base-cased-squad-v1.1-portuguese and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using framework PyTorch: 1.9.0+cu111
Overriding 1 configuration item(s)
	- use_cache -> False
Validating ONNX model...
/usr/local/lib/python3.7/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py:353: UserWarning: Deprecation warning. This ORT build has ['CUDAExecutionProvider', 'CPUExecutionProvider'] enabled. The next release (ORT 1.10) will require explicitly setting the providers parameter (as opposed to the current behavior of providers getting set/registered by default based on the build flags) when instantiating InferenceSession.For example, onnxruntime.InferenceSession(..., providers=["CUDAExecutionProvider"], ...)
  "based on the build flags) when instantiating InferenceSession."
	-[✓] ONNX model outputs' name match reference model ({'pooler_output', 'last_hidden_state'}
	- Validating ONNX Model output "last_hidden_state":
		-[✓] (2, 8, 768) matches (2, 8, 768)
		-[✓] all values close (atol: 0.0001)
	- Validating ONNX Model output "pooler_output":
		-[✓] (2, 768) matches (2, 768)
		-[✓] all values close (atol: 0.0001)
All good, model saved at: onnx/pierreguillou-bert-base-cased-squad-v1.1-portuguese/model.onnx
CPU times: user 182 ms, sys: 86.6 ms, total: 268 ms
Wall time: 23.4 s

The outputs can be obtained by taking a look at the ONNX configuration of each model. For example, for BERT:

In [51]:

from transformers.models.bert import BertOnnxConfig, BertConfig

config = BertConfig()
onnx_config = BertOnnxConfig(config)
output_keys = list(onnx_config.outputs.keys())
output_keys

Out[51]:

['last_hidden_state', 'pooler_output']

WARNING: these outputs do not allow to get an answer!!!!!

We can not continue.

In [52]:

# import onnxruntime as ort
# ort_session = ort.InferenceSession('onnx/pierreguillou-bert-base-cased-squad-v1.1-portuguese/model.onnx')

4. Inference time | TorchScript¶

source: https://huggingface.co/transformers/serialization.html#torchscript

According to Pytorch’s documentation: “TorchScript is a way to create serializable and optimizable models from PyTorch code”. Pytorch’s two modules JIT and TRACE allow the developer to export their model to be re-used in other programs, such as efficiency-oriented C++ programs.

Hugging Face provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be reused in a different environment than a Pytorch-based python program.

In [53]:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True, torchscript=True)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint, torchscript=True)
model.eval();

4.1 Tokenize the inputs¶

In [54]:

num=100

total = 0
for i in range(num):
  start = perf_counter()
  inputs = tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
  diff = perf_counter() - start
  total += diff

torchscript_mean_tokenizer = round((total/num)*1000,2)
print(f'average time: {torchscript_mean_tokenizer} ms')

average time: 0.83 ms

4.2 Model on GPU¶

In [55]:

# put model and inputs to cpu
model = model.to('cuda')
inputs = inputs.to('cuda')

# get mean time
with torch.no_grad():
  traced_model = torch.jit.trace(model, [inputs.input_ids,inputs.token_type_ids,inputs.attention_mask])
  torchscript_mean_time_gpu = round(np.mean([timer(traced_model,inputs.input_ids,inputs.token_type_ids,inputs.attention_mask) for _ in range(100)]))
print(f'{torchscript_mean_time_gpu}ms')

74ms

4.3 Results with TorchScript¶

Let's compage the GPU results with the ones with CPU (check link to the CPU notebook at the beginning of this notebook.

In [56]:

import pandas as pd

raw_data = {
            'Latency on CPU (ms)': [mean_time_cpu, onnx_mean_time_cpu, torchscript_mean_time_cpu],
            'Latency on GPU (ms)': [mean_time_gpu, onnx_mean_time_gpu, torchscript_mean_time_gpu],
            }

df = pd.DataFrame(raw_data,
                  index=pd.Index(['PyTorch (without pipeline)','ONNX Runtime', 'TorchScript']),
                  columns=pd.Index(['Latency on CPU (ms)','Latency on GPU (ms)']))

df

Out[56]:

	Latency on CPU (ms)	Latency on GPU (ms)
PyTorch (without pipeline)	889.07	44.42
ONNX Runtime	711.50	37.54
TorchScript	849.00	74.00

In [57]:

import numpy as np
import matplotlib.pyplot as plt

labels = ['CPU','GPU']
data = [
        [mean_time_cpu, mean_time_gpu],
        [onnx_mean_time_cpu, onnx_mean_time_gpu],
        [torchscript_mean_time_cpu, torchscript_mean_time_gpu]
        ]

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

X = np.arange(2)
ax.bar(X - 0.2, data[0], color = 'r', width = 0.2, label='PyTorch (without pipeline)')
ax.bar(X, data[1], color = 'g', width = 0.2, label='ONNX Runtime')
ax.bar(X + 0.2, data[2], color = 'b', width = 0.2, label='TorchScript')

# axes and title
x = np.arange(len(labels))  # the label locations
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel('Latency (ms)')
ax.set_title('Inference latency (CPU vs GPU) in Google Colab')

leg = ax.legend();

In [58]:

import numpy as np
import matplotlib.pyplot as plt

labels = ['GPU']
data = [mean_time_gpu, onnx_mean_time_gpu, torchscript_mean_time_gpu]

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

X = np.arange(1)
ax.bar(X - 0.2, data[0], color = 'r', width = 0.2, label='PyTorch (without pipeline)')
ax.bar(X, data[1], color = 'g', width = 0.2, label='ONNX Runtime')
ax.bar(X + 0.2, data[2], color = 'b', width = 0.2, label='TorchScript')

# axes and title
x = np.arange(len(labels))  # the label locations
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_ylabel('Latency (ms)')
ax.set_title('Inference latency (GPU) in Google Colab')

leg = ax.legend();

TorchScript does not help improve latency on GPU.
On GPU, it's better to use ONNX Runtime.

Fast inference for Hugging Face tasks models on GPU (for example: QA model)¶

Notebook overview¶

Objective¶

Method for inference¶

References¶

System overview¶

CPU¶

GPU¶

Installation¶

QA model¶

1. Check the model¶

1.1 Without pipeline¶

1.2 With pipeline¶

2. Inference time | PyTorch model¶

2.1 PyTorch model (without pipeline)¶

1. Tokenize the inputs¶

2. Model on GPU¶

2.2 PyTorch model (with pipeline)¶

Model on GPU¶

2.3 Results with PyTorch¶

3. Inference time | ONNX Runtime¶

3.1 Old method | convert_graph_to_onnx.py¶

3.1.1 Tokenize the inputs¶

3.1.2 Model on GPU¶

Installation¶

Convert the transformer model to its quantized onnx version¶

Import the onnx quantized version of the model¶

Run the onnx model¶

3.1.4 Results with ONNX Runtime¶

3.2 New method | transformers.onnx¶

4. Inference time | TorchScript¶

4.1 Tokenize the inputs¶

4.2 Model on GPU¶

4.3 Results with TorchScript¶

END¶