Real-Time Natural Language Understanding with BERT Using TensorRT

Originally published at:

Large-scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought about exciting leaps in state-of-the-art accuracy for many natural language understanding (NLU) tasks. Since its release in Oct 2018, BERT (Bidirectional Encoder Representations from Transformers) remains one of the most popular language models and still delivers state-of-the-art accuracy at…

Great work! Does this also work on other GPUs like V100 and K80?
Also, what if I have a PyTorch model?

I got an error when pulling the docker container:

root@ubuntu-gpu-7-200gb:/home/ubuntu/TensorRT/demo/BERT# sh python/
Sending build context to Docker daemon 265.7kB
Step 1/17 : FROM
19.05-py3: Pulling from nvidia/tensorrt
7e6591854262: Pulling fs layer
089d60cb4e0a: Pull complete
9c461696bc09: Pull complete
45085432511a: Pull complete
6ca460804a89: Pull complete
2631f04ebf64: Pull complete
86f56e03e071: Pull complete
234646620160: Downloading [====================================> ] 447.9MB/615.2MB
7f717cd17058: Download complete
e69a2ba99832: Download complete
bc9bca17b13c: Download complete
1870788e477f: Download complete
603e0d586945: Downloading [=============================================> ] 452.2MB/492.7MB
717dfedf079c: Download complete
1035ef613bc7: Download complete
c5bd7559c3ad: Download complete
d82c679b8708: Download complete
059d4f560014: Download complete
f3f14cff44df: Download complete
96502bde320c: Download complete
bc5bb9379810: Download complete
e4d8bb046bc2: Download complete
4e2187010a7c: Download complete
9d62684b94c3: Download complete
e70e61e48991: Download complete
adecb91612fe: Download complete
ba27dafb70e8: Download complete
16bde716c9b2: Download complete
476faeed0740: Download complete
5af7c8a6b101: Download complete
960591fee98d: Download complete
0dd138c184ff: Download complete
7ef953567062: Downloading
bd9a54f5a193: Waiting
144852c40661: Waiting
171a26eec2d4: Waiting
999acb71c4df: Waiting
3f301e4ba386: Waiting
3fc30e0f9cba: Waiting
38d1459042f4: Waiting
aafa1a9d16eb: Waiting
unauthorized: authentication required
Unable to find image 'bert-tensorrt:latest' locally
docker: Error response from daemon: pull access denied for bert-tensorrt, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.

Any ideas?

Good one, so helpful.

The instructions include:

python python/ -m /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2/model.ckpt-8144 -o bert_base_384.engine -b 1 -s 384 -c /workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2

However, the defaults appear to be configured for BERT large. The following change allows all of the steps to complete without error:

+++ b/demo/BERT/python/
@@ -16,9 +16,9 @@

# Setup default parameters (if no command-line parameters given)

SCRIPT=$(readlink -f "$0")
SCRIPT_DIR=$(dirname ${SCRIPT})

Great work. I ran into the following problem, running the fourth step above:

FileNotFoundError: [Errno 2] No such file or directory: '/workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2/bert_config.json'

Hi Ted! It seems you got past my issue of the "/workspace/.." directory not being found. How did you get past that?

OK, I solved my own problem. It works great now!

I had two issues: 1) the example script downloads a different model, so you might need to adjust it, and 2) it can take a while to create the "engine" file (at least it did for me) :)

I solved it by downloading the right file and fixing the example.
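
For anyone hitting the same FileNotFoundError, here is a minimal pre-flight sketch that checks the model directory before the engine build starts. The helper name missing_model_files is hypothetical (it is not part of the demo), and the default file names are simply the ones that appear in the command and error message quoted in this thread; adjust them to whatever your download actually contains.

```python
from pathlib import Path


def missing_model_files(model_dir,
                        checkpoint_prefix="model.ckpt-8144",
                        config_name="bert_config.json"):
    """Return a list of problems found in model_dir (empty list if none).

    Hypothetical helper: the default names are taken from the command and
    error message quoted in this thread, not from the demo's source code.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return [f"model directory not found: {root}"]

    problems = []
    if not (root / config_name).is_file():
        problems.append(f"missing config file: {root / config_name}")
    # A TensorFlow checkpoint prefix normally expands to several files
    # (for example an .index file and one or more .data-* files).
    if not any(root.glob(checkpoint_prefix + "*")):
        problems.append(f"no checkpoint files matching: {root / checkpoint_prefix}*")
    return problems


if __name__ == "__main__":
    model_dir = "/workspace/models/fine-tuned/bert_tf_v2_base_fp16_384_v2"
    for problem in missing_model_files(model_dir):
        print(problem)
```

If this prints anything, the download step probably fetched a different model than the one the build command points at, which matches what I ran into.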

Get your API key from and then try `docker login`.

Really nice work, guys. The explanation states that the input and output of the fully connected layers are B x S x (N * H). However, I have the PyTorch implementation of BERT from NVIDIA, and there the input and output of the fully connected layers appear to be just B x S x H. Below is part of the output of print(model).

(encoder): BertEncoder(
  (layer): ModuleList(
    (0): BertLayer(
      (attention): BertAttention(
        (self): BertSelfAttention(
          (query): Linear(in_features=1024, out_features=1024, bias=True)
          (key): Linear(in_features=1024, out_features=1024, bias=True)
          (value): Linear(in_features=1024, out_features=1024, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (softmax): Softmax(dim=-1)

Also, the BERT config file is:

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
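
One possible reading that would reconcile the two shapes, sketched below: if H in the article denotes the size of a single attention head rather than the full hidden size (this is an assumption about the article's notation, not something confirmed in this thread), then N * H is exactly the hidden_size that PyTorch prints. The helper names are hypothetical; the numbers come from the config above, and the batch size and sequence length from the -b 1 -s 384 flags quoted earlier.

```python
def per_head_size(config):
    """Hidden size of one attention head, assuming heads split hidden_size evenly."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size is not divisible by num_attention_heads")
    return hidden // heads


def fc_io_shape(config, batch_size, seq_len):
    """Shape B x S x (N * H), with H taken as the per-head size (assumption)."""
    n = config["num_attention_heads"]
    h = per_head_size(config)
    return (batch_size, seq_len, n * h)


# Values copied from the config file posted above.
config = {"hidden_size": 1024, "num_attention_heads": 16}

print(per_head_size(config))        # 64
print(fc_io_shape(config, 1, 384))  # (1, 384, 1024)
```

Under that reading, in_features=1024 in the Linear layers and B x S x (N * H) describe the same tensor, just with the hidden dimension written as 16 heads of size 64.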

When I run “cd TensorRT/demo/BERT && sh python/”, I get this error: "Error: ‘nvidia/bert_tf_v2_base_fp16_384:2’ could not be found."

I also couldn't find a model with this name on NGC.

Can anyone help? Where can I download the fine-tuned weights for this now? Thanks in advance.