Replicate 2.2ms inference time on BERT

Hello,

I’m trying to replicate the steps and results described in this article: https://devblogs.nvidia.com/nlu-with-tensorrt-bert/. It shows how to optimize the BERT model with TensorRT in order to achieve a 2.2 ms inference time on the SQuAD task using a T4 GPU.

So far, I’ve managed to run the scripts build_examples.sh, bert_builder.py, and bert_inference.py from the TensorRT/demo/BERT/python directory of the release/5.1 branch of https://github.com/NVIDIA/TensorRT, outside of a Docker container. I assume the scripts work as intended, since I get the answer “high performance deep learning inference platform” when I run the command:

python python/bert_inference.py -e bert_base_128.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is TensorRT?" -v /workspace/models/fine-tuned/bert_tf_v2_base_fp16_128_v2/vocab.txt -b 1

However, the output also reports “Running inference in 204.970 Sentences/Sec”, which corresponds to roughly 4.9 ms per sentence, about twice the claimed inference time of 2.2 ms.
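(Just to show the arithmetic behind that number, here is a quick sanity check in Python; the 204.970 figure is taken straight from the script’s output.)

    # Convert the reported throughput into per-sentence latency (batch size 1).
    sentences_per_sec = 204.970
    latency_ms = 1000.0 / sentences_per_sec
    print(f"{latency_ms:.2f} ms per sentence")  # ~4.88 ms, vs. the 2.2 ms reported in the article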

Machine configuration (Google Cloud VM):

  • 16 vCPUs
  • 60 GB memory
  • GPU: 1 NVIDIA Tesla T4

I’m working with the bert_tf_v2_base_fp16_128_v2 model and a batch size of 1.

Can you please tell me what I’m missing in order to get the 2.2ms inference time?

Thank you!

I was able to replicate the 2.2 ms inference time using the C++ implementation.
Instructions here: https://github.com/NVIDIA/TensorRT/tree/release/6.0/demo/BERT

&&&& RUNNING TensorRT.sample_bert # build/sample_bert -d /root/tensorrt/squad_output_path -d /root/tensorrt/data_dump --fp16 --nheads 12
	[09/10/2019-18:33:48] [I] Number of parameters: 14
	[09/10/2019-18:33:48] [I] Number of buffers: 3
	[09/10/2019-18:33:48] [I] Number of parameters: 199
	[09/10/2019-18:33:50] [I] Building Engine...
	[09/10/2019-18:36:18] [I] [TRT] Detected 3 inputs and 1 output network tensors.
	[09/10/2019-18:36:19] [I] Done building engine.
	[09/10/2019-18:36:19] [I] Run 0; Total: 2.27168ms Comp.only: 2.23968ms
	[09/10/2019-18:36:19] [I] Run 1; Total: 2.22134ms Comp.only: 2.19075ms
	[09/10/2019-18:36:19] [I] Run 2; Total: 2.20317ms Comp.only: 2.17648ms
	[09/10/2019-18:36:19] [I] Run 3; Total: 2.19834ms Comp.only: 2.17386ms
	[09/10/2019-18:36:19] [I] Run 4; Total: 2.20205ms Comp.only: 2.17117ms
	[09/10/2019-18:36:19] [I] Run 5; Total: 2.21222ms Comp.only: 2.184ms
	[09/10/2019-18:36:19] [I] Run 6; Total: 2.20266ms Comp.only: 2.17357ms
	[09/10/2019-18:36:19] [I] Run 7; Total: 2.19555ms Comp.only: 2.17088ms
	[09/10/2019-18:36:19] [I] Run 8; Total: 2.2353ms Comp.only: 2.2023ms
	[09/10/2019-18:36:19] [I] Run 9; Total: 2.34566ms Comp.only: 2.31424ms
	B=1 S=128 MAE=4.079233389348e-03 MaxDiff=1.364612579346e-02  Runtime(total avg)=2.228797ms Runtime(comp ms)=2.199693
	&&&& PASSED TensorRT.sample_bert # build/sample_bert -d /root/tensorrt/squad_output_path -d /root/tensorrt/data_dump --fp16 --nheads 12
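
(As a quick cross-check, the Runtime(total avg) in the summary line is simply the mean of the ten per-run “Total” timings printed above; a minimal Python sketch:)

    # Average the per-run "Total" timings from the sample_bert log above.
    totals_ms = [2.27168, 2.22134, 2.20317, 2.19834, 2.20205,
                 2.21222, 2.20266, 2.19555, 2.23530, 2.34566]
    avg_ms = sum(totals_ms) / len(totals_ms)
    print(f"Runtime(total avg) = {avg_ms:.6f} ms")  # 2.228797 ms, matching the summary line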

Thank you for your response! I had indeed only tried the Python implementation. I’ll try the C++ one and hopefully I’ll get the same results.