I’m trying to replicate the steps and results described in this article https://devblogs.nvidia.com/nlu-with-tensorrt-bert/ which shows how to optimize the BERT model using TensorRT to achieve a 2.2ms inference time on the SQuAD task on a T4 GPU.
So far, I’ve managed to run the scripts build_examples.sh, bert_builder.py, and bert_inference.py from the repo https://github.com/NVIDIA/TensorRT/tree/release/5.1/demo/BERT/python outside of a Docker container. I assume the scripts work as intended, since I get the answer “high performance deep learning inference platform” when I run the command:
python python/bert_inference.py -e bert_base_128.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is TensorRT?" -v /workspace/models/fine-tuned/bert_tf_v2_base_fp16_128_v2/vocab.txt -b 1
However, the output also reports “Running inference in 204.970 Sentences/Sec”, which corresponds to about 4.9ms per sentence, more than twice the claimed inference time of 2.2ms.
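For reference, this is how I converted the reported throughput into a per-sentence latency. It is only a minimal sketch of the arithmetic: the helper name latency_ms is my own and not part of the demo scripts, and it assumes the reported figure is an average over sentences processed one batch at a time.

```python
def latency_ms(sentences_per_sec: float, batch_size: int = 1) -> float:
    """Average time per batch in milliseconds, given throughput in sentences/sec.

    Hypothetical helper for illustration; not part of the TensorRT BERT demo.
    """
    return 1000.0 * batch_size / sentences_per_sec


reported_throughput = 204.970  # "Sentences/Sec" printed by bert_inference.py
claimed_ms = 2.2               # latency quoted in the article

measured_ms = latency_ms(reported_throughput)
print(f"{measured_ms:.2f} ms per sentence")                    # 4.88 ms per sentence
print(f"{measured_ms / claimed_ms:.2f}x the claimed latency")  # 2.22x the claimed latency
```

With a batch size of 1, the time per batch and the time per sentence are the same number, so the gap is roughly a factor of 2.2.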
Machine configuration (Google Cloud VM):
- 16 vCPUs
- 60 GB memory
- GPU: 1 NVIDIA Tesla T4
I am using the model bert_tf_v2_base_fp16_128_v2 with a batch size of 1.
Can you please tell me what I’m missing in order to get the 2.2ms inference time?