Performance regression found in TensorRT 8.6.1 when running BERT on a T4 GPU

Description

We are using TensorRT via Triton to run inference on a BERT classification model.
Our model is simply a BERT base plus a classifier head.
Previously, on TRT 8.5.2, the tail of the trtexec log showed:

Average on 10 runs - GPU latency: 1.52078 ms - Host latency: 1.53953 ms (enqueue 1.5259 ms)

But after upgrading to TRT 8.6.1, the tail of the trtexec log shows:

Average on 10 runs - GPU latency: 4.13606 ms - Host latency: 4.15786 ms (enqueue 0.541821 ms)

That is almost a 3x latency regression, even though the source ONNX model is the same.
It seems the --fp16 flag didn't take effect: profiling with nsys shows that 8.6.1 is always launching fp32 CUDA kernels.

In the Nsight traces you can see that 8.6.1 is running fp32 kernels, while 8.5.2 is running fp16 kernels.

The trtexec logs and Nsight captures are attached.
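
For reference, a kernel-level trace like the attached .nsys-rep files can be captured along these lines (the output name below is just a placeholder; the engine file is the one built in the repro steps):

nsys profile -o trt861 --force-overwrite=true trtexec --loadEngine=model.test.op12.fp16.861.trt --shapes=input_ids:1x64,input_mask:1x64

Filtering the CUDA kernel rows in the resulting timeline makes it easy to compare which precision the two engines actually dispatch.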

Environment

TensorRT Version: 8.6.1
GPU Type: T4
Nvidia Driver Version: 525.105.17
CUDA Version: 12.0
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
TextAnalyzerV2.t4.852.trt.log (1.4 MB)
TextAnalyzerV2.t4.861.trt.log (1.3 MB)
trt852.nsys-rep (735.7 KB)
trt861.nsys-rep (1.2 MB)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

Please share the model, script, profiler output, and performance numbers (if not shared already) so that we can help you better.

Alternatively, you can try running your model with the trtexec command, for example:
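
(The engine file and input shapes below are only illustrative; substitute your own.)

trtexec --loadEngine=your_model.trt --shapes=input_ids:1x64,input_mask:1x64 --warmUp=500 --iterations=100

--warmUp runs inference for the given number of milliseconds before timing begins, and --iterations sets a minimum number of timed runs, so the reported averages are not skewed by one-time initialization.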

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre and post-processing overhead.
Please refer to the below links for more details:

Thanks!

Repro steps:

  1. Start the docker container: docker run --gpus=0 --rm -it --name test_trt1 -v /home/zhn:/home/zhn nvcr.io/nvidia/tensorrt:23.07-py3
  2. Run trtexec: trtexec --onnx=model.test.op12.fp16.onnx --saveEngine=model.test.op12.fp16.861.trt --fp16 --memPoolSize=workspace:10000 --minShapes=input_ids:1x1,input_mask:1x1 --maxShapes=input_ids:1x512,input_mask:1x512 --optShapes=input_ids:1x64,input_mask:1x64 --device=0 (a variant that also dumps per-layer precisions is shown right after these steps)
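
To double-check which precision the 8.6.1 builder actually assigns to each layer, the same build can be rerun with layer-info dumping enabled (the JSON file name here is just an example):

trtexec --onnx=model.test.op12.fp16.onnx --fp16 --memPoolSize=workspace:10000 --minShapes=input_ids:1x1,input_mask:1x1 --maxShapes=input_ids:1x512,input_mask:1x512 --optShapes=input_ids:1x64,input_mask:1x64 --profilingVerbosity=detailed --dumpLayerInfo --exportLayerInfo=layers.861.json

If the MatMul/attention layers come out as Float despite --fp16, that would match the fp32 kernels seen in the Nsight trace.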

You can get my ONNX model here: model.test.nomask.op12.fp16.onnx - Google Drive

Hi @AakankshaS, could you reproduce my issue with the repro steps above?

Hi @niuzheng168 ,

We were able to reproduce the issue. Please allow us some time to work on this.
Thank you for reporting it to us.
