Performance regression found in TensorRT 8.6.1 when running BERT on a T4 GPU

Description

We are using TensorRT via Triton to run inference on a BERT classification model.
Our model is simply a BERT base plus a classifier head.
Previously, on TRT 8.5.2, the tail of the trtexec log showed:

Average on 10 runs - GPU latency: 1.52078 ms - Host latency: 1.53953 ms (enqueue 1.5259 ms)

But after upgrading to TRT 8.6.1, the tail of the trtexec log shows:

Average on 10 runs - GPU latency: 4.13606 ms - Host latency: 4.15786 ms (enqueue 0.541821 ms)

That is almost a 3x latency regression, even though the source ONNX model is the same.
It seems the --fp16 flag didn't take effect: profiling with nsys shows that 8.6.1 is always launching fp32 CUDA kernels.

In the Nsight traces you can see that 8.6.1 is running fp32 kernels, while 8.5.2 is running fp16 kernels.

The trtexec logs and Nsight captures are attached.
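
For reference, a kernel-level trace like the attached .nsys-rep files can be captured along these lines (the output name below is just a placeholder; the engine file is the one built in the repro steps):

nsys profile -o trt861 --force-overwrite=true trtexec --loadEngine=model.test.op12.fp16.861.trt --shapes=input_ids:1x64,input_mask:1x64

Filtering the CUDA kernel rows in the resulting timeline makes it easy to compare which precision the two engines actually dispatch.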

Environment

TensorRT Version: 8.6.1
GPU Type: T4
Nvidia Driver Version: 525.105.17
CUDA Version: 12.0
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
TextAnalyzerV2.t4.852.trt.log (1.4 MB)
TextAnalyzerV2.t4.861.trt.log (1.3 MB)
trt852.nsys-rep (735.7 KB)
trt861.nsys-rep (1.2 MB)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

Please share the model, script, profiler output, and performance numbers (if not shared already) so that we can help you better.

Alternatively, you can try running your model with the trtexec command, for example:
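
(The engine file and input shapes below are only illustrative; substitute your own.)

trtexec --loadEngine=your_model.trt --shapes=input_ids:1x64,input_mask:1x64 --warmUp=500 --iterations=100

--warmUp runs inference for the given number of milliseconds before timing begins, and --iterations sets a minimum number of timed runs, so the reported averages are not skewed by one-time initialization.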

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre and post-processing overhead.
Please refer to the below links for more details:

Thanks!

Repro steps:

  1. Start the docker container: docker run --gpus=0 --rm -it --name test_trt1 -v /home/zhn:/home/zhn nvcr.io/nvidia/tensorrt:23.07-py3
  2. Run trtexec: trtexec --onnx=model.test.op12.fp16.onnx --saveEngine=model.test.op12.fp16.861.trt --fp16 --memPoolSize=workspace:10000 --minShapes=input_ids:1x1,input_mask:1x1 --maxShapes=input_ids:1x512,input_mask:1x512 --optShapes=input_ids:1x64,input_mask:1x64 --device=0 (a variant that also dumps per-layer precisions is shown right after these steps)
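
To double-check which precision the 8.6.1 builder actually assigns to each layer, the same build can be rerun with layer-info dumping enabled (the JSON file name here is just an example):

trtexec --onnx=model.test.op12.fp16.onnx --fp16 --memPoolSize=workspace:10000 --minShapes=input_ids:1x1,input_mask:1x1 --maxShapes=input_ids:1x512,input_mask:1x512 --optShapes=input_ids:1x64,input_mask:1x64 --profilingVerbosity=detailed --dumpLayerInfo --exportLayerInfo=layers.861.json

If the MatMul/attention layers come out as Float despite --fp16, that would match the fp32 kernels seen in the Nsight trace.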

You can get my ONNX model here: model.test.nomask.op12.fp16.onnx - Google Drive

Hi @AakankshaS, could you reproduce my issue with the repro steps above?

Hi @niuzheng168 ,

We were able to reproduce the issue. Please allow us some time to work on this.
Thank you for reporting it to us.
