Significant latency regression when compiling deformable attention layers in TensorRT 10.5

Description

There is a significant latency regression in a standard deformable attention layer when upgrading from TRT 8.6.1.6 to TRT 10.5.0.18.

Benchmarking engines compiled from the same ONNX model with trtexec shows mean GPU compute time rising from about 1.91 ms on TRT 8.6 to about 3.45 ms on TRT 10.5 (roughly a 1.8x slowdown).

TRT 8.6:

[I] GPU Compute Time: min = 1.84009 ms, max = 2.03879 ms, mean = 1.90564 ms, median = 1.90161 ms, percentile(90%) = 1.91577 ms, percentile(95%) = 1.92102 ms, percentile(99%) = 2.03264 ms
[I] Total GPU Compute Time: 1.84276 s

TRT 10.5:

[I] GPU Compute Time: min = 3.37921 ms, max = 3.78674 ms, mean = 3.44999 ms, median = 3.44165 ms, percentile(90%) = 3.52563 ms, percentile(95%) = 3.53076 ms, percentile(99%) = 3.55737 ms
[I] Total GPU Compute Time: 2.79104 s

Environment

TensorRT Version: 10.5.0.18
GPU Type: A4000
Nvidia Driver Version: 535.183.01
CUDA Version: 12.2
CUDNN Version: 8.9
Operating System + Version: Linux x86, Ubuntu 20.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): N/A

Relevant Files

Simple Python file to reproduce our TensorRT compilation process: deformable_attn_compile.py · GitHub

ONNX & TRT engine files:
trt_bug_report2.zip (3.5 MB)

  • deformable_attention_10k_queries.onnx: Deformable attention layer ONNX model file. Opset 17, ONNX version 1.16.0
  • deformable_attention_10k_queries_trt105.trt: Deformable attention layer TRT engine for TRT 10.5.0.18.
  • deformable_attention_10k_queries_trt86.trt: Deformable attention layer TRT engine for TRT 8.6.1.6.

Steps To Reproduce

The provided Python file can be used to compile the ONNX model. It uses only the standard TensorRT Python package and has no dependencies beyond Python 3.10. We have already compiled engines for both TRT 10.5 and TRT 8.6 and included them above.

# The same script works with both the TRT 8.6 and TRT 10.5 Python packages.
python3 deformable_attn_compile.py
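For reference, below is a minimal sketch of the kind of build flow the script follows, assuming the usual ONNX parser + builder-config path; the file names and the 4 GiB workspace limit are placeholders, and the actual deformable_attn_compile.py may differ in details.

# Minimal sketch of an ONNX -> engine build (placeholder paths and workspace size).
import tensorrt as trt

ONNX_PATH = "deformable_attention_10k_queries.onnx"
ENGINE_PATH = "deformable_attention_10k_queries.trt"

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is required on TRT 8.6; on TRT 10.x it is deprecated and a no-op.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GiB workspace

# build_serialized_network is available in both TRT 8.6 and TRT 10.5.
serialized_engine = builder.build_serialized_network(network, config)
with open(ENGINE_PATH, "wb") as f:
    f.write(serialized_engine)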

Benchmarking with trtexec shows the slowdown.

# TRT 8.6 benchmark
trtexec --loadEngine=deformable_attention_10k_queries_trt86.trt


# TRT 10.5 benchmark
trtexec --loadEngine=deformable_attention_10k_queries_trt105.trt
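Per-layer timings, which may help pinpoint which layers or tactics regressed, can also be dumped with trtexec's standard profiling flags (the same flags should work in both versions):

# Optional: per-layer timing for the TRT 10.5 engine
trtexec --loadEngine=deformable_attention_10k_queries_trt105.trt --dumpProfile --separateProfileRun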

Hi @aboubezari,
Can you please share the ONNX model with us?

Thanks