Strange CNN inference latency behavior with CUDA and TensorRT

BitShifter · December 19, 2022, 2:54pm

Description

When running CNN inference on my GPU, I see a strange latency increase whenever the previous inference was >= 15 s ago. The jitter also explodes. Please have a look at the included graph.
I’m using the TensorRT backend of ONNX Runtime. I double-checked with its CUDA module, which shows the same latency anomaly but with a higher baseline latency.

Things which I took care of:

Fixing GPU clocks & fan speed at maximum.
Working with isolated CPUs
Disabling all energy saving options for the PC platform.
Separate GPU for display output
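
The host-side items in this list could be spot-checked with something like the sketch below. It assumes that "energy saving disabled" shows up as the performance cpufreq governor and that the kernel exposes the isolated-CPU list under /sys/devices/system/cpu/isolated. The function name check_host_tuning is made up for illustration and is not part of my setup.

```shell
#!/usr/bin/env bash
# Sketch only: prints the isolated CPU list and every CPU whose cpufreq
# governor is not "performance" (an assumed stand-in for "no energy saving").
# The sysfs root is a parameter so the function can be tried on a fake tree.
check_host_tuning() {  # usage: check_host_tuning [sysfs_cpu_root]
  local root="${1:-/sys/devices/system/cpu}"
  local isolated gov_file gov
  isolated="$(cat "$root/isolated" 2>/dev/null || true)"
  echo "isolated CPUs: ${isolated:-none}"
  for gov_file in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -r "$gov_file" ] || continue   # skip CPUs without cpufreq support
    gov="$(cat "$gov_file")"
    if [ "$gov" != "performance" ]; then
      echo "not in performance mode: $gov_file ($gov)"
    fi
  done
}
```

Calling check_host_tuning without arguments reads the real sysfs tree. It only reports and does not change any setting.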

I’d like to know whether there is a hardware scheduler on the GPU that might be responsible for these dropouts. Do you have any idea what could cause such strange behaviour?

The term “Input Latency” refers to the period between two consecutive inferences.
The term “Output Latency” refers to the overall latency of the inference.
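
To make the two terms concrete, here is a small sketch of how they could be derived from a timestamp log. The log format (one inference per line, start and end time in milliseconds) and the function name latency_report are assumptions for illustration. "Input latency" is read here as the start-to-start period, and the idle gap since the previous inference ended is printed as well, since that gap is what seems to trigger the effect.

```shell
#!/usr/bin/env bash
# Sketch only. Assumed input: one line per inference, "<start_ms> <end_ms>".
# Output columns: index, idle gap, input latency (start-to-start period),
# output latency (end - start). The first line has no predecessor, so its
# gap and period are printed as "-".
latency_report() {  # usage: latency_report <logfile>
  awk '
    NR == 1 { printf "%d - - %.3f\n", NR, $2 - $1 }
    NR > 1  { printf "%d %.3f %.3f %.3f\n", NR, $1 - prev_end, $1 - prev_start, $2 - $1 }
    { prev_start = $1; prev_end = $2 }
  ' "$1"
}
```

Plotting the output latency column against the idle gap column should show whether the increase really sets in at a fixed idle time.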

Environment

TensorRT Version: 8.4.2.4
GPU Type: Quadro RTX 4000
Nvidia Driver Version: 515.43.04
CUDA Version: 11.7
CUDNN Version: 8.4.1
Operating System + Version: Debian 11, Kernel 5.10 PREEMPT_RT

Steps To Reproduce

It’s not easy to reproduce. If necessary, I can provide a C++ snippet for that.

spolisetty · December 21, 2022, 9:58am

Hi,

Could you please try on the latest TensorRT version 8.5.1 and let us know if you still face this issue.
If possible could you please share with us the minimum issue repro model/script for better debugging.

Thank you.

BitShifter · December 21, 2022, 10:09am

Hi,
thank you for your reply.

I doubt the problem is caused by the TensorRT version in use, because my cross-check measurements with the ONNX Runtime CUDA module show the same behaviour, only with more jitter.

In my opinion, we are looking for a scheduling effect or something similar in the hardware or driver.

Upgrading to the most recent version of TensorRT is a lot of work. I’ll give it a shot, but it will take some time until I have results.

Creating a debugging repro will take some time, too - but I will come back when it’s done.

BitShifter · December 23, 2022, 9:47am

Hi,
I created a minimal example for testing. It’s based on the TensorRT samples; I just altered the code of the MNIST-ONNX sample. With this example I could reproduce the behaviour. On my RTX 3080, the latency increase is worse than on the Quadro RTX 4000.
I tested on a completely different machine, with energy saving enabled. The measurements were carried out with the more compute-intensive ResNet-101. The graph contains only 75 samples, but the behaviour is clearly visible.

Environment

TensorRT Version: 8.5.2.2
GPU Type: GeForce RTX3080
Nvidia Driver Version: 520.61.05
CUDA Version: 11.8
CUDNN Version: 8.7.0
Operating System + Version: Debian 11, Kernel 6.0 SMP PREEMPT_DYNAMIC

trt_minimal_v2.7z (45.3 KB)

TLDR:
Different system, different GPU, and the most recent driver and SDK versions. The problem still exists.

BitShifter · January 9, 2023, 9:46am

I carried out some measurements on a Windows 10 machine:

It shows the same behaviour with ResNet-18.

Environment

TensorRT Version: 8.5.2.2
GPU Type: Quadro RTX 4000
Operating System + Version: Windows 10

spolisetty · January 9, 2023, 6:38pm

Hi,

Sorry for the delay in response. We are looking into this internally. Will get back to you.

Thank you.

BitShifter · January 10, 2023, 7:14am

Hi,

thank you for investigating this! I’m really looking forward to your findings.

I would have liked to try a recent Tesla GPU, but unfortunately we don’t have one because of the price.
It would be interesting for us to know whether they perform better in these scenarios, because our final product might allow us to afford one.

spolisetty · January 10, 2023, 10:04am

Hi,

It’s likely not a TRT bug but a CUDA driver issue.
Could you share details on how you locked the GPU clocks? Did you use sudo nvidia-smi -lgc <sm_clk> to lock it?

Thank you.

BitShifter · January 10, 2023, 11:20am

Hi,

really interesting!

I provided the script “launchTest.sh”, which I tweaked until the GPU never really lowered its clocks.
For the RTX 3080 on Linux it is:

id=0
nvidia-smi -i $id -pm ENABLED

#fan 100%
nvidia-settings -a '[gpu:'$id']/GPUFanControlState=1'
nvidia-settings -a '[fan:0]/GPUTargetFanSpeed=100'
nvidia-settings -a '[fan:1]/GPUTargetFanSpeed=100'

#powermizer (Maximum Performance)
nvidia-settings -a '[gpu:'$id']/GPUPowerMizerMode=1'

#clocks (Please find the maximum clocks for your card via <nvidia-smi -q -d SUPPORTED_CLOCKS>)
nvidia-smi -ac 9501,2100 -i $id
nvidia-smi --lock-gpu-clocks=2100 -i $id
nvidia-smi --lock-memory-clocks=9501 -i $id
nvidia-smi -q -d CLOCK -i $id
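
To see whether the clocks stay locked during the idle gap, and not only right after the script ran, they could be sampled continuously while the test is running. The following is a rough sketch and not part of launchTest.sh: the function names, the 200 ms interval and the tolerance argument are my own assumptions, and the reported SM clock may legitimately sit a few MHz below the requested value.

```shell
#!/usr/bin/env bash
# Sketch only: function names, sampling interval and tolerance handling
# are assumptions for illustration.

# Stream "timestamp, pstate, SM clock, memory clock" samples for one GPU.
sample_clocks() {  # usage: sample_clocks <gpu_id>
  nvidia-smi -i "$1" \
    --query-gpu=timestamp,pstate,clocks.sm,clocks.mem \
    --format=csv,noheader,nounits -lms 200
}

# Read such samples on stdin and report every SM clock reading that is
# more than <tol_mhz> away from the expected value. Returns non-zero if
# at least one sample deviates.
check_locked() {  # usage: check_locked <expected_sm_mhz> [tol_mhz]
  awk -F', *' -v want="$1" -v tol="${2:-0}" '
    {
      diff = $3 - want
      if (diff < 0) diff = -diff
      if (diff > tol) { bad++; print "deviation: " $0 }
    }
    END {
      printf "%d samples, %d off target\n", NR, bad + 0
      exit (bad > 0)
    }
  '
}
```

One way to use it: run sample_clocks 0 > clocks.csv in a second terminal during the measurement, stop it afterwards, and then run check_locked 2100 15 < clocks.csv. Any deviation line whose timestamp falls into an idle gap would point at the clocks or the P-state dropping despite the lock.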

spolisetty · January 11, 2023, 5:32am

Hi @BitShifter,

Could you please share the complete output logs (stdout) from executing the above script?
We want to see if the clock is really locked or if it is prohibited because it is a GeForce GPU.

Thank you.

BitShifter · January 11, 2023, 8:17am

Hi,

I did it for both GPUs for completeness.
logrtx4000.log (7.9 KB)
logrtx3080.log (7.9 KB)

BitShifter · January 30, 2023, 6:56am

Hi @spolisetty,

it’s been a while since your last post. Could you find something?

Thanks :)

BitShifter · March 8, 2023, 8:00am

Hi @spolisetty,

two months have gone by. Have you found anything? Is there a possible fix or workaround?

BitShifter · January 24, 2024, 8:30am

Hello,

could you find the root cause or a workaround?
Is the problem fixed in newer driver versions?

Thank you
