Strange CNN inference latency behavior with CUDA and TensorRT

Description

When running CNN inference on my GPU, I see a strange latency increase whenever the previous inference was 15 s or more ago. The jitter also increases sharply. Please have a look at the included graph.
I’m using the TensorRT backend of ONNX Runtime. I double-checked with its CUDA module, which shows the same latency anomaly but with a higher baseline latency.
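
A minimal sketch of how the idle-time dependence described above could be measured in isolation. It is not the code used for the graph: `run_inference` is a hypothetical callable that must block until the result is available, and the function names and the `repeats` parameter are assumptions made for illustration.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List, Tuple


def sweep_idle_gaps(
    run_inference: Callable[[], None],
    idle_gaps_s: Iterable[float],
    repeats: int = 5,
) -> Dict[float, List[float]]:
    """Time run_inference() after leaving the device idle for each gap."""
    results: Dict[float, List[float]] = {}
    for gap in idle_gaps_s:
        samples: List[float] = []
        for _ in range(repeats):
            time.sleep(gap)  # nothing is submitted to the GPU during this gap
            t0 = time.perf_counter()
            run_inference()  # hypothetical: must return only when the output is ready
            samples.append(time.perf_counter() - t0)
        results[gap] = samples
    return results


def summarize(results: Dict[float, List[float]]) -> Dict[float, Tuple[float, float]]:
    """Return {gap: (mean latency, population standard deviation)} in seconds."""
    return {
        gap: (statistics.fmean(s), statistics.pstdev(s))
        for gap, s in results.items()
    }
```

Calling `summarize(sweep_idle_gaps(run_inference, [0.1, 1, 5, 15, 30]))` would give one mean and one jitter figure per idle gap, which makes a step around 15 s easy to spot if it exists.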

Things which I took care of:

  1. Fixing GPU clocks and fan speed at maximum.
  2. Working with isolated CPUs.
  3. Disabling all energy-saving options for the PC platform.
  4. Using a separate GPU for display output.

I’d like to know whether there is a hardware scheduler on the GPU that might be responsible for these dropouts. Do you have any idea what could cause such strange behaviour?


The term “Input Latency” refers to the period between two consecutive inferences.
The term “Output Latency” refers to the overall latency of a single inference.
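
As a sketch of how these two quantities can be derived from recorded timestamps: the function name is hypothetical, and it assumes that “Input Latency” is measured from the start of one inference to the start of the next, which the definition above does not state explicitly.

```python
from typing import List, Tuple


def input_output_latency(
    starts: List[float], ends: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (input_latency, output_latency) in the unit of the timestamps.

    input_latency[i]  = starts[i + 1] - starts[i]  (period between two inferences)
    output_latency[i] = ends[i] - starts[i]        (duration of inference i)
    """
    if len(starts) != len(ends):
        raise ValueError("starts and ends must have the same length")
    # Assumption: the period is taken start-to-start.
    input_latency = [b - a for a, b in zip(starts, starts[1:])]
    output_latency = [e - s for s, e in zip(starts, ends)]
    return input_latency, output_latency
```

With n inferences this yields n output latencies and n - 1 input latencies, so a plot of one against the other has to drop either the first or the last inference.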

Environment

TensorRT Version: 8.4.2.4
GPU Type: Quadro RTX 4000
Nvidia Driver Version: 515.43.04
CUDA Version: 11.7
CUDNN Version: 8.4.1
Operating System + Version: Debian 11, Kernel 5.10 PREEMPT_RT

Steps To Reproduce

It’s not easy to reproduce. If necessary, I can provide a C++ snippet for that.

Hi,

Could you please try the latest TensorRT version, 8.5.1, and let us know if you still face this issue?
If possible, could you also share a minimal repro model/script with us for better debugging?

Thank you.

Hi,
Thank you for your reply.

I doubt the problem comes from the TensorRT version, because my cross-check measurements with the ONNX Runtime CUDA module show the same behaviour, only with more jitter.

In my opinion, we are looking for a scheduling effect or something similar in the hardware or the driver.

Upgrading to the most recent version of TensorRT will take a lot of effort. I’ll give it a shot, but it will be a while until I have results.

Creating a repro for debugging will take some time, too, but I will come back when it’s done.

Hi,
I created a minimal example for testing. It’s based on the TensorRT samples; I just altered the code of the MNIST-ONNX sample. Using this example, I could reproduce the behaviour. With my RTX 3080, the latency increase is worse than with the Quadro RTX 4000.
I tested on a completely different machine, with energy saving enabled. The measurements were carried out with the more compute-intensive ResNet-101. The graph contains only 75 samples, but the behaviour is clearly visible.

Environment

TensorRT Version: 8.5.2.2
GPU Type: GeForce RTX3080
Nvidia Driver Version: 520.61.05
CUDA Version: 11.8
CUDNN Version: 8.7.0
Operating System + Version: Debian 11, Kernel 6.0 SMP PREEMPT_DYNAMIC

trt_minimal_v2.7z (45.3 KB)

TL;DR:
Different system, different GPU, and the most recent driver and SDK versions. The problem still exists.

I carried out some measurements on a Windows 10 machine:

It shows the same behaviour with ResNet-18.

Environment

TensorRT Version: 8.5.2.2
GPU Type: Quadro RTX 4000
Operating System + Version: Windows 10

Hi,

Sorry for the delayed response. We are looking into this internally and will get back to you.

Thank you.

Hi,

Thank you for investigating this! I’m really looking forward to your findings.

I would have liked to try a recent Tesla GPU, but unfortunately we don’t have one because of the price.
It would be interesting for us to know whether they perform better in these scenarios, because our final product might allow us to afford one.

Hi,

It’s likely not a TRT bug but a CUDA driver issue.
Could you share details on how you locked the GPU clocks? Did you use sudo nvidia-smi -lgc <sm_clk> to lock them?

Thank you.

Hi,

Really interesting!

I provided the script “launchTest.sh”, which I tweaked until the GPU no longer lowered its clocks.
For the RTX 3080 on Linux, it is:

id=0
nvidia-smi -i $id -pm ENABLED

# Fan at 100%
nvidia-settings -a "[gpu:$id]/GPUFanControlState=1"
nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=100"
nvidia-settings -a "[fan:1]/GPUTargetFanSpeed=100"

# PowerMizer (maximum performance)
nvidia-settings -a "[gpu:$id]/GPUPowerMizerMode=1"

# Clocks (find the maximum clocks for your card via: nvidia-smi -q -d SUPPORTED_CLOCKS)
nvidia-smi -ac 9501,2100 -i $id
nvidia-smi --lock-gpu-clocks=2100 -i $id
nvidia-smi --lock-memory-clocks=9501 -i $id
nvidia-smi -q -d CLOCK -i $id
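
Whether a lock like the one requested above actually holds can be checked by polling the current clocks while the test runs. The following is only a sketch of one way to do that, not part of launchTest.sh: the helper names are hypothetical, it assumes that `nvidia-smi --query-gpu=clocks.sm,clocks.mem --format=csv,noheader,nounits` prints one "sm, mem" line in MHz per GPU, and it treats exact equality with the target as "locked", which is a simplification because the driver may snap to the nearest supported clock.

```python
import subprocess
from typing import Iterable, List, Tuple


def parse_clocks(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'sm, mem' lines (MHz), one line per GPU, into integer pairs."""
    pairs = []
    for line in csv_text.strip().splitlines():
        sm, mem = (int(field.strip()) for field in line.split(","))
        pairs.append((sm, mem))
    return pairs


def query_clocks(gpu_id: int = 0) -> List[Tuple[int, int]]:
    """Read the current SM and memory clocks (needs an installed NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_id),
         "--query-gpu=clocks.sm,clocks.mem",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_clocks(out)


def is_locked(samples: Iterable[Tuple[int, int]], target_sm: int, target_mem: int) -> bool:
    """True if every sampled (sm, mem) pair equals the requested clocks."""
    return all(sm == target_sm and mem == target_mem for sm, mem in samples)
```

Sampling `query_clocks(0)` once during an idle gap and once right after an inference, then passing the results to `is_locked(..., 2100, 9501)`, would show whether the clocks drop between inferences despite the lock.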

Hi @BitShifter,

Could you please share the complete output logs (stdout) from executing the above script?
We want to see whether the clocks are really locked or whether locking is prohibited because it is a GeForce GPU.

Thank you.

Hi,

I did it for both GPUs for completeness.
logrtx4000.log (7.9 KB)
logrtx3080.log (7.9 KB)

Hi @spolisetty,

It’s been a while since your last post. Did you find anything?

Thanks :)