TensorRT on RTX 3080 slows down over iterations

We are using TensorRT on an RTX 3080 GPU to run inference on a UNet model.

At the beginning, the TensorRT UNet inference takes only 1 ms. However, after some iterations it slows down, eventually reaching about 3 ms per inference.

The GPU temperature is 35 °C, so it is not overheating. GPU usage stays below 30%, and GPU memory usage is under 1500 MB. Could someone provide some insight into this issue? Thanks in advance.

We use:
cuDNN 8.2.3
TensorRT 8.4
Windows 10
CUDA 11.6
RTX 3080

The code is as below (context is the TensorRT IExecutionContext):

cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, input_size * sizeof(float), cudaMemcpyHostToDevice, stream));
context->enqueueV2(buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], output_size * sizeof(float), cudaMemcpyDeviceToHost, stream));
CHECK(cudaStreamSynchronize(stream));  // wait for the async copies and inference to finish before timing/reading output

Could you share the model, script, profiler, and performance output (if not shared already) so that we can help you better?
Alternatively, you can try running your model with trtexec command.

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the link below for more details:


We can reproduce the strange result with MobileNet.
The ONNX model is as follows:

We use trtexec to convert and run inference.
To convert:
trtexec --saveEngine=mobile.trt --onnx=mobilenetv2-7.onnx

To run inference:
trtexec --loadEngine=mobile.trt --dumpProfile --duration=0 --warmUp=0 --sleepTime=20 --idleTime=20 --verbose --iterations=N

When N=10, inference takes about 2.1 ms.
When N=100, inference takes about 2.1 ms.
When N=1000, inference takes about 3.6 ms.

As N becomes large, the model slows down. Really strange.


We couldn't reproduce this behavior. Could you please share verbose logs for all of the above?

Thank you.