Inference doesn't stay sped up after initialization unless it runs continuously

Description

I’m not sure how to describe this situation briefly in the title…
I’m building a DLL with TensorRT, called from C#, to speed up the inference process.

From C#, I call init_inference() to set things up and inf_dllTRT_c1() to run inference.
The JSON file holds settings for the DLL, such as the model path, model size, etc.
(You can see the code in the attached zip file below for more details.)

The thing is, I run one inference while initializing the settings inside init_inference(),
and the very first inference always takes longer than the ones after it.
For example, if I run DoInference() 5 times,
the first run takes about 37 ms,
while runs 2 through 5 take about 27–32 ms each.
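
(To be clear about what I mean by "running an inference while initializing": at the end of init_inference() I do a warm-up pass roughly like the sketch below. This is simplified, not my exact code; the context, bindings and stream stand for the objects created during initialization.)

```cpp
// Simplified warm-up sketch (TensorRT 7 C++ API); names are placeholders for the
// objects created in init_inference(). One dummy enqueue means the first real
// call from C# doesn't pay the lazy-initialization cost.
#include <NvInfer.h>
#include <cuda_runtime_api.h>

void warmUp(nvinfer1::IExecutionContext* context,
            void** deviceBindings,   // device buffers already allocated in init
            cudaStream_t stream)
{
    context->enqueueV2(deviceBindings, stream, nullptr);  // one dummy inference
    cudaStreamSynchronize(stream);                        // finish before init returns
}
```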

When the same code runs inside an .exe, it works as expected:
every run after the first one is fast.
But when it runs inside the DLL,
the speed-up only lasts for a few calls (after running inference several times in a row in a for loop),
and the later calls fall back to the speed of the very first run during initialization.

Is this caused by some object not being reused during the inference process in my code?
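
To make the question concrete, the kind of reuse I have in mind is sketched below: the runtime, engine, execution context, stream and device buffers live as globals inside the DLL, are created once in init_inference(), and are then reused by every inf_dllTRT_c1() call. The exported signatures and sizes here are placeholders for illustration, not my real code.

```cpp
// Sketch of keeping the TensorRT objects alive between DLL calls (TensorRT 7 C++ API).
// Everything heavy is created once in init_inference() and reused afterwards; the
// exported signatures and the sizes below are placeholders for illustration only.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstddef>

static nvinfer1::IRuntime*          gRuntime     = nullptr;
static nvinfer1::ICudaEngine*       gEngine      = nullptr;
static nvinfer1::IExecutionContext* gContext     = nullptr;
static cudaStream_t                 gStream      = nullptr;
static void*                        gBindings[2] = { nullptr, nullptr };  // input, output
static size_t                       gInputBytes  = 0;
static size_t                       gOutputBytes = 0;

extern "C" __declspec(dllexport) int init_inference(const char* jsonPath)
{
    // ... read the settings from jsonPath, load the serialized engine file,
    //     create gRuntime and deserialize the engine into gEngine (elided here) ...
    gContext = gEngine->createExecutionContext();
    cudaStreamCreate(&gStream);

    gInputBytes  = 1 * 3 * 1920 * 1920 * sizeof(float);  // example values; the real
    gOutputBytes = 1 * 2 * 1920 * 1920 * sizeof(float);  // sizes come from the JSON file
    cudaMalloc(&gBindings[0], gInputBytes);
    cudaMalloc(&gBindings[1], gOutputBytes);

    // Warm-up, as in the sketch above, so the first real call is already fast.
    gContext->enqueueV2(gBindings, gStream, nullptr);
    cudaStreamSynchronize(gStream);
    return 0;
}

extern "C" __declspec(dllexport) int inf_dllTRT_c1(const float* input, float* output)
{
    // Reuse the same context, stream and device buffers on every call.
    cudaMemcpyAsync(gBindings[0], input, gInputBytes, cudaMemcpyHostToDevice, gStream);
    gContext->enqueueV2(gBindings, gStream, nullptr);
    cudaMemcpyAsync(output, gBindings[1], gOutputBytes, cudaMemcpyDeviceToHost, gStream);
    cudaStreamSynchronize(gStream);
    return 0;
}
```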

Environment

TensorRT Version: 7.0.0.11
GPU Type: RTX 2080TI
Nvidia Driver Version: 451.82
CUDA Version: 10.0
CUDNN Version: 7.6.2
Operating System + Version: Windows10 1903
Python Version (if applicable): 3.7.0
TensorFlow Version (if applicable): 1.13.1
PyTorch Version (if applicable): 1.6 (?)
Baremetal or Container (if container which image + tag): -

Relevant Files

It seems that installing Cognex is necessary to run my C# code,
so I’ll just provide the .cs file instead of the project file. Here’s the link:
Ask_in_forum_cocoyen1995.zip

Thanks in advance for any help or advice!

Hi @cocoyen1995,
You can profile the application to find out what is happening.
Also, are you running a large enough workload so that the GPU isn’t just finishing quickly and powering down?
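
One quick way to check is to time the device work directly with CUDA events around the enqueue, independent of any overhead on the C# side (a minimal sketch; the context, bindings and stream are whatever your DLL already holds):

```cpp
// Minimal device-side timing sketch with CUDA events; context/bindings/stream
// are placeholders for the objects the DLL already owns.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstdio>

float timeOneInference(nvinfer1::IExecutionContext* context,
                       void** bindings, cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    context->enqueueV2(bindings, stream, nullptr);   // the actual inference
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);                      // wait for the GPU to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);          // pure GPU time, no C# overhead
    printf("GPU time: %.2f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

If the GPU time stays flat while the end-to-end time grows, the extra cost is on the host/interop side; if the GPU time itself grows, reduced clocks are the more likely cause.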

Thanks!

Hi,
Thanks for your reply!
I’m not sure whether you mean using “nvvp” (the NVIDIA Visual Profiler)?
I’ve tried it with my program, and the result looks like this:

The cudaMemcpyAsync on the left took 29 ms, and the one on the right took about 36 ms.
I didn’t see any difference in usage in the graph(?),
but most of the operations in stream 38 on the right take longer than the corresponding ones on the left.
Here is the .nvvp file:
profile.nvvp

As for the workload, I’m not sure how to measure that;
my output data is a float array of size 1920 x 1920 x 2…
(which is where most of the time difference happens…)
I’m not sure whether this is simply unavoidable with this workload(?)
(If I first run inference repeatedly in a for loop,
the reduced time only lasts for a few runs afterwards once the calls are no longer in a loop…)
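(For reference, 1920 × 1920 × 2 is 7,372,800 float values, roughly 29.5 MB at 4 bytes per float, so each inference also copies about that much data back to the host.)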

Thanks again for your help and hope to hear from you soon!

Hi @cocoyen1995,
This doesn’t look like a TRT issue.
It looks like you are thermal throttling your GPU somehow, causing it to clock down and deliver lower performance.
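
One way to confirm this on your side is to poll the SM clock, temperature and performance state with NVML while the calls are running (a minimal sketch, assuming the nvml.h header and nvml.lib that ship with the driver/CUDA toolkit are available):

```cpp
// Poll SM clock, temperature and P-state with NVML to check for clock-down
// between inference calls (sketch; error handling omitted for brevity).
#include <nvml.h>
#include <cstdio>

void printGpuState()
{
    nvmlDevice_t dev;
    unsigned int smClockMHz = 0, tempC = 0;
    nvmlPstates_t pstate;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);                        // GPU 0
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClockMHz);    // current SM clock
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
    nvmlDeviceGetPerformanceState(dev, &pstate);                // P0 = full clocks

    printf("SM clock: %u MHz, temp: %u C, P-state: P%d\n",
           smClockMHz, tempC, (int)pstate);
    nvmlShutdown();
}
```

If the SM clock drops or the P-state rises (for example towards P8) between infrequent calls, the slowdown is coming from GPU clock management rather than from TensorRT, which would also explain why only tightly looped runs stay fast.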

Thanks!