I’m working on a project where I write a DLL with TensorRT to speed up the inference process in C#.
I put my project in nvvp to see if there’s any chance to improve the performance,
and I got the result like this:
I timed the different sections of my code and found that copying data from host to device takes at most 5 ms,
but copying data from device to host takes about 32~37 ms.
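For reference, this is roughly how I measure each section (a simplified sketch: `hInput`/`dInput` are placeholder names, and the `float` element type is my assumption):

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Placeholder size matching my input: 1920 x 1920 x 1 (float assumed).
    const size_t inBytes = 1920UL * 1920UL * sizeof(float);

    float* hInput = (float*)malloc(inBytes);   // plain pageable host memory
    float* dInput = nullptr;
    if (cudaMalloc(&dInput, inBytes) != cudaSuccess) {
        printf("no CUDA device, skipping\n");
        free(hInput);
        return 0;
    }

    // Time one host-to-device copy with a wall clock; cudaMemcpy is
    // synchronous for pageable memory, so this measures the full transfer.
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(dInput, hInput, inBytes, cudaMemcpyHostToDevice);
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("H2D copy: %.3f ms\n", ms);

    cudaFree(dInput);
    free(hInput);
    return 0;
}
```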
So I changed this part of the code,
and the warning “Low Kernel/Memcpy Efficiency” disappeared after the change;
the rest of the warnings remained. (But the time consumption of the device-to-host copy didn’t improve…)
As far as I know, since I only run one picture per inference,
the only warning I need to deal with is “Low Memcpy Throughput” (?).
But after some study, I still have no idea how to improve this part in my code…
(And the bandwidth seems even smaller than before…)
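From what I’ve read, a common cause of low memcpy throughput is copying into pageable host memory; page-locked (pinned) memory allocated with `cudaMallocHost` is supposed to let the device-to-host copy run at full PCIe bandwidth. Here is a minimal sketch of what I’m considering (the buffer names and the `float` element type are my assumptions, not actual TensorRT API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Output buffer size from my case: 1920 x 1920 x 2 (float assumed).
    const size_t bytes = 1920UL * 1920UL * 2UL * sizeof(float);

    float* hPinned = nullptr;
    // cudaMallocHost returns page-locked (pinned) host memory; D2H copies
    // into pinned buffers skip the driver's internal staging copy.
    if (cudaMallocHost(&hPinned, bytes) != cudaSuccess) {
        printf("no CUDA device, skipping\n");
        return 0;
    }

    float* dOutput = nullptr;
    cudaMalloc(&dOutput, bytes);

    // Time the device-to-host copy with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(hPinned, dOutput, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H copy of %.1f MB took %.3f ms (%.2f GB/s)\n",
           bytes / 1048576.0, ms, (bytes / 1e9) / (ms / 1e3));

    cudaFree(dOutput);
    cudaFreeHost(hPinned);
    return 0;
}
```

Would this be the right direction for the “Low Memcpy Throughput” warning?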
I’ve tried running CUDA’s sample to check whether my GPU works well, and it seems fine:
PS: here’s the code of the DLL with TensorRT:
Thanks in advance for any help or advice!
The input size in my case is 1920 x 1920 x 1,
and the output size is 1920 x 1920 x 2.