I’m working on a project where I write a DLL with TensorRT to speed up the inference process in C#.
I put my project in nvvp to see if there’s any chance to improve the performance,
and I got the result like this:
I timed the different sections of my code and found that copying data from host to device takes at most 5 ms,
but copying data from device to host takes about 32~37 ms.
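For reference, this is roughly how I measure each section (a simplified sketch: `hInput`/`dInput` are placeholder names, and the `float` element type is my assumption):

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Placeholder size matching my input: 1920 x 1920 x 1 (float assumed).
    const size_t inBytes = 1920UL * 1920UL * sizeof(float);

    float* hInput = (float*)malloc(inBytes);   // plain pageable host memory
    float* dInput = nullptr;
    if (cudaMalloc(&dInput, inBytes) != cudaSuccess) {
        printf("no CUDA device, skipping\n");
        free(hInput);
        return 0;
    }

    // Time one host-to-device copy with a wall clock; cudaMemcpy is
    // synchronous for pageable memory, so this measures the full transfer.
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(dInput, hInput, inBytes, cudaMemcpyHostToDevice);
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("H2D copy: %.3f ms\n", ms);

    cudaFree(dInput);
    free(hInput);
    return 0;
}
```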
So I changed this part of the code,
and the warning “Low Kernel/Memcpy Efficiency” disappeared after the change;
the rest of the warnings remained. (But the time consumption of the device-to-host copy didn’t improve…)
As far as I know, since I only run one picture per inference,
the only warning I need to deal with is “Low Memcpy Throughput” (?).
But after some study, I still have no idea how to improve this part in my code…
(And the bandwidth seems even smaller than before…)
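From what I’ve read, a common cause of low memcpy throughput is copying into pageable host memory; page-locked (pinned) memory allocated with `cudaMallocHost` is supposed to let the device-to-host copy run at full PCIe bandwidth. Here is a minimal sketch of what I’m considering (the buffer names and the `float` element type are my assumptions, not actual TensorRT API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Output buffer size from my case: 1920 x 1920 x 2 (float assumed).
    const size_t bytes = 1920UL * 1920UL * 2UL * sizeof(float);

    float* hPinned = nullptr;
    // cudaMallocHost returns page-locked (pinned) host memory; D2H copies
    // into pinned buffers skip the driver's internal staging copy.
    if (cudaMallocHost(&hPinned, bytes) != cudaSuccess) {
        printf("no CUDA device, skipping\n");
        return 0;
    }

    float* dOutput = nullptr;
    cudaMalloc(&dOutput, bytes);

    // Time the device-to-host copy with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(hPinned, dOutput, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H copy of %.1f MB took %.3f ms (%.2f GB/s)\n",
           bytes / 1048576.0, ms, (bytes / 1e9) / (ms / 1e3));

    cudaFree(dOutput);
    cudaFreeHost(hPinned);
    return 0;
}
```

Would this be the right direction for the “Low Memcpy Throughput” warning?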
I’ve tried running CUDA’s sample to check whether my GPU works well, and it seems fine:
PS: here’s the code of the DLL with TensorRT:
Thanks in advance for any help or advice!
The input size in my case is 1920 x 1920 x 1,
and the output size is 1920 x 1920 x 2.