Transferring data from GPU to CPU takes too much time

Hi guys,

I've run into a data transfer problem on the Jetson TX2.
When I run inference (copy data from CPU to GPU, run inference, copy the results from GPU to CPU) with my deep learning network (ONNX format), I found that transferring the data from GPU to CPU takes a lot of time: about 80% of the total runtime.

The size of the data to transfer is 1x17x80x64. TensorRT version: 5.1; OS: Ubuntu 18.04. The copy function I am using is cudaMemcpyAsync().

Maybe I can optimize this in the following ways, but there are still some issues I'd like to sort out:

  1. I could use pinned memory to speed up the copy. Is this an efficient approach, and how would I implement it? (See the sketch after this list.)
  2. In fact, after transferring the data to the CPU, I post-process it (1x17x80x64) down to 1x2x17 with a function implemented in C++. I could implement this function in CUDA so that it runs on the GPU, and then transfer only the small result (1x2x17). Is this an efficient way to optimize? Can you provide some links to help me implement my function in CUDA? (See the kernel sketch below.)

I would appreciate any advice and help!