Transferring data from GPU to CPU takes too much time on TX2


I ran into a data transfer problem on the Jetson TX2.

When I run inference on the Jetson TX2 with my network (ONNX format), the pipeline is: copy data from CPU to GPU, run inference, copy data from GPU to CPU. I found that the GPU-to-CPU copy takes a lot of time (70 ms), which is about 80% of the total inference time.

The data to transfer has size 1x17x80x64. TensorRT version:, Linux version: Ubuntu 18.04. The copy function I am using is cudaMemcpyAsync().

I may be able to optimize this in the following ways, but each still has open issues:

1. I can use pinned memory to reduce the copy time, but it does not seem to speed up my processing.
2. After the transfer, I reduce the data (1x17x80x64) to 1x2x17 with a function implemented in C++ on the CPU. I could implement this function in CUDA so that it runs on the GPU, and then transfer only the small result. Can you point me to some sources or links that would help me implement this function in CUDA or TensorRT?

I would appreciate any advice and help!


Please remember to maximize the system performance first.
A memory copy of this size should take only a few milliseconds.

  1. It’s recommended to use unified memory.
    Here is a sample for your reference:

  2. What kind of processing do you need? Is it a convolution operation?
    If yes, here is a sample for your reference:



Thank you very much for your help.

  1. I tried to use unified memory. I thought I had to include an extra header for it, but when I build, the compiler gives an error like "fatal error: cudaMappedMemory.h No such file or directory". Do I need to install or configure any additional packages? I'm a newcomer to TensorRT.
  2. Let me explain my operation. I want to find the maximum value and its position (N x 17 x 3) in the data on the GPU (N x 17 x 80 x 64), then apply an add or subtract to the maximum value and its position, and only then transfer the result to the CPU. The amount of data to transfer is then much smaller, which I think should make the transfer faster. So it is different from a convolution operation. Are there other sources or links for this kind of operation?

Thank you very much.


1. You don’t need to include cudaMappedMemory.h.
cudaMallocManaged is already declared in cuda_runtime.h.

2. It’s recommended to check if this library can fulfill your requirement first:
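
If no library routine fits, the max-and-position step described earlier in the thread can also be written as a small custom kernel. The following is only a minimal sketch under assumptions that the thread does not confirm: float32 data in NCHW layout, one thread block per heatmap, and an output of (max value, x, y) per heatmap. The names channelArgmax and runChannelArgmax are hypothetical, and the add/subtract post-processing is left out.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// One thread block per heatmap, i.e. per (n, c) pair. Each thread scans a
// strided slice of the height*width plane, then a shared-memory tree
// reduction picks the overall winner. Ties resolve to the lowest index.
// out receives 3 floats per heatmap: max value, x (column), y (row).
template <int THREADS>
__global__ void channelArgmax(const float* in, float* out, int height, int width)
{
    __shared__ float sVal[THREADS];
    __shared__ int   sIdx[THREADS];

    const int plane = height * width;
    const float* map = in + (size_t)blockIdx.x * plane;

    // Per-thread scan over a strided slice of the plane.
    float best = -FLT_MAX;
    int bestIdx = 0;
    for (int i = threadIdx.x; i < plane; i += THREADS) {
        float v = map[i];
        if (v > best) { best = v; bestIdx = i; }
    }
    sVal[threadIdx.x] = best;
    sIdx[threadIdx.x] = bestIdx;
    __syncthreads();

    // Tree reduction in shared memory (THREADS must be a power of two).
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            float otherVal = sVal[threadIdx.x + s];
            int   otherIdx = sIdx[threadIdx.x + s];
            if (otherVal > sVal[threadIdx.x] ||
                (otherVal == sVal[threadIdx.x] && otherIdx < sIdx[threadIdx.x])) {
                sVal[threadIdx.x] = otherVal;
                sIdx[threadIdx.x] = otherIdx;
            }
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        float* o = out + (size_t)blockIdx.x * 3;
        o[0] = sVal[0];
        o[1] = (float)(sIdx[0] % width);  // x
        o[2] = (float)(sIdx[0] / width);  // y
    }
}

// Hypothetical host helper: launches the kernel on device data dIn and
// copies back only the small result (numMaps * 3 floats) into hOut.
cudaError_t runChannelArgmax(const float* dIn, float* hOut, int numMaps,
                             int height, int width, cudaStream_t stream)
{
    float* dOut = nullptr;
    const size_t bytes = (size_t)numMaps * 3 * sizeof(float);
    cudaError_t err = cudaMalloc(&dOut, bytes);
    if (err != cudaSuccess) return err;

    channelArgmax<256><<<numMaps, 256, 0, stream>>>(dIn, dOut, height, width);
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, stream);
    if (err == cudaSuccess)
        err = cudaStreamSynchronize(stream);

    cudaFree(dOut);
    return err;
}
```

With N = 1 and 17 heatmaps of 80 x 64, this copies back 51 floats instead of 87,040 values. In a real pipeline the output buffer would normally be allocated once and reused rather than allocated on every call.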



Thanks for your reply.
I will try your advice.
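
For the unified memory suggestion above, a minimal sketch of a managed allocation may help. It uses only cuda_runtime.h, as stated in the reply. The kernel scaleInPlace and the helper managedRoundTrip are hypothetical names made up for illustration, and the example is not tied to TensorRT buffers.

```cuda
#include <cuda_runtime.h>

// Trivial kernel that stands in for any GPU work writing to the buffer.
__global__ void scaleInPlace(float* data, int count, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] *= factor;
}

// Hypothetical helper: allocates one managed buffer, fills it on the CPU,
// modifies it on the GPU, then reads it on the CPU without any cudaMemcpy.
// Returns the sum of the elements, or -1.0f on a CUDA error.
float managedRoundTrip(int count)
{
    float* buf = nullptr;
    if (cudaMallocManaged(&buf, count * sizeof(float)) != cudaSuccess)
        return -1.0f;

    for (int i = 0; i < count; ++i) buf[i] = 1.0f;  // CPU writes directly

    scaleInPlace<<<(count + 255) / 256, 256>>>(buf, count, 2.0f);

    // Synchronize before the CPU touches the buffer again; reading managed
    // memory while the kernel may still be running is not safe.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(buf);
        return -1.0f;
    }

    float sum = 0.0f;
    for (int i = 0; i < count; ++i) sum += buf[i];  // CPU reads directly

    cudaFree(buf);
    return sum;
}
```

The same pointer is valid on both the CPU and the GPU, so the explicit cudaMemcpyAsync() step disappears; the synchronization before CPU access is still required.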