Transferring data from GPU to CPU takes too much time on TX2

Hi guys,

I ran into a data-transfer problem on the Jetson TX2.

When I run inference (copy data from CPU to GPU, run inference, copy results from GPU to CPU) on the Jetson TX2 with my network (ONNX format), I found that transferring data from GPU to CPU takes a lot of time: about 80% of the total inference time.

The output tensor that needs to be transferred is 1x17x80x64. TensorRT version: 5.0.6.1, OS: Ubuntu 18.04. The copy function I am using is cudaMemcpyAsync().

Maybe I can optimize this in the following ways, but there are still some open questions:

1. I could use pinned memory to speed up the memory copy, but it does not seem to reduce my overall processing time (see the sketch after this list).
2. After transferring the data to the CPU, I actually post-process the 1x17x80x64 output down to 1x2x17 with a function implemented in C++. I could implement this function in CUDA so that it runs on the GPU, and then only the small result would need to be transferred. Can you provide some resources or links to help me implement my function in CUDA or TensorRT? (A rough kernel sketch follows below.)
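For reference on point 1, here is a minimal sketch of how pinned (page-locked) host memory is typically combined with cudaMemcpyAsync(). The buffer names and the inference placeholder are illustrative, not from my actual code. Note that cudaMemcpyAsync() only behaves asynchronously when the host buffer is pinned; with ordinary pageable memory the device-to-host copy is effectively synchronous. Also, on the TX2 the CPU and GPU share the same physical DRAM, so pinned memory avoids an extra staging copy.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Output tensor 1x17x80x64, as in the post.
    const size_t count = 1 * 17 * 80 * 64;
    const size_t bytes = count * sizeof(float);

    float* d_output = nullptr;   // device buffer (e.g. a TensorRT binding)
    float* h_output = nullptr;   // pinned host buffer

    cudaMalloc((void**)&d_output, bytes);
    // cudaHostAlloc returns page-locked memory, which enables truly
    // asynchronous transfers with cudaMemcpyAsync.
    cudaHostAlloc((void**)&h_output, bytes, cudaHostAllocDefault);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... enqueue inference on `stream` here (e.g. context->enqueue) ...

    // Async device-to-host copy, queued on the same stream as inference.
    cudaMemcpyAsync(h_output, d_output, bytes,
                    cudaMemcpyDeviceToHost, stream);

    // Block only when the result is actually needed.
    cudaStreamSynchronize(stream);

    printf("first value: %f\n", h_output[0]);

    cudaFreeHost(h_output);
    cudaFree(d_output);
    cudaStreamDestroy(stream);
    return 0;
}
```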
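And for point 2: I don't know exactly what the C++ post-processing computes, but since 1x17x80x64 reduced to 1x2x17 looks like extracting an (x, y) location per channel (a common argmax over keypoint heatmaps), here is a hedged sketch of what such a kernel could look like. The kernel name, the memory layout, and the argmax assumption are all illustrative; the body would need to be replaced with the real reduction. The point is that the kernel writes only 2x17 floats, so the device-to-host copy afterwards becomes tiny.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: one block per channel; each block finds the
// argmax of its 80x64 heatmap and writes the (x, y) coordinate.
// This assumes the post-processing is a per-channel argmax; replace
// the body with your real computation if it is something else.
__global__ void heatmapArgmax(const float* heatmaps,  // [17][80][64]
                              float* coords,          // [2][17] -> x row, y row
                              int height, int width) {
    const int c = blockIdx.x;                       // channel index, 0..16
    const float* map = heatmaps + c * height * width;

    __shared__ float bestVal[256];
    __shared__ int   bestIdx[256];

    float v = -1e30f;
    int   idx = 0;
    // Strided scan: each thread covers part of the 80x64 map.
    for (int i = threadIdx.x; i < height * width; i += blockDim.x) {
        if (map[i] > v) { v = map[i]; idx = i; }
    }
    bestVal[threadIdx.x] = v;
    bestIdx[threadIdx.x] = idx;
    __syncthreads();

    // Tree reduction in shared memory to find the block-wide maximum.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && bestVal[threadIdx.x + s] > bestVal[threadIdx.x]) {
            bestVal[threadIdx.x] = bestVal[threadIdx.x + s];
            bestIdx[threadIdx.x] = bestIdx[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        coords[0 * gridDim.x + c] = (float)(bestIdx[0] % width);  // x
        coords[1 * gridDim.x + c] = (float)(bestIdx[0] / width);  // y
    }
}

// Launch example: 17 blocks (one per channel), 256 threads each,
// on the same stream as inference so no extra synchronization is needed:
// heatmapArgmax<<<17, 256, 0, stream>>>(d_output, d_coords, 80, 64);
```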

I would appreciate any advice and help!

Right,
the input needs to be transferred from CPU to GPU, and the output from GPU to CPU, and it's time-consuming.
Hoping for any advice to help sort it out.
Thanks