Transferring data from GPU to CPU takes too much time on TX2

wangsc_up · May 27, 2019, 12:49am

Hi guys,

I ran into a data transfer problem on the Jetson TX2.

When I run inference on the Jetson TX2 with my network (ONNX format), each pass copies data from CPU to GPU, runs inference, and copies data from GPU to CPU. I found that the GPU-to-CPU transfer takes a lot of time (70 ms), about 80% of the total inference time.

The data to transfer has size 1x17x80x64. TensorRT version: 5.0.6.1. OS: Ubuntu 18.04. The copy function I am using is cudaMemcpyAsync().

I might be able to optimize this in the following ways, but some issues remain:

1. I can use pinned memory to reduce the copy time, but it does not seem to speed up my processing.
2. After the transfer, I reduce the data (1x17x80x64) to 1x2x17 with a function implemented in C++ on the CPU. I could implement this function in CUDA so that it runs on the GPU, and then transfer only the small result. Could you point me to some sources or links that would help me implement my function in CUDA or TensorRT?

I would appreciate any advice or help!

AastaLLL · May 28, 2019, 2:38am

Hi,

Please remember to maximize the system performance first.
A memory copy should take only a few milliseconds.
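
For scale, and as a rough sketch rather than a diagnosis: assuming float32 output, 1x17x80x64 is 87,040 values, or 348,160 bytes, which is small for a single copy. One thing worth ruling out (an assumption, nothing in this thread confirms it) is that a timer placed around cudaMemcpyAsync() also counts the wait for inference work still queued on the same stream. The helper below drains the stream first and then times only the copy with CUDA events. The names timeDeviceToHostCopy, measureOutputCopyMs and kOutputBytes are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Output size from the thread, assuming float32 elements (an assumption).
constexpr size_t kOutputBytes = 1 * 17 * 80 * 64 * sizeof(float);

// Hypothetical helper: milliseconds spent in one device-to-host copy,
// measured with CUDA events on `stream`.
float timeDeviceToHostCopy(void* hostDst, const void* devSrc, size_t bytes,
                           cudaStream_t stream)
{
    assert(bytes > 0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Drain work already queued on the stream (for example an asynchronous
    // inference call) so that it is not counted as copy time.
    cudaStreamSynchronize(stream);

    cudaEventRecord(start, stream);
    cudaMemcpyAsync(hostDst, devSrc, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);  // wait until the copy has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Hypothetical driver: times a copy of kOutputBytes from a scratch device
// buffer into pinned host memory.
float measureOutputCopyMs()
{
    void* dev = nullptr;
    void* host = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&dev, kOutputBytes);
    cudaMallocHost(&host, kOutputBytes);  // pinned host memory

    float ms = timeDeviceToHostCopy(host, dev, kOutputBytes, stream);

    cudaFreeHost(host);
    cudaFree(dev);
    cudaStreamDestroy(stream);
    return ms;
}
```

If the copy measured this way is short while the original measurement is not, the extra time is probably spent elsewhere on the stream rather than in the transfer itself.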

It’s recommended to use unified memory.
Here is a sample for your reference:
https://github.com/dusty-nv/jetson-inference/blob/master/tensorNet.cpp#L888
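
As a minimal sketch of the unified memory idea (not taken from the linked sample): a buffer allocated with cudaMallocManaged can be written by a kernel and then read directly on the CPU, without an explicit cudaMemcpy. The kernel fillKernel and the function managedRoundTrip are hypothetical stand-ins for the network output; in a real pipeline the managed pointer would presumably be passed to TensorRT as the output binding, which is an assumption not shown here.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Illustrative kernel standing in for the network output:
// fills the buffer on the GPU.
__global__ void fillKernel(float* data, int count, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) data[i] = value;
}

// Allocates a managed buffer, writes it on the GPU and sums it on the CPU.
// Returns the sum seen by the CPU, or -1 on a CUDA error.
float managedRoundTrip(int count, float value)
{
    assert(count > 0);
    float* buf = nullptr;
    if (cudaMallocManaged(&buf, count * sizeof(float)) != cudaSuccess)
        return -1.0f;

    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(buf, count, value);

    // The CPU must not touch the managed buffer until the GPU work is done.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(buf);
        return -1.0f;
    }

    float sum = 0.0f;
    for (int i = 0; i < count; ++i)
        sum += buf[i];  // direct CPU read, no cudaMemcpy

    cudaFree(buf);
    return sum;
}
```

For example, managedRoundTrip(17 * 80 * 64, 1.0f) writes 87,040 ones on the GPU and sums them on the CPU.
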
What kind of processing do you need? Is it a convolution operation?
If so, here is a sample for your reference:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#mnistapi_sample

Thanks.

wangsc_up · May 28, 2019, 5:46am

Hi @AastaLLL,

Thank you very much for your help.

I tried to use unified memory. It seemed that I had to include a header for it, but when I tried, the compiler gave an error like "fatal error: cudaMappedMemory.h No such file or directory". Do I need to install or configure any additional packages? I'm actually a newcomer to TensorRT.
Let me explain my operation. From the data on the GPU (N x 17 x 80 x 64), I want to find the maximum value and its position (N x 17 x 3), then apply an addition or subtraction to the maximum value and its position, and finally transfer the result to the CPU. The amount of data to transfer would then be much smaller, which I think should speed up the transfer. So this is different from a convolution operation. Are there other sources or links for this kind of operation?
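
The operation described above could be sketched in CUDA roughly as follows. This is an illustration under stated assumptions, not a tested solution for this network: the layout is assumed to be NCHW float32 with H = 80 and W = 64, the three outputs per channel are assumed to be {max value, x (column), y (row)}, and the names channelArgmax, argmaxOnGpu and kThreads are hypothetical. One block handles one (n, channel) heatmap and reduces it in shared memory. The add/subtract step is left out because the thread does not specify it.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cfloat>
#include <vector>

constexpr int kThreads = 256;  // threads per block, must be a power of two

// One block per (n, channel) heatmap of H*W floats.
// Writes {max value, x (column), y (row)} to out[3 * map .. 3 * map + 2].
__global__ void channelArgmax(const float* in, float* out, int H, int W)
{
    __shared__ float sVal[kThreads];
    __shared__ int   sIdx[kThreads];

    const int map = blockIdx.x;
    const float* src = in + (size_t)map * H * W;

    // Each thread scans a strided slice and keeps its first maximum.
    float best = -FLT_MAX;
    int bestIdx = 0;
    for (int i = threadIdx.x; i < H * W; i += blockDim.x) {
        float v = src[i];
        if (v > best) { best = v; bestIdx = i; }
    }
    sVal[threadIdx.x] = best;
    sIdx[threadIdx.x] = bestIdx;
    __syncthreads();

    // Tree reduction; ties go to the smaller flat index.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            float v = sVal[threadIdx.x + s];
            int   k = sIdx[threadIdx.x + s];
            if (v > sVal[threadIdx.x] ||
                (v == sVal[threadIdx.x] && k < sIdx[threadIdx.x])) {
                sVal[threadIdx.x] = v;
                sIdx[threadIdx.x] = k;
            }
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        out[3 * map + 0] = sVal[0];
        out[3 * map + 1] = (float)(sIdx[0] % W);  // x
        out[3 * map + 2] = (float)(sIdx[0] / W);  // y
    }
}

// Hypothetical host wrapper for trying the kernel out. In a real pipeline the
// input would already be on the GPU (the network output), so only the small
// result would be copied back.
std::vector<float> argmaxOnGpu(const std::vector<float>& host,
                               int maps, int H, int W)
{
    assert(host.size() == (size_t)maps * H * W);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, host.size() * sizeof(float));
    cudaMalloc(&dOut, (size_t)maps * 3 * sizeof(float));
    cudaMemcpy(dIn, host.data(), host.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    channelArgmax<<<maps, kThreads>>>(dIn, dOut, H, W);

    std::vector<float> result((size_t)maps * 3);
    cudaMemcpy(result.data(), dOut, result.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With N = 1 the kernel would be launched with 17 blocks, and the result copied back is 17 x 3 floats (204 bytes, assuming float32) instead of the full 17 x 80 x 64 output.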

Thank you very much.

AastaLLL · June 4, 2019, 6:25am

Hi,

1. You don’t need to include cudaMappedMemory.h.
cudaMallocManaged is already declared in cuda_runtime.h.

2. It’s recommended to first check whether this library can fulfill your requirement:
https://developer.nvidia.com/npp

Thanks.

wangsc_up · June 5, 2019, 2:03am

Hi @AastaLLL,

Thanks for your reply.
I will try your advice.

Thanks.
