Data transfer bottleneck during inference on Jetson TX2
When I run inference with my deep learning network (ONNX format) on a Jetson TX2, the pipeline is: copy input from CPU to GPU, run inference, copy the output from GPU back to CPU. I found that the GPU-to-CPU transfer is very slow: it takes about 80% of the total time.
The tensor being transferred is 1x17x80x64. TensorRT version: 5.1; OS: Ubuntu 18.04. The copy function I am using is cudaMemcpyAsync().
I think I could optimize this in the following ways, but I still have some questions:
- I could use pinned memory to reduce the copy time. Is this an efficient approach, and how should I implement it?
- After transferring the data to the CPU, I currently post-process the 1x17x80x64 tensor down to 1x2x17 with a function implemented in C++. I could instead implement this function in CUDA so it runs on the GPU, and then transfer only the small 1x2x17 result. Is this an efficient way to optimize? Can you provide some links that would help me implement the function in CUDA?
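For the first idea, a minimal sketch of how the output buffer could be allocated as pinned (page-locked) memory is below. Pinned memory is required for `cudaMemcpyAsync()` to be truly asynchronous; with pageable memory the call degrades to a synchronous copy. The buffer size and names here are taken from the question; the TensorRT enqueue step is elided.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t count = 1 * 17 * 80 * 64;   // output tensor size from the question
    const size_t bytes = count * sizeof(float);

    float* h_output = nullptr;
    float* d_output = nullptr;

    // Allocate page-locked (pinned) host memory instead of malloc/new.
    // The DMA engine can copy from it directly, and cudaMemcpyAsync
    // only overlaps with other work when the host buffer is pinned.
    cudaHostAlloc(&h_output, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_output, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... enqueue TensorRT inference on `stream`, writing to d_output ...

    // Asynchronous device-to-host copy on the same stream as inference.
    cudaMemcpyAsync(h_output, d_output, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);   // wait before reading h_output on the CPU

    cudaFreeHost(h_output);
    cudaFree(d_output);
    cudaStreamDestroy(stream);
    return 0;
}
```

Note that on Jetson the CPU and GPU share the same physical memory, so zero-copy mapped memory (`cudaHostAllocMapped`) or unified memory (`cudaMallocManaged`) may let you skip the explicit copy entirely; whether that is faster than a pinned copy depends on the access pattern, so it is worth benchmarking both.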
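For the second idea, moving the reduction to the GPU is usually worthwhile, since copying 1x2x17 values is negligible compared with 1x17x80x64. The question does not say what the C++ post-processing does, so the kernel below is only a hypothetical sketch: it assumes the reduction is a per-channel argmax over 80x64 heatmaps producing an (x, y) pair per channel, which is a common pattern for keypoint networks. Substitute your real logic; the structure (one block per channel, shared-memory tree reduction) carries over.

```cpp
// Hypothetical CUDA kernel: one block per channel reduces an 80x64 heatmap
// to the (x, y) location of its maximum. Assumes blockDim.x == 256.
__global__ void heatmapArgmax(const float* heatmaps,  // [17][80][64] on device
                              float* coords,          // [2][17]: row 0 = x, row 1 = y
                              int channels, int height, int width) {
    const int c = blockIdx.x;                      // channel index (0..16)
    const float* map = heatmaps + c * height * width;

    __shared__ float bestVal[256];
    __shared__ int   bestIdx[256];

    // Each thread scans a strided slice of the heatmap.
    float v = -1e30f;
    int   idx = 0;
    for (int i = threadIdx.x; i < height * width; i += blockDim.x) {
        if (map[i] > v) { v = map[i]; idx = i; }
    }
    bestVal[threadIdx.x] = v;
    bestIdx[threadIdx.x] = idx;
    __syncthreads();

    // Tree reduction in shared memory to find the block-wide maximum.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && bestVal[threadIdx.x + s] > bestVal[threadIdx.x]) {
            bestVal[threadIdx.x] = bestVal[threadIdx.x + s];
            bestIdx[threadIdx.x] = bestIdx[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        coords[0 * channels + c] = (float)(bestIdx[0] % width);   // x
        coords[1 * channels + c] = (float)(bestIdx[0] / width);   // y
    }
}

// Launch on the same stream as inference, then copy only 2*17 floats back:
//   heatmapArgmax<<<17, 256, 0, stream>>>(d_heatmaps, d_coords, 17, 80, 64);
//   cudaMemcpyAsync(h_coords, d_coords, 2 * 17 * sizeof(float),
//                   cudaMemcpyDeviceToHost, stream);
```

For learning material, the CUDA C++ Programming Guide and NVIDIA's classic "Optimizing Parallel Reduction in CUDA" slides cover exactly this shared-memory reduction pattern.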
I would appreciate any advice and help!