Hi,
I have run SSD using TensorRT on a Jetson TX1. I have observed that CUDA memcpy (host to device) takes up almost 20% of the inference time, with 4506 calls. Is it normal in deep learning inference for memcpy to consume 20% of the time? Isn't this one of the things TensorRT optimizes, by fusing layers and reducing host-to-device memcpy?
Thank you
Hi,
It depends on the use case.
Could you share the following information with us?
- Have you used a plugin implementation?
- Could you check whether the memcpy occurs when the image is passed to TensorRT? (A typical pattern is sketched below.)
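To help pin down where the copies come from, here is a minimal sketch (not your code, just the usual pattern) of the place an HtoD copy normally appears in a TensorRT inference loop: the upload of the preprocessed image into the input binding buffer. Names such as `inferOneFrame`, `deviceBindings`, and `inputSizeBytes` are illustrative assumptions.

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Hedged sketch of a typical per-frame inference call with explicit copies.
void inferOneFrame(nvinfer1::IExecutionContext* context,
                   void** deviceBindings,     // device pointers for all bindings
                   const float* hostImage,    // preprocessed image on the CPU
                   size_t inputSizeBytes,
                   cudaStream_t stream)
{
    // The upload of the preprocessed image is what the profiler reports as
    // [CUDA memcpy HtoD]; binding index 0 is assumed to be the network input.
    cudaMemcpyAsync(deviceBindings[0], hostImage, inputSizeBytes,
                    cudaMemcpyHostToDevice, stream);

    // TensorRT then runs the whole network on the GPU.
    context->enqueue(1 /* batchSize */, deviceBindings, stream, nullptr);

    cudaStreamSynchronize(stream);
}
```

If your plugin or preprocessing code issues additional copies like this per layer or per frame, that would explain the large call count.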
Here is an introduction to layer fusion in TensorRT:
https://devblogs.nvidia.com/deploying-deep-learning-nvidia-tensorrt/
Thanks.
Hi,
- Yes, I have used a plugin implementation.
- The two pieces of code that take the most time are:
2.1. trtwell_scudnn_winograd_128x128_ldg1_ldg4_mobile_relu_tile148t_nt → 20.17% of the time, 1442 calls
2.2. [CUDA memcpy HtoD] → 18.02% of the time, 4504 calls
What does the kernel name in 2.1 mean?
Thank you
Hi,
Could you share some information about your plugin implementation?
Is it pure CUDA code, or does it include CPU tasks with memcpy?
Our guess is that operation 2.1 is your first convolution operation, and it is reasonable for it to take the most time.
Operation 2.2 is the CUDA memcpy from the CPU input buffer to the GPU.
To save this memory transfer time, we recommend trying our unified memory technique:
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
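Below is a minimal sketch of the idea from that blog post; the buffer names and the input size are illustrative assumptions, not taken from your application. Since the Jetson TX1 CPU and GPU share the same physical memory, a managed allocation lets preprocessing write directly into the buffer the GPU reads, which can remove the explicit HtoD copy from the profile.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Illustrative input size (e.g. a 3x300x300 float image for SSD).
    const size_t inputSizeBytes = 3 * 300 * 300 * sizeof(float);

    // Explicit-copy pattern: a host buffer is filled by preprocessing and
    // then copied to a separate device buffer; that transfer is what shows
    // up in the profile as [CUDA memcpy HtoD].
    //   float* hostBuf = (float*)malloc(inputSizeBytes);
    //   float* devBuf  = nullptr;
    //   cudaMalloc(&devBuf, inputSizeBytes);
    //   cudaMemcpy(devBuf, hostBuf, inputSizeBytes, cudaMemcpyHostToDevice);

    // Unified-memory pattern: one managed allocation is visible to both the
    // CPU and the GPU, so preprocessing can write directly into the buffer
    // that is later passed to the network as its input binding.
    float* unifiedBuf = nullptr;
    if (cudaMallocManaged(&unifiedBuf, inputSizeBytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }

    // CPU-side preprocessing writes straight into the managed buffer ...
    for (size_t i = 0; i < inputSizeBytes / sizeof(float); ++i)
        unifiedBuf[i] = 0.0f;

    // ... and the same pointer can be used on the GPU without an explicit
    // host-to-device memcpy in between.
    cudaDeviceSynchronize();
    cudaFree(unifiedBuf);
    return 0;
}
```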
Thanks.