When testing YOLOv5x inference on a Jetson Orin NX 8GB, I found that the CUDA memory copy that transfers the inference results back to the CPU for post-processing takes a significant amount of time. Moreover, when the network has multiple output nodes, only the copy for the first node in each inference is slow; the copies for the remaining nodes each take less than 1 ms. Are there any good ways to optimize this?
Hi,
Since TensorRT inference is an asynchronous (non-blocking) call, could you add a synchronization call after the inference as a test?
Based on your source, the CPU likely keeps running until it reaches the first memory copy, which then has to wait for the GPU to finish the inference.
So the measured period probably includes other GPU work, not just the memory copy itself.
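As a rough illustration of the suggestion above, here is a hedged sketch (not runnable as-is; it assumes a configured TensorRT `IExecutionContext* context`, a `cudaStream_t stream`, and pre-allocated device/host output buffers) that uses CUDA events to separate the true inference time from the true copy time. Without the event boundaries, a blocking timer around the first `cudaMemcpy` would absorb the entire inference duration:

```cpp
// Sketch: separating inference time from device-to-host copy time with CUDA
// events. Names `context`, `stream`, `devOut`, `hostOut`, and `bytes` are
// assumptions for illustration, not from the original post.
#include <cuda_runtime.h>
#include <cstdio>

void timedInference(/* nvinfer1::IExecutionContext* context, cudaStream_t stream,
                       void* devOut, void* hostOut, size_t bytes */) {
    cudaEvent_t start, afterInfer, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterInfer);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start, stream);
    // context->enqueueV3(stream);        // asynchronous: returns immediately
    cudaEventRecord(afterInfer, stream);
    // cudaMemcpyAsync(hostOut, devOut, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(afterCopy, stream);

    cudaStreamSynchronize(stream);        // wait for all queued GPU work

    float inferMs = 0.f, copyMs = 0.f;
    cudaEventElapsedTime(&inferMs, start, afterInfer);    // inference alone
    cudaEventElapsedTime(&copyMs, afterInfer, afterCopy); // copy alone
    printf("inference: %.2f ms, copy: %.2f ms\n", inferMs, copyMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterInfer);
    cudaEventDestroy(afterCopy);
}
```

If the "slow first copy" shrinks once timing is done this way, the copy itself was never the bottleneck; the blocking copy was simply waiting on the still-running inference.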
Thanks.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
