Help Post: Issues with Deploying TensorRT to Accelerate YOLOv5 on Jetson Orin Nano

Hello, I’m a student learning to deploy TensorRT to accelerate YOLOv5 on a Jetson Orin Nano. While monitoring the program with Nsight Systems, I found that the cudaMemcpyAsync in post-processing takes a long time. In particular, when the pre-processing of the second image and the post-processing of the first image run in parallel, the cudaMemcpyAsync calls of both stages consume a lot of time. The pre-processing thread’s cudaMemcpyAsync becomes much cheaper on later frames, while the post-processing cudaMemcpyAsync is almost always more expensive than the pre-processing one. Is this because my code is poorly written, or is there another reason? I’m attaching my code and the nsys-rep file below. Thank you for reading; if you have any suggestions, please reply. Thank you very much, and I wish you a pleasant life.
threethread.zip (41.3 MB)

Hi,

In your profiling output:

The blue part is the TensorRT inference.
The red one is the DtoH memory copy. Is this the post-processing you mentioned?
And the green one is the HtoD copy, i.e., the pre-processing?

Instead of using separate pre_copy, post_copy, and infer streams, could you try creating a dedicated stream for each thread (one thread per frame?) so that the tasks belonging to different frames can run in parallel? A rough sketch of this pattern is below.
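For reference, a minimal sketch of the per-thread-stream idea. The function name `processFrame`, the buffer pointers, and the single input/output binding layout are placeholders for your own code; this also assumes an `enqueueV2`-style TensorRT execution context per thread:

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Sketch: one CUDA stream per worker thread, so the HtoD copy, inference,
// and DtoH copy for one frame are serialized on that stream, while the
// streams of different threads/frames can overlap with each other.
void processFrame(nvinfer1::IExecutionContext* context,
                  void* dInput, void* dOutput,      // device buffers
                  void* hInput, void* hOutput,      // host buffers (ideally pinned)
                  size_t inBytes, size_t outBytes)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);  // dedicated stream for this thread/frame

    void* bindings[] = {dInput, dOutput};  // assumes one input, one output

    // Pre-processing copy, inference, and post-processing copy are all
    // enqueued on the same stream, so they run in order for this frame
    // but do not block other threads' streams.
    cudaMemcpyAsync(dInput, hInput, inBytes, cudaMemcpyHostToDevice, stream);
    context->enqueueV2(bindings, stream, nullptr);
    cudaMemcpyAsync(hOutput, dOutput, outBytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait only for this frame's work
    cudaStreamDestroy(stream);
}
```

One more thing worth checking: cudaMemcpyAsync can only overlap with other work when the host buffers are pinned (allocated with cudaHostAlloc/cudaMallocHost). With pageable host memory the copy is staged internally and can show up as unexpectedly long in Nsight Systems, which may explain part of what you are seeing.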

Thanks.

Thank you for your reply! I will try it.

Is this still an issue that needs support? Is there any result you can share?