Hello, I’m a student learning to deploy TensorRT to accelerate YOLOv5 on a Jetson Orin Nano. While monitoring the program with Nsight Systems, I found that the cudaMemcpyAsync in the post-processing takes a long time. In particular, when the pre-processing of the second image and the post-processing of the first image run in parallel, the cudaMemcpyAsync calls in both stages take a long time. In later iterations the pre-processing cudaMemcpyAsync is not as expensive, but the post-processing cudaMemcpyAsync is almost always more expensive than the pre-processing one. Is this because my code is not well written, or is there some other reason?

I have attached my code and the nsys-rep file below. Thank you for reading; if you have any suggestions, please reply. Thank you very much, and I wish you a pleasant life.
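For reference, here is a simplified sketch of the structure my pipeline follows, with one stream each for the pre-processing copy, inference, and post-processing copy. The buffer sizes and names here are placeholders; the actual implementation is in the attached zip.

```cpp
#include <cuda_runtime.h>

// Simplified sketch of the three-stream layout (placeholder names and
// sizes; the real implementation is in the attached threethread.zip).
int main() {
    cudaStream_t pre_copy, infer, post_copy;
    cudaStreamCreate(&pre_copy);
    cudaStreamCreate(&infer);
    cudaStreamCreate(&post_copy);

    const size_t inBytes  = 3 * 640 * 640 * sizeof(float);  // YOLOv5 input
    const size_t outBytes = 25200 * 85 * sizeof(float);     // raw detections

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  inBytes);   // pinned host buffers
    cudaMallocHost(&h_out, outBytes);
    cudaMalloc(&d_in,  inBytes);
    cudaMalloc(&d_out, outBytes);

    // Pre-processing thread: HtoD upload of the prepared frame.
    cudaMemcpyAsync(d_in, h_in, inBytes, cudaMemcpyHostToDevice, pre_copy);
    cudaStreamSynchronize(pre_copy);

    // Inference thread: the TensorRT context is enqueued on `infer` here.

    // Post-processing thread: DtoH download of the raw detections.
    cudaMemcpyAsync(h_out, d_out, outBytes, cudaMemcpyDeviceToHost, post_copy);
    cudaStreamSynchronize(post_copy);

    cudaStreamDestroy(pre_copy);
    cudaStreamDestroy(infer);
    cudaStreamDestroy(post_copy);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```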
threethread.zip (41.3 MB)
Hi,
In your profiling output:
The blue part is TensorRT inference.
The red part is the DtoH memory copy. Is this the post-processing you mean?
And the green part is the HtoD copy, representing the pre-processing?
Instead of using the pre_copy, post_copy, and infer streams, could you try creating a separate stream for each thread (one thread per frame?) so that the tasks belonging to different frames can run in parallel? See the sketch below.
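For example, a minimal sketch of that idea (not tested; `processFrame`, the buffer sizes, and the thread layout are placeholders, and the TensorRT enqueue is only indicated in a comment):

```cpp
#include <cuda_runtime.h>
#include <thread>

// Minimal sketch: each frame gets its own thread and its own stream, so
// the HtoD copy, inference, and DtoH copy of one frame stay ordered,
// while different frames can overlap with each other.
void processFrame(const float* h_in, float* h_out,
                  size_t inBytes, size_t outBytes) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *d_in, *d_out;
    cudaMalloc(&d_in, inBytes);
    cudaMalloc(&d_out, outBytes);

    // Pre-processing copy for this frame.
    cudaMemcpyAsync(d_in, h_in, inBytes, cudaMemcpyHostToDevice, stream);

    // Inference would be enqueued on the same stream here; note that each
    // thread needs its own TensorRT IExecutionContext to enqueue concurrently.
    // context->enqueueV2(bindings, stream, nullptr);

    // Post-processing copy for this frame.
    cudaMemcpyAsync(h_out, d_out, outBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_in);
    cudaFree(d_out);
    cudaStreamDestroy(stream);
}

int main() {
    const size_t inBytes  = 3 * 640 * 640 * sizeof(float);
    const size_t outBytes = 25200 * 85 * sizeof(float);

    float *h_in, *h_out0, *h_out1;
    cudaMallocHost(&h_in,   inBytes);
    cudaMallocHost(&h_out0, outBytes);
    cudaMallocHost(&h_out1, outBytes);

    std::thread t0(processFrame, h_in, h_out0, inBytes, outBytes);
    std::thread t1(processFrame, h_in, h_out1, inBytes, outBytes);
    t0.join();
    t1.join();

    cudaFreeHost(h_in);
    cudaFreeHost(h_out0);
    cudaFreeHost(h_out1);
    return 0;
}
```

Note that the host buffers are allocated with cudaMallocHost: cudaMemcpyAsync can only overlap with other work when the host memory is pinned.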
Thanks.
Thank you for your reply! I will try it.
Is this still an issue that needs support? Is there any result you can share?