VPI efficiency issues

I’m running a performance-critical application that does realtime depth estimation.
To that end, I run VPI Harris on a stereo image pair and then do some further CUDA processing.
The problem is a severe performance hit whenever I switch from VPI processing to CUDA processing with the latest VPI 2.0.

I want back-to-back execution and 100% GPU utilization at all times, but there’s a very painful sync/launch overhead (marked with red arrows in the screenshot).

Do you have any suggestion how I can mitigate this and get better GPU utilization?
Running the two Harris calls on multiple streams gets me back-to-back execution in the middle, between the two Harris invocations, but I cannot efficiently resolve the launch and sync overhead at the beginning and end.

Any suggestion is most welcome.

Hi,

Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Have you tried launching the tasks on the same CUDA stream and then synchronizing once at the end?
Tasks submitted to the same CUDA stream are guaranteed to execute in order.
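
To illustrate the idea, here is a minimal plain-CUDA sketch, not your actual pipeline. The kernels stage_a and stage_b and the helper run_pipeline are hypothetical stand-ins for the VPI stage and your follow-up CUDA processing. Both are queued on one stream with no host-side synchronization in between, and the host waits only once at the end.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the first stage (the work VPI would submit).
__global__ void stage_a(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

// Hypothetical stand-in for the follow-up CUDA processing.
__global__ void stage_b(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}

// Queues both stages on a single stream and synchronizes once.
// Returns the first output element, or -1.0f if any CUDA call failed.
float run_pipeline()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t stream;
    float *d = nullptr;
    float first = -1.0f;

    if (cudaStreamCreate(&stream) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return -1.0f;
    }

    // Everything below is asynchronous with respect to the host and is
    // executed in submission order because it shares one stream.
    cudaMemsetAsync(d, 0, n * sizeof(float), stream);
    stage_a<<<blocks, threads, 0, stream>>>(d, n);
    stage_b<<<blocks, threads, 0, stream>>>(d, n);  // no sync between stages
    cudaMemcpyAsync(&first, d, sizeof(float), cudaMemcpyDeviceToHost, stream);

    // Single synchronization point for the whole chain.
    if (cudaStreamSynchronize(stream) != cudaSuccess) first = -1.0f;

    cudaFree(d);
    cudaStreamDestroy(stream);
    return first;
}

int main()
{
    printf("first element = %f\n", run_pipeline());
    return 0;
}
```

For the VPI side, as far as I know vpi/CUDAInterop.h provides vpiStreamCreateWrapperCUDA to wrap an existing cudaStream_t in a VPIStream, so the Harris submission and your own kernels can share one stream. Please check the VPI documentation for your version for the exact signature and flags.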

Thanks.
