I’m running a performance-critical application in which I want to do real-time depth estimation.
For that purpose, I run VPI Harris on a stereo image pair and then do some further CUDA processing.
The issue is a severe performance hit whenever I switch from VPI processing to CUDA processing with the latest VPI 2.0.
I want to achieve back-to-back execution and 100% GPU utilization at all times, but there’s a very painful sync/launch overhead (marked with red arrows in the attached image).
Do you have any suggestions on how I can mitigate this and get better GPU utilization?
Running the two Harris calls on multiple streams gets me back-to-back execution in the middle, between the two Harris invocations, but I cannot efficiently resolve the launch and sync overhead at the beginning and end.
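To make the multi-stream setup concrete, here is a minimal sketch in plain CUDA of the pattern I mean. It is not my real code and it does not call VPI: `fakeHarris`, `postProcess` and `runPipeline` are hypothetical placeholders standing in for the two Harris invocations and the follow-up CUDA processing. The sketch assumes the follow-up kernel can be chained to both streams with a CUDA event rather than a host-side sync.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for one Harris invocation (not a VPI kernel).
__global__ void fakeHarris(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Hypothetical stand-in for the follow-up CUDA processing on both results.
__global__ void postProcess(const float* left, const float* right,
                            float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = left[i] + right[i];
}

// Runs the two-stream pipeline on n elements; returns true on success.
bool runPipeline(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *in = nullptr, *left = nullptr, *right = nullptr, *out = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&left, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&right, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&out, bytes) != cudaSuccess) return false;

    std::vector<float> host(n, 1.0f);
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();  // one-time setup sync, outside the hot path

    cudaStream_t sLeft, sRight;
    cudaStreamCreateWithFlags(&sLeft, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&sRight, cudaStreamNonBlocking);
    cudaEvent_t rightDone;
    cudaEventCreateWithFlags(&rightDone, cudaEventDisableTiming);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // The two "Harris" calls are issued on separate streams so they may overlap.
    fakeHarris<<<blocks, threads, 0, sLeft>>>(in, left, n);
    fakeHarris<<<blocks, threads, 0, sRight>>>(in, right, n);

    // Device-side dependency: sLeft waits for sRight without blocking the host.
    cudaEventRecord(rightDone, sRight);
    cudaStreamWaitEvent(sLeft, rightDone, 0);
    postProcess<<<blocks, threads, 0, sLeft>>>(left, right, out, n);

    // Single host sync at the very end of the frame.
    cudaMemcpyAsync(host.data(), out, bytes, cudaMemcpyDeviceToHost, sLeft);
    bool ok = (cudaStreamSynchronize(sLeft) == cudaSuccess) &&
              (cudaGetLastError() == cudaSuccess);

    // Each element should be 1*2 + 1*2 = 4.
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 4.0f);

    cudaEventDestroy(rightDone);
    cudaStreamDestroy(sLeft);
    cudaStreamDestroy(sRight);
    cudaFree(in); cudaFree(left); cudaFree(right); cudaFree(out);
    return ok;
}
```

Calling `runPipeline(1 << 16)` from a `main` and building with `nvcc` should be enough to try it. In my real application the two placeholder launches are VPI submissions, and that is exactly where the overhead shows up.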
Any suggestion is most welcome.