Avoid synchronization in optixLaunch

Dear all,

I am currently profiling an application using NSight Systems to remove CPU<->GPU synchronization points and improve performance. While doing this, I realized that optixLaunch appears to call cuStreamSynchronize internally (nsight systems can be set up to capture the backtrace of CUDA synchronization calls, and optixTrace shows up as cause).

My code submits work to a custom CUDA stream that has been created in non-blocking mode, and that includes the “optixLaunch” call. I am not generating concurrent OptiX launches, what I want is simply that the CPU can run ahead while the GPU is busy to generate the launch needed for the next frame. However, this all falls apart if optixLaunch then does further synchronization.

Ideas?

Thanks,
Wenzel

Another potentially synchronization-related issue that I observe is that optixAccelBuild generates various “NVIDIA internal” kernels, but then ends up stuck for a long time in a cuMemcpyAsync operation. What seems to be happening here is that the acceleration structure build is actually synchronizing with the CPU that is now waiting for a preceding rendering step to finish.

In case it matters, this is on x86_64 linux (Ubuntu 20.04) with driver 515.48.08.

Hi @wenzel.jakob!

So looking at the code for optixLaunch(), I do see one call to cuStreamSychnorize() that happens only when validation mode debug exceptions are enabled. Is that the case here, do you have validation mode enabled? (Validation mode intentionally serializes OptiX launches in order to facilitate debugging and rule out all the difficult async problems.)


David.

For the optixAccelBuild(), that sounds different. I’m just thinking out loud… is the device pointer used for the memcpy also used in any preceding stream API calls? Is there any possibility that the host buffer could be paged out when the call is made? I’ll ask the CUDA team what reasons a cudaMemcpyAsync() call might stall or synchronize.


David.

Perhaps this thread is relevant? Robert notes that multiple memory operations will serialize due to PCI rules, and the rest of the thread has some good hints too I think: cudaMemcpyAsync - #2 by Robert_Crovella


David.