Avoid synchronization in optixLaunch

wenzel.jakob · July 21, 2022, 11:45am

Dear all,

I am currently profiling an application using NSight Systems to remove CPU<->GPU synchronization points and improve performance. While doing this, I realized that optixLaunch appears to call cuStreamSynchronize internally (nsight systems can be set up to capture the backtrace of CUDA synchronization calls, and optixTrace shows up as cause).

My code submits work to a custom CUDA stream that has been created in non-blocking mode, and that includes the “optixLaunch” call. I am not generating concurrent OptiX launches, what I want is simply that the CPU can run ahead while the GPU is busy to generate the launch needed for the next frame. However, this all falls apart if optixLaunch then does further synchronization.

Ideas?

Thanks,
Wenzel

wenzel.jakob · July 21, 2022, 12:02pm

Another potentially synchronization-related issue that I observe is that optixAccelBuild generates various “NVIDIA internal” kernels, but then ends up stuck for a long time in a cuMemcpyAsync operation. What seems to be happening here is that the acceleration structure build is actually synchronizing with the CPU that is now waiting for a preceding rendering step to finish.

wenzel.jakob · July 21, 2022, 12:05pm

In case it matters, this is on x86_64 linux (Ubuntu 20.04) with driver 515.48.08.

dhart · July 21, 2022, 5:37pm

Hi @wenzel.jakob!

So looking at the code for optixLaunch(), I do see one call to cuStreamSychnorize() that happens only when validation mode debug exceptions are enabled. Is that the case here, do you have validation mode enabled? (Validation mode intentionally serializes OptiX launches in order to facilitate debugging and rule out all the difficult async problems.)

–
David.

dhart · July 21, 2022, 5:50pm

For the optixAccelBuild(), that sounds different. I’m just thinking out loud… is the device pointer used for the memcpy also used in any preceding stream API calls? Is there any possibility that the host buffer could be paged out when the call is made? I’ll ask the CUDA team what reasons a cudaMemcpyAsync() call might stall or synchronize.

–
David.

dhart · July 21, 2022, 5:54pm

Perhaps this thread is relevant? Robert notes that multiple memory operations will serialize due to PCI rules, and the rest of the thread has some good hints too I think: cudaMemcpyAsync - #2 by Robert_Crovella

–
David.

Topic		Replies	Views
Optix 6.5 - interleaving CUDA kernels OptiX	2	765	October 12, 2021
OPTIX, memory allocated on the cpu side OptiX	5	1029	June 15, 2022
cudaStreamCreate() calls in OptixHello and OptixTriangle samples OptiX	1	430	January 16, 2023
Freeze on Sync after launch OptiX	9	1552	June 14, 2022
Issues running OptiX concurrently with a CUDA kernel that uses shared memory OptiX	2	588	December 11, 2023
optixAccelBuild sometimes takes very long OptiX ray-tracing , optix	5	1004	May 19, 2023
Non-blocking rendering and overlapped launching OptiX	2	507	June 14, 2022
Optix7.0: Could I use two streams for two optixLaunch operation in two threads for speed-optimize? OptiX	2	1053	June 14, 2022
4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour. CUDA Programming and Performance	5	13400	March 10, 2011
cudaDeviceSynchronize() call after optixLaunch(...) results in cudaErrorIllegalAddress OptiX	5	1384	December 14, 2022

Avoid synchronization in optixLaunch

Related topics