__shfl_down_sync reduction in OptiX 7.0

Can the shuffle functions be used in the camera ray function to reduce the results from a warp to a single color? I would like to use this technique, if possible, to get an entire warp to cast sample rays from a single pixel in order to keep the rays as coherent as possible. When I tried this, the code compiled but did not execute, so I suspect that the shuffle functions are not allowed in OptiX 7.0. The atomic operations are working and may be a workaround. Any feedback would be appreciated.

Hey there,

This is a good question from a CUDA perspective. The answer, however, is that it’s not recommended in OptiX to attempt any intra-warp synchronization or communication. The programming model intentionally provides a ‘single-thread’ view to your shader programs, and there’s no official way to map from an OptiX launch index to a warp.

“For efficiency and coherence reasons, the NVIDIA OptiX runtime—unlike CUDA kernels—allows the execution of one task, such as a single ray, to be moved at any point in time to a different lane/thread, warp or streaming multiprocessor (SM). Consequently, applications cannot use shared memory, synchronizations, barriers, or other SM-thread-specific programming constructs in their programs supplied to OptiX.”


Atomics don’t break these rules, so they are fair game, but depending on what you’re doing they can hurt your perf more than they help. I’d recommend keeping it very simple until you have evidence of a performance problem with incoherent traversal. A for loop in raygen generally works well for antialiasing, for example. If you’re certain that pixel-to-pixel coherence is an issue that tracing sub-pixel rays in parallel will solve, probably the first thing to try is rendering a super-sampled image and then doing your reduction later in a CUDA kernel. If that wouldn’t work in your case, we can certainly discuss specifics and gather some more advanced recommendations.
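
To make the super-sampling suggestion a bit more concrete, here is a rough sketch of what the reduction step could look like as a plain CUDA kernel that runs after the OptiX launch. It assumes the launch wrote an (outW * s) x (outH * s) row-major float3 image and that a simple box filter is acceptable. The names downsampleBox and downsampleOnDevice, the buffer layout, and the 16x16 block size are all made up for illustration and are not part of the OptiX API. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Box-filter reduction: each thread averages one s x s block of sub-pixel
// samples from the super-sampled image into a single output pixel.
// Assumed layout: row-major float3, super-sampled width = outW * s.
__global__ void downsampleBox(const float3* hi, float3* lo,
                              int outW, int outH, int s)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= outW || y >= outH) return;

    const int hiW = outW * s;
    float3 sum = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < s; ++j) {
        for (int i = 0; i < s; ++i) {
            const float3 c = hi[(y * s + j) * hiW + (x * s + i)];
            sum.x += c.x;
            sum.y += c.y;
            sum.z += c.z;
        }
    }
    const float inv = 1.0f / float(s * s);
    lo[y * outW + x] = make_float3(sum.x * inv, sum.y * inv, sum.z * inv);
}

// Hypothetical host wrapper so the kernel can be tried in isolation.
// In a real renderer the super-sampled buffer would already live on the
// device (it is what the OptiX launch wrote), so the upload is not needed.
std::vector<float3> downsampleOnDevice(const std::vector<float3>& hi,
                                       int outW, int outH, int s)
{
    std::vector<float3> lo(size_t(outW) * size_t(outH));
    float3* dHi = nullptr;
    float3* dLo = nullptr;
    cudaMalloc(&dHi, hi.size() * sizeof(float3));
    cudaMalloc(&dLo, lo.size() * sizeof(float3));
    cudaMemcpy(dHi, hi.data(), hi.size() * sizeof(float3),
               cudaMemcpyHostToDevice);

    const dim3 block(16, 16);
    const dim3 grid((outW + block.x - 1) / block.x,
                    (outH + block.y - 1) / block.y);
    downsampleBox<<<grid, block>>>(dHi, dLo, outW, outH, s);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(lo.data(), dLo, lo.size() * sizeof(float3),
               cudaMemcpyDeviceToHost);
    cudaFree(dHi);
    cudaFree(dLo);
    return lo;
}
```

With s = 2, for example, a 4x4 super-sampled image reduces to a 2x2 result with four samples averaged per pixel. Because this is an ordinary CUDA kernel rather than an OptiX program, the usual CUDA tools (shared memory, warp shuffles) are available here if the reduction ever needs to be faster.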


Thanks! That is pretty much what I expected. I will test with various configurations and see what is fastest. OptiX is a bit of a black box, but I can imagine how rays that terminate sooner could be regrouped so that they all complete together, but in a different warp configuration than the one they were launched in.