__shfl_down_sync reduction in OptiX 7.0

grangerfx1nl45 · May 13, 2020, 10:18pm

Can the shuffle functions be used in the camera ray function to reduce the results from a warp to a single color? I would like to use this technique, if possible, to have an entire warp cast sample rays for a single pixel in order to keep the rays as coherent as possible. When I tried this, the code compiled but did not execute, so I suspect that the shuffle functions are not allowed in OptiX 7.0. The atomic operations are working and may be a workaround. Any feedback would be appreciated.
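For readers unfamiliar with the atomic workaround mentioned above, here is a minimal sketch in plain CUDA rather than OptiX. It assumes one thread per (pixel, sample) pair and a single float accumulator per pixel. The names `shadeSample`, `accumulateSamples`, and `renderWithAtomics` are hypothetical and not from the original post; in an OptiX program the `atomicAdd` call would presumably sit in the ray generation program, where `shadeSample` would be replaced by an actual trace.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for a per-sample result. A real ray generation
// program would trace a ray here; a constant keeps the sketch checkable.
__device__ float shadeSample(int pixel, int sample)
{
    return 1.0f;
}

// One thread per (pixel, sample) pair. Each thread adds its weighted
// contribution to the pixel accumulator with atomicAdd, so no thread
// needs to know which warp or lane it is running in.
__global__ void accumulateSamples(float* accum, int numPixels, int samplesPerPixel)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numPixels * samplesPerPixel) return;

    int pixel  = idx / samplesPerPixel;
    int sample = idx % samplesPerPixel;
    atomicAdd(&accum[pixel], shadeSample(pixel, sample) / samplesPerPixel);
}

// Host helper (assumed for illustration): zero the accumulators, launch
// the kernel, and copy the averaged result back.
std::vector<float> renderWithAtomics(int numPixels, int samplesPerPixel)
{
    float* dAccum = nullptr;
    cudaMalloc(&dAccum, numPixels * sizeof(float));
    cudaMemset(dAccum, 0, numPixels * sizeof(float));

    int total = numPixels * samplesPerPixel;
    int block = 128;
    int grid  = (total + block - 1) / block;
    accumulateSamples<<<grid, block>>>(dAccum, numPixels, samplesPerPixel);

    std::vector<float> out(numPixels);
    cudaMemcpy(out.data(), dAccum, numPixels * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dAccum);
    return out;
}
```

Every sample for a pixel contends for the same address, so whether this is faster than a simple per-thread loop would have to be measured.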

dhart · May 13, 2020, 11:11pm

Hey there,

This is a good question from a CUDA perspective. The answer, however, is that it’s not recommended in OptiX to attempt any intra-warp synchronization or communication. The programming model intentionally provides a ‘single-thread’ view to your shader programs, and there’s no official way to map from an OptiX launch index to a warp.

“For efficiency and coherence reasons, the NVIDIA OptiX runtime—unlike CUDA kernels—allows the execution of one task, such as a single ray, to be moved at any point in time to a different lane/thread, warp or streaming multiprocessor (SM). Consequently, applications cannot use shared memory, synchronizations, barriers, or other SM-thread-specific programming constructs in their programs supplied to OptiX.”

https://raytracing-docs.nvidia.com/optix7/guide/index.html#introduction#general-description

Atomics don’t break these rules, so they are fair game, but depending on what you’re doing they can hurt your perf more than help. I’d recommend keeping it very simple until you have evidence of a performance problem with incoherent traversal. A for loop in raygen generally works well for antialiasing, for example. If you’re certain that pixel-to-pixel coherence is an issue that tracing sub-pixel rays in parallel will solve, probably the first thing to try is rendering a super-sampled image and then doing your reduction afterwards in a separate CUDA kernel. If that wouldn’t work in your case, we can certainly discuss specifics and gather some more advanced recommendations.
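As a rough illustration of the super-sampling suggestion, the reduction step might look like the sketch below. It assumes the OptiX launch has already written a buffer that is `ss` times larger in each dimension than the final image, stored row-major as `float3`. The names `downsampleKernel` and `downsample`, and the box-filter average, are assumptions made for this example and not part of the OptiX API.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per output pixel: average the ss x ss block of supersamples
// that covers it. This runs as an ordinary CUDA kernel after the OptiX
// launch, so normal CUDA rules apply here.
__global__ void downsampleKernel(const float3* hi, float3* lo,
                                 int width, int height, int ss)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int hiWidth = width * ss;  // row length of the supersampled image
    float3 sum = make_float3(0.0f, 0.0f, 0.0f);
    for (int sy = 0; sy < ss; ++sy) {
        for (int sx = 0; sx < ss; ++sx) {
            float3 c = hi[(y * ss + sy) * hiWidth + (x * ss + sx)];
            sum.x += c.x;
            sum.y += c.y;
            sum.z += c.z;
        }
    }
    float inv = 1.0f / (ss * ss);
    lo[y * width + x] = make_float3(sum.x * inv, sum.y * inv, sum.z * inv);
}

// Host helper (assumed for illustration): upload a supersampled image,
// reduce it on the GPU, and return the final width x height image.
std::vector<float3> downsample(const std::vector<float3>& hi,
                               int width, int height, int ss)
{
    size_t hiBytes = hi.size() * sizeof(float3);
    size_t loBytes = size_t(width) * height * sizeof(float3);
    float3* dHi = nullptr;
    float3* dLo = nullptr;
    cudaMalloc(&dHi, hiBytes);
    cudaMalloc(&dLo, loBytes);
    cudaMemcpy(dHi, hi.data(), hiBytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    downsampleKernel<<<grid, block>>>(dHi, dLo, width, height, ss);

    std::vector<float3> lo(size_t(width) * height);
    cudaMemcpy(lo.data(), dLo, loBytes, cudaMemcpyDeviceToHost);
    cudaFree(dHi);
    cudaFree(dLo);
    return lo;
}
```

In a real renderer the supersampled buffer would already live in device memory, so the upload in `downsample` would be unnecessary; it is only there to keep the sketch self-contained.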

–
David.

grangerfx1nl45 · May 13, 2020, 11:45pm

Thanks! That is pretty much what I expected. I will test with various configurations and see what is fastest. OptiX is a bit of a black box but I can imagine how rays terminating sooner could be combined so that they all complete together but in a different warp configuration compared to when they were launched.
