[…] applications cannot use shared memory, synchronization, barriers, or other SM-thread-specific programming constructs in their programs supplied to OptiX.
(The OptiX compiler will normally reject such forbidden constructs when it encounters them in PTX code.)
My team works on differentiable rendering, and such algorithms are very write-heavy. (In a GPU kernel, the derivative of a global memory read turns into a global memory scatter-add.) When performing many global memory atomics, warp-level primitives can be very useful: by doing a local reduction first, the number of global memory operations can be reduced by up to a factor of 32 (the warp size).
While looking into this, we noticed that the advice in the OptiX Programming Guide is perhaps stricter than it needs to be. For example, OptiX seems to accept __shfl_sync() and __match_any_sync() under one condition: the mask argument must exactly equal __activemask() (there is even a compiler error message pointing this out when a different mask is used).
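For reference, the kind of warp-local pre-reduction we are experimenting with looks roughly like this. This is only a sketch, not an OptiX-sanctioned API; it assumes all 32 lanes of the warp are active, that every lane targets the same address, and that the mask passed to the intrinsic is exactly __activemask():

```cuda
// Sketch: all active lanes accumulate into the same address, so the warp
// reduces locally and issues a single global atomic instead of 32.
// Assumes a fully active warp (butterfly shuffles need all lanes) and a
// mask equal to __activemask().
__device__ void warpAtomicAdd(float *address, float value)
{
    unsigned int mask = __activemask();

    // Butterfly reduction across the warp: after the loop, every lane
    // holds the warp-wide sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        value += __shfl_xor_sync(mask, value, offset);

    // Only one lane per warp performs the global atomic.
    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    if (lane == 0)
        atomicAdd(address, value);
}
```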
For these reasons, I was curious whether the documentation might be out of date, and what we are actually allowed to do in practice.
You’re right that OptiX doesn’t currently strictly enforce what the Programming Guide says. I believe the team intends to update this guidance with the next version of (SER-enabled) OptiX; however, I don’t know that the summary will change at all. There are multiple reasons we advise against using SM-level primitives, even outside of SER. One is that OptiX uses some SM-level intrinsics internally; that usage is proprietary and subject to change, so the same constructs in user programs can have unintended interactions and consequences. Another reason for the gap between what we guarantee in writing and what today’s compiler accepts is that we reserve the right to move threads at any time, including in the middle of shader execution, at points where it might not be happening today.
There’s nothing wrong with experimenting; just keep in mind that the guidance in the Programming Guide is usually more intentional and correct than what any given version of the compiler happens to allow. OptiX is intentionally designed to present a single-threaded programming model to the user, and is not designed for the warp-level and block-level constructs that CUDA provides. We appreciate hearing about cases where OptiX does something you didn’t expect. It is true that some warp-level intrinsics like shuffle-sync and warp-vote will technically work in certain cases today; I’m just not sure we can promise they will continue to work. So as you experiment, it might be a good idea to keep track of the areas of your code that deviate from our guidance, and what the alternatives are, in case they change behavior in the future.
Thanks for the quick response; that sounds great (and matches what we are doing at present). We noticed that warp-local reductions work fine in the raygen program but can produce incorrect results when used within a direct callable, so we only use them in the former.
One feature request that you might consider for a future OptiX version is a hypothetical operation:
template <typename T> void atomicAddReduce(T *address, T value);
This operation would try to do local reductions to minimize the number of atomic memory transactions while expanding to code that works with OptiX (including within callables).
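Such an operation might expand to something like the following sketch. This is hypothetical code, not an existing CUDA or OptiX API; grouping lanes by destination address via __match_any_sync is one possible strategy, and the mask passed to the *_sync intrinsics is assumed to be exactly __activemask():

```cuda
// Hypothetical expansion of atomicAddReduce for float: lanes are grouped
// by destination address, each group reduces its values via shuffles, and
// only the group leader issues the global atomic.
__device__ void atomicAddReduce(float *address, float value)
{
    unsigned int active = __activemask();

    // Partition the active lanes by target address: 'peers' has a bit set
    // for every lane writing to the same location as this one.
    unsigned int peers =
        __match_any_sync(active, reinterpret_cast<unsigned long long>(address));

    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    unsigned int leader = __ffs(peers) - 1;  // lowest lane in the group

    // Each lane sums the contributions of all its peers via shuffles.
    // O(group size) shuffles per lane; simple rather than optimal.
    float sum = value;
    unsigned int rest = peers & ~(1u << lane);
    while (rest) {
        unsigned int src = __ffs(rest) - 1;
        sum += __shfl_sync(active, value, src);
        rest &= rest - 1;  // clear the lowest set bit
    }

    // One global atomic per distinct address per warp, instead of one
    // atomic per lane.
    if (lane == leader)
        atomicAdd(address, sum);
}
```

The appeal of having this as a built-in is that OptiX could pick a safe expansion for each context (e.g., falling back to a plain atomicAdd inside callables, where warp-level tricks are not guaranteed to work).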
Atomic adds (compared to other atomic operations like minimum, maximum, subtraction, etc.) are particularly useful in a graphics context:
- they can be used to write to the film when a reconstruction filter is being used;
- as mentioned above, an atomic add is the derivative of a read (e.g., of a shader parameter). Because of that, any shader code, when differentiated, will generate a large number of atomic adds.
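To make the second point concrete, here is a minimal forward/reverse pair (hypothetical names, just for illustration) showing how differentiating a parameter read yields a scatter-add:

```cuda
// Forward pass: a gather. Many threads may read the same parameter slot.
__device__ float shadeFwd(const float *albedo, int i, float weight)
{
    return albedo[i] * weight;
}

// Reverse pass: the adjoint of that read is a scatter-add. Because many
// threads may have read albedo[i], their gradient contributions must be
// accumulated atomically into the gradient buffer.
__device__ void shadeBwd(float *d_albedo, int i, float weight, float d_out)
{
    atomicAdd(&d_albedo[i], d_out * weight);
}
```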
While it is possible to perform the scatter-adds outside of OptiX in a separate kernel, this comes with significant extra cost, especially in the uncontended case (and it is not always easy to know ahead of time whether a particular memory address is contended).