As additional documentation, there is also the description of the shfl PTX instruction in the PTX ISA 8.5 documentation.
There it is stated:
"The behavior of shfl.sync is undefined if the executing thread is not in the membermask."
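To make that constraint concrete, here is a minimal sketch of a warp sum that stays clear of the undefined case. It assumes the block size is a multiple of 32, so every warp is full, and it uses the full mask 0xffffffff. The names warp_sum_full, shfl_sum_kernel and shfl_sum_on_device are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Sums v across the 32 lanes of a warp. The full mask names every lane,
// so every executing lane is in the membermask. All 32 lanes must call this.
__device__ int warp_sum_full(int v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    return v;  // lane 0 holds the warp total
}

__global__ void shfl_sum_kernel(const int* in, int* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads do not return early. They contribute 0, so all
    // lanes named in the mask reach the shuffle together.
    int v = (tid < n) ? in[tid] : 0;
    v = warp_sum_full(v);
    if ((threadIdx.x & 31) == 0) {
        atomicAdd(out, v);  // one atomic per warp
    }
}

// Hypothetical host helper: returns the sum of h_in[0..n-1], assuming n > 0.
int shfl_sum_on_device(const int* h_in, int n) {
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    const int threads = 128;  // multiple of 32, so no partial warps
    const int blocks = (n + threads - 1) / threads;
    shfl_sum_kernel<<<blocks, threads>>>(d_in, d_out, n);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The design choice here is to pad inactive elements with 0 instead of letting threads exit before the shuffle, which keeps the mask and the set of executing lanes identical.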
I have nothing more to add regarding the ambiguous documentation.
If the documentation appears to be unclear, you can always open a bug ticket asking for clarification and/or suggesting changes. See "How to report a bug".
To report a compiler defect, you will typically be asked to provide a self-contained reproducer that can be compiled and executed.
Practically speaking, I would not bother writing my own parallel reduction unless it is for educational purposes. Parallel reduction is a well-studied algorithm, and an implementation is provided by CUB, which is part of the CUDA Toolkit.
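As a rough illustration of what relying on CUB can look like, the sketch below sums an array with cub::BlockReduce and adds the per-block results on the host. The kernel name block_sum_kernel, the helper cub_block_sum and the block size of 128 are assumptions made for this example, and error checking is omitted for brevity.

```cuda
#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockThreads = 128;  // assumed block size for this sketch

__global__ void block_sum_kernel(const int* in, int* block_sums, int n) {
    using BlockReduce = cub::BlockReduce<int, kBlockThreads>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int tid = blockIdx.x * kBlockThreads + threadIdx.x;
    int v = (tid < n) ? in[tid] : 0;  // pad out-of-range threads with 0

    // The returned aggregate is only valid in thread 0 of the block.
    int sum = BlockReduce(temp_storage).Sum(v);
    if (threadIdx.x == 0) {
        block_sums[blockIdx.x] = sum;
    }
}

// Hypothetical host helper: returns the sum of h_in[0..n-1], assuming n > 0.
int cub_block_sum(const int* h_in, int n) {
    const int blocks = (n + kBlockThreads - 1) / kBlockThreads;
    int* d_in = nullptr;
    int* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum_kernel<<<blocks, kBlockThreads>>>(d_in, d_partial, n);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    int total = 0;
    for (int p : partial) total += p;  // finish the reduction on the host
    return total;
}
```

No shuffle masks appear in user code here, since the library handles the intra-warp details internally.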
CUB has warp-wide, block-wide, and device-wide reductions. Documentation of the block-wide API can be found here: