Illegal instruction (error 715) with H100

For additional documentation, see the description of the shfl.sync instruction in the PTX ISA 8.5 documentation.

There it is stated:

The behavior of shfl.sync is undefined if the executing thread is not in the membermask
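To make that rule concrete, here is a minimal sketch of a warp-level sum in which every thread that executes the shuffle is also named in the mask. The kernel and helper names (`warp_sum_kernel`, `warp_sum_of_ones`) are made up for illustration, and the code assumes a 1D block whose size is a multiple of 32. It is not the code from the original question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned kFullMask = 0xffffffffu;

// Hypothetical kernel: sums n ints into *out using warp shuffles.
// Assumes blockDim.x is a multiple of 32 and the block is 1D.
__global__ void warp_sum_kernel(const int* in, int* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // No early return for tid >= n: out-of-range threads contribute 0
    // instead, so all 32 lanes of every warp reach the shuffle below.
    // Returning early while still passing the full mask would name
    // lanes in the mask that never execute the instruction.
    int v = (tid < n) ? in[tid] : 0;

    // Tree reduction within the warp; afterwards lane 0 holds the warp sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(kFullMask, v, offset);

    // Lane 0 of each warp adds its partial sum to the global result.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);
}

// Hypothetical host helper: sums an array of n ones on the GPU.
int warp_sum_of_ones(int n)
{
    std::vector<int> h(n, 1);
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    const int threads = 128;                       // multiple of 32
    const int blocks = (n + threads - 1) / threads;
    warp_sum_kernel<<<blocks, threads>>>(d_in, d_out, n);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main()
{
    printf("sum = %d\n", warp_sum_of_ones(100));
    return 0;
}
```

The design choice here is to keep whole warps converged and pad with zeros rather than to compute a partial mask. If only some lanes can take part, the mask has to be computed to match exactly the set of threads that will execute the shuffle.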

I do not have more to add regarding ambiguous documentation.

If the documentation appears unclear, you can always open a bug ticket asking for clarification and/or suggesting changes; see "How to report a bug".
To report a compiler defect, you will typically be asked to provide a self-contained reproducer that can be compiled and executed.

Practically speaking, I personally would not bother writing my own parallel reduction unless it is for educational purposes. Parallel reduction is a well-studied algorithm and is provided by CUB, which is part of the CUDA toolkit.
CUB offers warp-wide, block-wide, and device-wide reductions. Documentation of the block-wide API can be found here:

https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockReduce.html#_CPPv4I0_i_20BlockReduceAlgorithm_i_i_iEN3cub11BlockReduceE
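
As a rough illustration of the block-wide API, the sketch below sums an array with `cub::BlockReduce`. The kernel and helper names (`block_sum_kernel`, `block_sum_of_ones`) and the block size of 256 are assumptions chosen for this example. It relies on the CUB headers that ship with the CUDA toolkit.

```cuda
#include <cstdio>
#include <vector>
#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>

constexpr int kBlockThreads = 256;  // must match the launch configuration

// Hypothetical kernel: each block reduces its slice with cub::BlockReduce,
// then thread 0 adds the block's partial sum to the global result.
__global__ void block_sum_kernel(const int* in, int* out, int n)
{
    using BlockReduce = cub::BlockReduce<int, kBlockThreads>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int tid = blockIdx.x * kBlockThreads + threadIdx.x;

    // Every thread of the block must call Sum(), so out-of-range threads
    // pass 0 instead of returning early.
    int v = (tid < n) ? in[tid] : 0;

    // The returned aggregate is only valid in thread 0 of the block.
    int block_sum = BlockReduce(temp_storage).Sum(v);

    if (threadIdx.x == 0)
        atomicAdd(out, block_sum);
}

// Hypothetical host helper: sums an array of n ones on the GPU.
int block_sum_of_ones(int n)
{
    std::vector<int> h(n, 1);
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    const int blocks = (n + kBlockThreads - 1) / kBlockThreads;
    block_sum_kernel<<<blocks, kBlockThreads>>>(d_in, d_out, n);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main()
{
    printf("sum = %d\n", block_sum_of_ones(1000));
    return 0;
}
```

With this approach the shuffle masks and synchronization are handled inside the library, so there is no hand-written shfl.sync code to get wrong.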