Illegal instruction (error 715) with H100

I’ll try this, too.

Or actually you would need to have the receiving and the sending thread participate.

float rcv = __shfl_down_sync(0xFFFFFFFF, val, 1);
if (threadIdx.x < 2)
   val += rcv;

or

if (threadIdx.x < 4)
    val += threadIdx.x < 2 ? __shfl_down_sync(0xF, val, 1, 4) : 0;

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.

Either the whole program is UB or only the value is undefined?

Updating to the 560 driver and 12.6 toolkit has resolved the issue.

Thanks for everyone’s help.

Even if the error does not come up any more, not fulfilling the requirements stated in the programming manual, which lanes should or should not participate depending on mask and width parameters, is sub-optimal. If it is UB, it could give the wrong results any time or with the next toolkit release. An error is the better alternative.
If I understand the thread correctly, even some sample programs do it wrongly?
Or they use the loophole that receiving from not participating lanes gives just an undetermined value and the val variable of those receiving threads is not used anyway afterwards?

I don’t disagree.

The problem is that the language is contradictory. And note, btw, “undefined” is not the same as “illegal instruction.” The latter indicates the instruction generated by the compiler is incorrect whereas the former indicates the behavior is at issue.

Furthermore, the language did change from 11.8 to 12.0, indicating something must have changed conceptually. The mask is essentially useless with this interpretation of the language. I think Nvidia needs to clarify.

EDIT: Swapped former and latter above.

The more I read this “or the result is undefined” the more I think that means “for threads that don’t participate.” It’s not saying all of the results are undefined, just the threads that don’t have their bit set in the mask. The word result is used, not results. There are 32 results per warp. Threads that have their bit set in the mask each still have a valid result.

For the data to be defined the executing thread itself and the thread, from which the data is coming from, have to be set in the mask.

T shfl_down_sync(mask, var, delta, width)
{
    lane = currentLane();
    pos = lane & (width - 1);
    group = lane - pos;
    if (!(mask & (1 << lane)))
        return undefined; // or UB?
    from = pos + delta;
    if (from >= width)
        return var;
    if (!(mask & (1 << (group +from))))
        return undefined; // or UB?
    return var_from[group + from];
}

Correct, though not undefined behavior, undefined result. The thread reduction masks off threads you don’t care about. The language doesn’t say all of the threads have undefined data, only the threads that don’t have a bit set.

For example:

val += __shfl_down_sync(0xFF, val, 4);

The only result that matters is from threads 0-3, which get their results from threads 4-7, all of which are set.

Edit: threads 4-7, not 5-8. My inner MATLAB is on display.