Why Does __shfl_down_sync Work with Decreasing Offsets but Deadlock with Increasing Offsets?

I have a question about __shfl_down_sync. Does it only work when offsets decrease (e.g., 2 → 1)? If I use increasing offsets (e.g., 1 → 2), it seems to deadlock.

For example:

Method 1 (increasing offsets):

  • Thread 1 transfers to thread 0, 2 transfers to 1, and 3 transfers to 2 (offset = 1);
  • Then, 2 transfers to 0, and 3 transfers to 1 (offset = 2).

Method 2 (decreasing offsets):

  • Thread 2 transfers to thread 0, and 3 transfers to 1 (offset = 2);
  • Then, 1 transfers to 0, 2 transfers to 1, and 3 transfers to 2 (offset = 1).

Why does Method 1 deadlock, while Method 2 works?

=========/////////////////
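The code is something like this (the decreasing-offset version, Method 2):

// accum holds this thread's partial sum (a packed value, combined with __hadd2).
// The mask 0xffffffff names all 32 lanes of the warp.
for (int offset = 2; offset > 0; offset /= 2) {
    accum = __hadd2(accum, __shfl_down_sync(0xffffffff, accum, offset));
}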

=========/////////////////
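
For comparison, the two orderings from Method 1 and Method 2 can be written as one self-contained kernel. This is only a minimal sketch under stated assumptions: it uses float instead of the packed type from the snippet above, it launches exactly one warp so that every lane named in 0xffffffff reaches the shuffle, and the names reduce4 and run_reduce4 are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each lane starts with its own lane index as its value.
// 'increasing' is a kernel argument, so the branch is uniform across the warp
// and all 32 lanes execute the same shuffle with the same mask.
__global__ void reduce4(float *out, bool increasing) {
    int lane = threadIdx.x % 32;
    float accum = static_cast<float>(lane);

    if (increasing) {
        // Method 1: offset = 1, then offset = 2.
        for (int offset = 1; offset <= 2; offset *= 2)
            accum += __shfl_down_sync(0xffffffffu, accum, offset);
    } else {
        // Method 2: offset = 2, then offset = 1.
        for (int offset = 2; offset > 0; offset /= 2)
            accum += __shfl_down_sync(0xffffffffu, accum, offset);
    }

    // Lane 0 now holds the sum of the values from lanes 0..3.
    if (lane == 0) *out = accum;
}

// Hypothetical host helper: launches one full warp and returns lane 0's result.
float run_reduce4(bool increasing) {
    float *d_out = nullptr;
    float h_out = -1.0f;
    cudaMalloc(&d_out, sizeof(float));
    reduce4<<<1, 32>>>(d_out, increasing);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Calling run_reduce4(true) and run_reduce4(false) should both return 6 (0 + 1 + 2 + 3) when all 32 lanes take part, which gives a baseline to compare against the kernel that hangs.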

Is it possible that the last few threads of the warp are inactive when __shfl_down_sync is called, and that an increasing offset ends up reading from these inactive threads, causing the program to deadlock? (Sorry, this is just a vague guess.)
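
One way to probe this guess is to make the set of participating lanes explicit. The CUDA documentation requires every non-exited thread named in the mask to execute the same intrinsic with the same mask, so a mask of 0xffffffff is only valid if all 32 lanes reach the call. The sketch below assumes that only lanes 0..3 take part; the names kValid, masked_reduce, MaskedResult and run_masked_reduce are hypothetical, and float again stands in for the packed type. The mask is built with __ballot_sync while the warp is still converged, and the width argument keeps each shuffle inside the 4-lane segment.

```cuda
#include <cuda_runtime.h>

constexpr int kValid = 4;  // hypothetical: only lanes 0..3 take part

__global__ void masked_reduce(float *sum_out, unsigned *mask_out) {
    int lane = threadIdx.x % 32;
    bool takes_part = lane < kValid;

    // All 32 lanes execute this ballot, so the full mask is valid here.
    // The result has one bit set for each lane that will enter the branch.
    unsigned mask = __ballot_sync(0xffffffffu, takes_part);

    if (takes_part) {
        float accum = static_cast<float>(lane);
        // Increasing offsets (1, then 2). With width = kValid, a source lane
        // outside the 4-lane segment returns the caller's own value instead
        // of reading from a lane that is not in the mask.
        for (int offset = 1; offset < kValid; offset *= 2)
            accum += __shfl_down_sync(mask, accum, offset, kValid);

        if (lane == 0) {
            *sum_out = accum;
            *mask_out = mask;
        }
    }
}

struct MaskedResult {
    float sum;
    unsigned mask;
};

// Hypothetical host helper: launches one warp and copies back both results.
MaskedResult run_masked_reduce() {
    MaskedResult h{-1.0f, 0u};
    float *d_sum = nullptr;
    unsigned *d_mask = nullptr;
    cudaMalloc(&d_sum, sizeof(float));
    cudaMalloc(&d_mask, sizeof(unsigned));
    masked_reduce<<<1, 32>>>(d_sum, d_mask);
    cudaMemcpy(&h.sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h.mask, d_mask, sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    cudaFree(d_mask);
    return h;
}
```

If the real kernel calls the shuffle from inside a similar branch while still passing 0xffffffff, the lanes that skipped the branch are named in the mask but never arrive, which would be consistent with the hang described above regardless of the offset order.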