What happens when I call __syncthreads() in a warp group?

Here is the code:

I launch 256 threads in a block, and it’s divided into 2 warp groups.

    int warp_group_idx = cutlass::canonical_warp_group_idx();


    if (warp_group_idx == 0) { 
           // do some thing
          __sync_threads();
    } else {
          // do others
    }

Is it legal to use like this?

Is there any documentation about this?

In my knowledge, it’s illegal and will cause dead lock.

But I see it used in Flash MLA: FlashMLA/csrc/flash_fwd_mla_kernel.h at b549289fb4ce2a0a4a1d5e9e615a17a2e09f3286 · deepseek-ai/FlashMLA · GitHub

That seems to be suspicious.
Could it correspond to the __syncthreads() in line 400, which also is inside a loop from n_block to n_block_min?

Thank you! Seems that’s the fact.

Can __syncthreads() be used in this situation? I have never seen some documentation about that.

if(x > 0) {
__syncthreads();
} else {
__syncthreads();
}

In this code, the two __syncthreads()are the same synchronization point? Why is that?

I would avoid this, if possible.
__syncthreads() is translated into bar.sync instructions.
They are available as .aligned and non-aligned variants.

The .aligned ones have to be a single instruction, the non-aligned ones can have the same instruction at multiple locations for the still running threads of a warp.

But at the same time the C++ compiler has to know that those two __syncthreads() instructions are executed together.

Let’s avoid it.
For example replace

    if (b) {
        T1();
        __syncthreads();
        T2();
    } else {
        F1();
        __syncthreads();
        F2();
    }

with

    if (b) {
        T1();
    } else {
        F1();
    }
    __syncthreads();
    if (b) {
        T2();
    } else {
        F2();
    }

Wow, I see that. So how can the compiler know they correspond to each other?

Does it search __syncthreads in the other branch, and relate them together?

I do not believe, it fully can. So it would perhaps always translate to the non-aligned variant.

The construct is quite brittle, if not managed well by the programmer.

Whether it works, can be seen at runtime.

The compiler has to make sure to keep the instruction and the order of instructions.

1 Like