Confusion about __syncwarp() if all threads in a warp are automatically in sync?

Hi guys,

While reading the docs I came to the conclusion that threads in a warp are always synchronized, because the SM operates on warps in the SIMT model, i.e. it issues the same instruction concurrently to every thread in a warp.

So if we have a source

"
__global__ void f() {
    line_1;
    line_2;
}
"
which translates to machine code (SASS) as
"
IR_1
IR_2
...
IR_N
"

each thread in a warp will get the instruction IR_k at the same time.

So what’s the use of “__syncwarp()”? Where am I wrong?

I would suggest reading more of the docs, and all will be revealed :-) In particular, keep an eye out for the word “divergence”.

I did, and here is what it says under “4. Hardware Implementation; 4.1 SIMT Architecture”:

“Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.”

The first sentence says that a warp uses a single program counter, so at each clock cycle the same instruction is issued to the whole warp. So my question stands: how can threads be out of sync within a warp, i.e. what is __syncwarp() for?

Please help me, I don’t understand…

IIRC, __syncwarp() was introduced in CUDA 9, and it is for Volta and later architectures, not the pre-Volta ones described in the passage you quoted. :) On Volta+ each thread has its own program counter (“independent thread scheduling”), so the warp-wide lockstep you are describing is no longer guaranteed. You may be interested in this post: https://devblogs.nvidia.com/using-cuda-warp-level-primitives/
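To make “divergence” concrete, here is a minimal sketch (kernel name and constants are my own, not from any doc):

```cuda
__global__ void divergent(int *out) {
    unsigned lane = threadIdx.x % 32;   // lane index within the warp
    if (lane < 16)
        out[threadIdx.x] = 1;           // lanes 0-15 take this path
    else
        out[threadIdx.x] = 2;           // lanes 16-31 take this one
    // Pre-Volta: one PC plus an active mask; the two paths are serialized
    // and the warp reconverges at the end of the if/else.
    // Volta+: per-thread PCs, so the two paths may interleave and the
    // lockstep assumption no longer holds; explicit __syncwarp() is how
    // you reconverge the warp when you need it.
}
```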
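And a sketch of where __syncwarp() actually earns its keep: an intra-warp reduction through shared memory, assuming one warp (32 threads) per block. Names are mine; the pattern follows the blog post linked above. Pre-Volta, people often wrote this without any syncs (“implicit warp-synchronous programming”); on Volta+ that is a race.

```cuda
__global__ void warpReduce(const int *in, int *out) {
    __shared__ int s[32];
    unsigned lane = threadIdx.x;        // one warp per block assumed
    s[lane] = in[blockIdx.x * 32 + lane];
    __syncwarp();                       // make the stores visible warp-wide
    for (int offset = 16; offset > 0; offset /= 2) {
        int v = (lane < offset) ? s[lane + offset] : 0;
        __syncwarp();                   // all reads finish before any write
        if (lane < offset) s[lane] += v;
        __syncwarp();                   // all writes finish before next read
    }
    if (lane == 0) out[blockIdx.x] = s[0];
}
```

Without those __syncwarp() calls, a thread on Volta+ may read s[lane + offset] before its neighbor has written it, because the threads are no longer guaranteed to execute each line in lockstep.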