Strange __syncthreads behavior

Hi,

I’m observing strange __syncthreads() behavior. Can anyone tell me what’s going on?
I thought the second __syncthreads() (line 6 of the kernel, executed only by thread 0) should block until all the other threads reach the last __syncthreads() in line 15.

__global__ void sync_test(void){
  printf("a\n");
 
  if(threadIdx.x == 0){
    __syncthreads();
    __syncthreads();
  } else {
    __syncthreads();
  }
 
  printf("b\n");
 
  if(threadIdx.x != 0){
    printf("c\n");
    __syncthreads();
  }
}

int main(void) {
  sync_test<<<1, 4>>>();
  cudaDeviceSynchronize();
  return 0;
}

How can I get the desired output?

prints:
a
a
a
a
b
b
b
b
c
c
c
 
desired:
a
a
a
a
b
b
b
c
c
c
b

Thanks!

From the CUDA C Programming Guide:

“__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.”

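Your conditionals do not evaluate identically across the block, so the behavior is undefined. With only 4 threads everything runs in a single warp, which is probably why the kernel happens to finish instead of hanging. If you execute your code with more than 32 threads (more than one warp), it is likely to deadlock on the __syncthreads() calls.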
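To illustrate the rule quoted above, here is a minimal sketch of the usual pattern: keep the conditional work inside the branch, but place the barrier outside it so that every thread in the block executes the same __syncthreads() call. The kernel name barrier_outside_branch and the shared variable value are made up for this example.

```cuda
#include <cstdio>

// Hypothetical kernel, for illustration only.
__global__ void barrier_outside_branch(void) {
    __shared__ int value;

    // Only thread 0 does the conditional work...
    if (threadIdx.x == 0) {
        value = 42;
    }

    // ...but the barrier is outside the branch, so all threads
    // of the block reach the same __syncthreads() call.
    __syncthreads();

    // After the barrier, every thread can safely read what thread 0 wrote.
    printf("thread %d sees %d\n", threadIdx.x, value);
}

int main(void) {
    // 64 threads = two warps; safe because the barrier is unconditional.
    barrier_outside_branch<<<1, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Because the barrier does not depend on threadIdx.x, this version should behave the same for any block size.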

From the CUDA user guide:

__syncthreads() is your garden-variety thread barrier. Any thread reaching the barrier waits until all of the other threads in that block also reach it.

Pay attention to the words “also reach it”.
Your program:

printf("a\n");

if(threadIdx.x == 0)
{
    __syncthreads(); // only thread 0 reaches this barrier
    __syncthreads(); // only thread 0 reaches this barrier
}
else
{
    __syncthreads(); // only threads 1, 2, 3 reach this barrier
}
printf("b\n");

if(threadIdx.x != 0)
{
    printf("c\n");
    __syncthreads(); // only threads 1, 2, 3 reach this barrier
}
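One way to get the ordering asked for in the question is to let every thread execute the same barriers and to put only the printf calls inside the conditionals. The following is a minimal sketch, not the only possible fix. The kernel name sync_test_fixed is made up here, and it assumes that lines from different threads appear in the printf output in the order enforced by the barriers, which is what one normally observes but is not something the barriers themselves promise.

```cuda
#include <cstdio>

// Hypothetical rewrite of sync_test: every thread hits every barrier.
__global__ void sync_test_fixed(void) {
    printf("a\n");
    __syncthreads();        // barrier 1: all threads have printed "a"

    if (threadIdx.x != 0) { // threads 1, 2, 3 print "b" first
        printf("b\n");
    }
    __syncthreads();        // barrier 2: reached by all threads

    if (threadIdx.x != 0) { // threads 1, 2, 3 print "c"
        printf("c\n");
    }
    __syncthreads();        // barrier 3: reached by all threads

    if (threadIdx.x == 0) { // thread 0 prints its "b" last
        printf("b\n");
    }
}

int main(void) {
    sync_test_fixed<<<1, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

The difference from the original kernel is that no __syncthreads() call sits inside a branch on threadIdx.x, so the number of barriers each thread executes is the same and the block size no longer matters.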