Incorrect use of syncthreads

Hello all,

I have a kernel like this:

int myidx = threadIdx.x + blockIdx.x*blockDim.x;

if ( myidx.x < limit ) {

 read_from_memory();

  calculations...

  first_write_to_memory();

  __syncthreads();

  second_write_to_memory();

}

else {

 __syncthreads();

}

But if I run the code in emulation mode, it reports an incorrect usage of syncthreads. If I understand it right, all threads always reach one syncthreads. Any insight what the problem may be?

Thanks.

If 1 thread reaches a syncthreads, all threads need to reach that same syncthreads.

Thanks.

Changing the code to:

int myidx = threadIdx.x + blockIdx.x*blockDim.x;

if ( myidx < limit ) {

  read_from_memory();

  calculations...

  first_write_to_memory();

}

__syncthreads();

if ( myidx < limit ) {

    second_write_to_memory();

}

Fixed the problem.

Mmm, just something I can’t get off my head. If I run the first code in the device, it doesn’t create a deadlock.
Is the emulation error just a kind of a warning and the actual implementation of syncthreads allows that kind of synchronization?

[edit: typo]

I think so. I coded in the same way that different thread may reach different syncthreads and it worked on the device. So I guess it’s just kind of warning.

I would like to get this clear, as the documentation kinda allows for both interpretations.

So, could someone please clarify which of these 2 is the correct one?

A. All threads in a warp MUST reach the same syncthreads. It may work with current hardware, but it isn’t guaranteed to work in the future.

or

B. All threads in a warp just need to reach the same amount of syncthreads. Error in emulation mode can be considered a warning.

I insist with this because option B generally makes coding certain things easier, and if future devices will also allow it, I would like to continue coding in that way.

Thanks.

The Ptx documenation states clearly that the bar instructions which presumably implement syncthreads waits until all threads of a CTA(thread block) reach that same point. In device mode, does the kernel merely fail to crash or does it actually yield valid results?

I’ve heard from some others that the current hardware is “forgiving” and will usually work, regardless of which __syncthreads() is reached.

Given what the ptx documenation says, I wouldn’t be surprised if future hardware enforces this requirement.

But if I arrange the code as follows, right?

global void test_kernek(unsigned int num,unsigned int blocknum)

{

 int bid=blockIdx.x;

 if(num%blocknum==bid)

 {

       //do something

      __syncthreads();

      //do  something

   }

 else 

 {

      //do something

 } 

}

That’s is: if one thread reaches a _syncthreads(),the others are also reach at the same _syncthreads();if no thread can reach the _syncthreads(),the other will not too.i.e. all the threads follow the same routine.Am I right?