A stupid question about the __syncthreads() function

Hi to everyone,

I have a stupid question related to the use of the __syncthreads() function and to some CUDA example code that uses it.

I know that if I use __syncthreads() in a warp that diverges, the kernel can deadlock. But in some CUDA examples, when people explain how to write good code when the dataset size isn't a multiple of 32, I read this kind of code:

//dataset 1M

// Id of thread in the block.
int localId = threadIdx.x;
// The first index of output element computed by this block.
int startOfBlock = blockIdx.x * blockDim.x;
// The Id of the thread in the scope of the grid.
int globalId = localId + startOfBlock;

if (globalId < dataset){
    //do something and use shared memory
    __syncthreads();
    //do something
    ...

}
return;

Also in this case threads in a warp can diverge! But this kind of code works fine and the final result is correct! Why? Is it related to a particular CUDA version or GPU compute capability?

In another example I read this:

//dataset 1M

// Id of thread in the block.
int localId = threadIdx.x;
// The first index of output element computed by this block.
int startOfBlock = blockIdx.x * blockDim.x;
// The Id of the thread in the scope of the grid.
int globalId = localId + startOfBlock;

if (globalId >= dataset)
    return;

//do something and use shared memory
__syncthreads();
//do something
...
return;

I think this example is better than the previous one because threads that do not meet the condition exit early. But even in this case, __syncthreads() is not called by all 32 threads in a warp.

Can someone explain the __syncthreads() behaviour to me?

Thanks a lot.

The first example strikes me as obviously wrong. Here’s an attempt at fixing it.

if (globalId < dataset){
    //do something and use shared memory
}
__syncthreads();
if (globalId < dataset){
    //do something
}
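To make the hoisted-barrier pattern concrete, here is a minimal self-contained kernel sketch. The kernel name, block size, and the "stage a tile, then combine neighbours" work are all made up for illustration; the only point being made is the barrier placement outside the conditional.

```cuda
// Sketch of the safe pattern: every thread in the block, including
// out-of-range ones, reaches __syncthreads(); only the in-range
// threads do the actual work before and after it.
__global__ void processKernel(const float *in, float *out, int dataset)
{
    __shared__ float tile[256];        // assumes blockDim.x == 256

    int localId  = threadIdx.x;
    int globalId = blockIdx.x * blockDim.x + localId;

    // Phase 1: stage data into shared memory (in-range threads only).
    if (globalId < dataset)
        tile[localId] = in[globalId];
    else
        tile[localId] = 0.0f;          // pad so later reads are defined

    __syncthreads();                   // reached by EVERY thread in the block

    // Phase 2: use the staged data (in-range threads only).
    if (globalId < dataset) {
        int next = (localId + 1 < blockDim.x) ? localId + 1 : localId;
        out[globalId] = tile[localId] + tile[next];
    }
}
```

Launched with, say, `processKernel<<<(dataset + 255) / 256, 256>>>(d_in, d_out, dataset);`, the last block is partially out of range, yet no thread skips the barrier.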

I don’t know about the 2nd case, really.

You could run either case (with fault-triggering dataset sizes, i.e. not a multiple of 32 and not a multiple of the block size either) through the synccheck tool of cuda-memcheck.

If it reports a barrier error, you have a strong indication of possibly misbehaving code (if not on your current GPU hardware, then maybe on future architectures).

Both cases are potentially incorrect. Depending on the array size, they can prevent certain threads from reaching the barrier.

The programming guide says:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

It doesn’t say “it is guaranteed to hang”. It doesn’t say “it will always hang”.

So just because code doesn't deadlock or hang does not mean it is correct.


The CUDA C programming guide has various little gems sprinkled throughout it. I wish I had it all memorized, but alas, I don’t. However item 3 here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x

is relevant to this issue, and may help to explain why sometimes __syncthreads() seems to be OK in non-full-warp situations.

Sorry to resurrect this topic, but I have the following question.

In this code, __syncthreads() is used inside an if statement. Supposing the matrix size is not a multiple of 32 (the warp size), eventually the "potentially incorrect" scenario would occur. However, the code runs just fine. Why is that?

__syncthreads() isn’t guaranteed to cause trouble if it is used incorrectly. It may cause trouble. From here:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

Note usage of the word “likely”. That is not an absolute.

If you read the message immediately prior to yours, and follow that link and read item 3 as indicated, it says this:

In certain cases, this allowed a barrier to succeed without being executed by every thread as long as at least some thread in every warp reached the barrier.

That's the best I can offer for a pre-Volta GPU. It certainly allows for the possibility that the behavior might be OK. Not something you would want to actually code for, or rely on.

For the Volta and beyond case, continuing in item 3:

Starting with Volta, the CUDA built-in __syncthreads() and PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block.

So there seems to be a special case for exited threads. My own observation is that on Volta, a barrier will succeed if it is reached by all threads in the block, or by all threads that have not exited. These test cases are not particularly difficult to construct yourself.
This is just an observation. However, notice the very next sentence continuing in item 3:

Code exploiting the previous behavior will likely deadlock and must be modified to ensure that all non-exited threads reach the barrier.

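A sketch of the kind of test case alluded to above. The kernel and array names are made up for illustration; per the quoted wording, on Volta and later the barrier should succeed once all non-exited threads have reached it, so the early-return pattern from the second example may appear to work there. That is an observation to test, not a guarantee to rely on.

```cuda
// Some threads exit before the barrier; the rest synchronize.
// On Volta+, __syncthreads() is documented to wait only for
// non-exited threads, so this does not necessarily hang.
__global__ void earlyExitBarrier(float *data, int n)
{
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;

    if (globalId >= n)
        return;                 // these threads have exited the kernel

    data[globalId] *= 2.0f;
    __syncthreads();            // reached only by non-exited threads
    data[globalId] += 1.0f;
}
```

Running such a kernel with n not a multiple of the block size, under cuda-memcheck's synccheck tool, is an easy way to see how your particular GPU treats it.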
Now that we have covered that material linked in this thread, can you speculate as to an answer to your question?

I haven't studied the code carefully, and obviously the GPU you are running on might matter, but I would say you are either in the pre-Volta twilight zone, or, if on a Volta or later GPU, the exception that allows exited threads to complete a barrier may be what is happening. It certainly looks like your question revolves around exited threads, anyway.