Warps not executing in lock-step

Hello!
I am running CUDA code on an RTX 3090 GPU. I am trying to find the sum of the integers in an array using a full binary tree. I structured the binary tree the same way one would structure a max-heap with 1-indexing (I simply waste the element at index 0). In the code pasted below, the threads are organized linearly, with block dimensions of dim3(256, 1, 1).

In the commented-out section, I had 8 steps of upsweep with a __syncthreads() between each step. This worked fine across many different inputs (over 4 GB of input). However, I thought that for the first 5 steps of upsweep I probably wouldn't need a __syncthreads(), since those operations execute in lock-step within a warp. So I separated the for loop into two loops: one that performs the first 5 steps of upsweep in lock-step, and one that performs the last 3 steps with __syncthreads() in between. However, this fails for some reason. Could somebody explain what is going on here?

#define threads 256
#define logthreads 8
__global__ void upsweep(int *arr, int *d_n, int *ws)
{
    __shared__ int heap[2 * threads];
    int left, right, pow2, parent_index;
    heap[threadIdx.x] = 0;
    heap[threadIdx.x + threads] = arr[blockIdx.x * threads + threadIdx.x];
    __syncthreads();
    // Only needs to be done if threadIdx.x % 2 == 0, but might as well compute it for the rest since there is no additional runtime
    left = heap[threadIdx.x + threads];
    right = heap[threadIdx.x + 1 + threads];
    pow2 = 2;
    parent_index = (threads + threadIdx.x) / pow2;
    // This is where I substituted the code for the commented section below.
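    // First 5 steps: no synchronization here, relying on the (assumed) lock-step execution within a warp.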
    for (unsigned char i = 0; i < 5; i++)
    {
        if (threadIdx.x % pow2 == 0)
        {
            left += right;
            heap[parent_index] = left;
            right = heap[parent_index + 1];  // reassign for next layer
            parent_index = parent_index >> 1; // reassign for next layer
        }
        pow2 = pow2 << 1;
    }

    __syncthreads();

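    // Last 3 steps: partial sums are combined across warps, so __syncthreads() is kept between steps.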
    for (unsigned char i = 5; i < logthreads; i++)
    {
        if (threadIdx.x % pow2 == 0)
        {
            left += right;
            heap[parent_index] = left;
        }
        __syncthreads();
        if (threadIdx.x % pow2 == 0)
        {
            right = heap[parent_index + 1];  // reassign for next layer
            parent_index = parent_index >> 1; // reassign for next layer
        }
        pow2 = pow2 << 1;
    }
    /*

    for (unsigned char i = 0; i < logthreads; i++)
    {
        if (threadIdx.x % pow2 == 0)
        {
            left += right;
            heap[parent_index] = left;
        }
        __syncthreads();
        if (threadIdx.x % pow2 == 0)
        {
            right = heap[parent_index + 1];  // reassign for next layer
            parent_index = parent_index >> 1; // reassign for next layer
        }
        pow2 = pow2 << 1;
    }
    */
    // additional code after here...
}

With Independent Thread Scheduling, introduced in the Volta generation, threads within a warp are not guaranteed to execute in lock-step.
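
As an illustration (a minimal sketch of a hypothetical kernel, not the code above; warp_sum_shared, in and out are made-up names): even a shared-memory reduction that never leaves a single warp needs an explicit __syncwarp() between steps, because the lanes may not advance together.

__global__ void warp_sum_shared(const int *in, int *out)
{
    // Assumes one warp per block (blockDim.x == 32).
    __shared__ int s[32];
    unsigned lane = threadIdx.x;
    s[lane] = in[blockIdx.x * 32 + lane];
    __syncwarp(); // all 32 stores must be visible before the first step reads them
    for (unsigned stride = 16; stride > 0; stride >>= 1)
    {
        if (lane < stride)
            s[lane] += s[lane + stride];
        __syncwarp(); // needed even within one warp: orders this step before the next
    }
    if (lane == 0)
        out[blockIdx.x] = s[0];
}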

This sort of thinking is officially deprecated and unsafe. Code written that way is by definition broken, regardless of what output it produces.

Thank you for the link! I read through the article; is it correct to say that lock-step execution within a warp is only guaranteed when using the warp-level primitives, or after a __syncwarp() operation with the proper bitmask?

syncwarp is an execution barrier, and like syncthreads, it also has some memory-barrier effects. You can read about its semantics in the programming guide. It tells you nothing about what will happen after the syncwarp. Assuming that syncwarp or syncthreads guarantees some sort of convergence after the barrier is again incorrect thinking.

Yes, the _sync()-variant primitives that accept a bitmask guarantee, for that instruction only, that the necessary synchronization will be provided to achieve the requested warp convergence. (This assumes you have not prevented the requested convergence through forced divergence, i.e. conditional code that keeps some of the named threads from reaching the instruction. This is a complex topic to spell out in full detail, with the appropriate disclaimers depending on GPU architecture.)
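
For instance (a minimal hypothetical sketch, not code from this thread; warp_sum is a made-up helper): a warp-wide sum built from __shfl_down_sync() names all 32 lanes in the 0xffffffff mask, and the convergence guarantee applies to each shuffle instruction itself, not to any code before or after it.

// Every lane named in the mask must execute the call; each __shfl_down_sync()
// synchronizes exactly those lanes, for that instruction only.
__device__ int warp_sum(int v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v; // lane 0 ends up holding the sum of all 32 lanes' inputs
}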

I see, that explains the confusion I had with this concept. Thanks for your help!
