Warps not executing in lock-step

Hello!
I am running CUDA code on an RTX 3090 GPU. I am trying to find the sum of the integers in an array using a full binary tree. I structured the binary tree the same way one would structure a max-heap with 1-indexing (I simply waste the element at index 0). In the code pasted below, the threads are organized linearly, with block dimensions of dim3(256, 1, 1).

In the commented-out section, I had 8 steps of upsweep with a __syncthreads() between each step. This worked fine across many different inputs (over 4 GB of input). However, I thought that for the first 5 steps of upsweep I probably wouldn't need a __syncthreads(), since those operations execute in lock-step within a warp. So I separated the for loop into two loops: one that performs the first 5 steps of upsweep in lock-step, and one that performs the last 3 steps with __syncthreads() in between. However, this fails for some reason. Could somebody explain what is going on here?

#define threads 256
#define logthreads 8
__global__ void upsweep(int *arr, int *d_n, int *ws)
{
    __shared__ int heap[2 * threads];
    int left, right, pow2, parent_index;
    heap[threadIdx.x] = 0;
    heap[threadIdx.x + threads] = arr[blockIdx.x * threads + threadIdx.x];
    __syncthreads();
    // Only needs to be done if threadIdx.x % 2 == 0, but might as well compute it for the rest since there is no additional runtime
    left = heap[threadIdx.x + threads];
    right = heap[threadIdx.x + 1 + threads];
    pow2 = 2;
    parent_index = (threads + threadIdx.x) / pow2;
    // This is where I substituted the code for the commented section below.
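    // First 5 steps: no synchronization here, relying on the (assumed) lock-step execution within a warp.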
    for (unsigned char i = 0; i < 5; i++)
    {
        if (threadIdx.x % pow2 == 0)
        {
            left += right;
            heap[parent_index] = left;
            right = heap[parent_index + 1];  // reassign for next layer
            parent_index = parent_index >> 1; // reassign for next layer
        }
        pow2 = pow2 << 1;
    }

    __syncthreads();

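    // Last 3 steps: partial sums are combined across warps, so __syncthreads() is kept between steps.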
    for (unsigned char i = 5; i < logthreads; i++)
    {
        if (threadIdx.x % pow2 == 0)
        {
            left += right;
            heap[parent_index] = left;
        }
        __syncthreads();
        if (threadIdx.x % pow2 == 0)
        {
            right = heap[parent_index + 1];  // reassign for next layer
            parent_index = parent_index >> 1; // reassign for next layer
        }
        pow2 = pow2 << 1;
    }
    /*

    for (unsigned char i = 0; i < logthreads; i++)
    {
        if (threadIdx.x % pow2 == 0)
        {
            left += right;
            heap[parent_index] = left;
        }
        __syncthreads();
        if (threadIdx.x % pow2 == 0)
        {
            right = heap[parent_index + 1];  // reassign for next layer
            parent_index = parent_index >> 1; // reassign for next layer
        }
        pow2 = pow2 << 1;
    }
    */
    // additional code after here...
}

With Independent Thread Scheduling, introduced in the Volta generation, threads within a warp are not guaranteed to execute in lock-step.
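
As an illustration (a minimal sketch of a hypothetical kernel, not the code above; warp_sum_shared, in and out are made-up names): even a shared-memory reduction that never leaves a single warp needs an explicit __syncwarp() between steps, because the lanes may not advance together.

__global__ void warp_sum_shared(const int *in, int *out)
{
    // Assumes one warp per block (blockDim.x == 32).
    __shared__ int s[32];
    unsigned lane = threadIdx.x;
    s[lane] = in[blockIdx.x * 32 + lane];
    __syncwarp(); // all 32 stores must be visible before the first step reads them
    for (unsigned stride = 16; stride > 0; stride >>= 1)
    {
        if (lane < stride)
            s[lane] += s[lane + stride];
        __syncwarp(); // needed even within one warp: orders this step before the next
    }
    if (lane == 0)
        out[blockIdx.x] = s[0];
}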

This sort of thinking is officially deprecated and unsafe. Code written that way is by definition broken, regardless of what output it produces.

Thank you for the link! I read through the article; is it correct to say that lock-step execution within a warp is only guaranteed when using the warp-level primitives, or after a __syncwarp() operation with the proper bitmask?

syncwarp is an execution barrier, and like syncthreads, it also has some memory-barrier effects. You can read about its semantics in the programming guide. It tells you nothing about what will happen after the syncwarp. Assuming that syncwarp or syncthreads guarantees some sort of convergence after the barrier is again incorrect thinking.

Yes, the _sync()-variant primitives that accept a bitmask guarantee, for that instruction only, that the necessary synchronization will be provided to achieve the requested warp convergence. (This assumes you have not prevented the requested convergence through forced divergence, i.e. conditional code that keeps some of the named threads from reaching the instruction. This is a complex topic to spell out in full detail, with the appropriate disclaimers depending on GPU architecture.)
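
For instance (a minimal hypothetical sketch, not code from this thread; warp_sum is a made-up helper): a warp-wide sum built from __shfl_down_sync() names all 32 lanes in the 0xffffffff mask, and the convergence guarantee applies to each shuffle instruction itself, not to any code before or after it.

// Every lane named in the mask must execute the call; each __shfl_down_sync()
// synchronizes exactly those lanes, for that instruction only.
__device__ int warp_sum(int v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v; // lane 0 ends up holding the sum of all 32 lanes' inputs
}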

I see, that explains the confusion I had with this concept. Thanks for your help!
