Requesting clarification - CUDA WARP level primitives and THREAD divergence


So, I was reading and experimenting with the warp-level primitives described in the blog post CUDA Warp-Level Primitives.

Section 1 :

Under the section “Implicit Warp-Synchronous Programming is Unsafe” they have this example in Listing 13:

v = foo();
if (threadIdx.x % 2) {
    __syncwarp();        // line 3
    v = __shfl(v, 0);    // line 4 will get an undefined result because lane 0
    __syncwarp();        // is not active when line 4 is executed. Lines 4 and 8
} else {                 // will execute divergently.
    __syncwarp();        // line 7
    v = __shfl(v, 0);    // line 8
    __syncwarp();
}

They further mention → “The __syncwarp() at line 3 and line 7 would ensure foo() is called by all threads in the warp before line 4 or line 8 is executed.”

So, the above statement means all 32 threads of the warp will call foo() before line 4 or line 8 is reached. That does not sound correct to me.

My understanding - The __syncwarp() at line 3 will not ensure that threads with even thread IDs call foo() before line 4; it will only ensure that threads with odd thread IDs in the warp call foo() before line 4. Similarly, the __syncwarp() at line 7 will not ensure that threads with odd thread IDs call foo() before line 8; it will only ensure that threads with even thread IDs in the warp call foo() before line 8. So the claim that all threads in the warp call foo() before line 4 and line 8 seems untrue.

Question - Am I correct in my understanding? If not, what am I missing? Can you please fill in the gaps?

Section 2 : Synchronized Data Exchange, on an A100 GPU, compute capability 8.0 (post-Volta).

So, in the blog CUDA WARP LEVEL PRIMITIVES under Synchronized Data Exchange section they mention → “On Volta and later GPU architectures, the data exchange primitives can be used in thread-divergent branches: branches where some threads in the warp take a different path than the others. Listing 4 shows an example where all the threads in a warp get the value of val from the thread at lane 0. The even- and odd-numbered threads take different branches of an if statement.”

and they provide the following example, Listing 4:

// LISTING 4 - they mention FULL_MASK (0xffffffff for 32 threads)

if (threadIdx.x % 2) {
    val += __shfl_sync(FULL_MASK, val, 0);
    …
}
else {
    val += __shfl_sync(FULL_MASK, val, 0);
    …
}

My understanding → In the code block if (threadIdx.x % 2) { … }, threads with odd thread IDs get the value of val from lane 0, and in the else section threads with even thread IDs get the value of val from lane 0.

Question 2a) : If odd threads receive the value from lane 0 in the if section, should the mask not be 0xAAAAAAAA, representing only the threads with odd thread IDs participating in that section, as in the following snippet?

if (threadIdx.x % 2) {
    val += __shfl_sync(0xAAAAAAAA, val, 0);
…
}

Similarly, for the else section, should the mask not be 0x55555555, representing only the even threads participating, as in the following snippet?

else {
    val += __shfl_sync(0x55555555, val, 0);
…
}

Description 2b)
In section 7.21.1, the syntax for __shfl_sync is:

  T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize); 

They further mention → “All of the __shfl_sync() intrinsics take an optional width parameter which alters the behavior of the intrinsic. width must have a value which is a power of two in the range [1, warpSize] (i.e., 1, 2, 4, 8, 16 or 32).”

Question 2b.1)
From the above definition of width, I only gather that the default width is at most 32 and must be a power of 2. I know the mask indicates the participating threads of a warp, but what does the width parameter indicate? Does it indicate the total number of participating threads?

Question 2b.2)

So, in Listing 4 above, both the if and else sections leave the width parameter at its default of 32. But if the width parameter indicates the number of participating threads, should width not be set to 16 in both the if and else statements, as in the following snippet? After all, only 16 threads of the warp enter each section.

if (threadIdx.x % 2) {
    val += __shfl_sync(FULL_MASK, val, 0, 16);

}
else {
    val += __shfl_sync(FULL_MASK, val, 0, 16);

}

Also, what happens when the total number of participating threads is odd? Since the width parameter cannot be odd, do we in that scenario set width = number_of_participating_threads + 1?

Question - I am not sure if my understanding is correct; I may be wrong. If so, can you please correct me?

Scenario :

So the above code snippet from Listing 4 produces the expected output if I (a) use the default width = 32, (b) call the kernel with 32 threads, and (c) use the mask 0xffffffff. But if I slightly modify it by (a) setting the width parameter to 4 instead of 32, (b) calling the kernel with 4 threads instead of 32, and (c) setting the mask to 0xf0000000 (representing only the four participating threads) instead of 0xffffffff, I receive unexpected output. Following is the code for it.

Please note: if there is no thread divergence (no if/else statement), it produces the expected output even with mask 0xf0000000, 4 threads executing the kernel, and width = 4.

#include <stdio.h>

__global__ void test_ex1(int width1, unsigned mask1 )
{
        int val= 5;

        if (threadIdx.x == 0)
        {
                val = 55;
        }

        if(threadIdx.x %2)
        {
                val = __shfl_sync(mask1, val , 0, width1 );
        }
        else
        {
                val = __shfl_sync(mask1, val , 0, width1 );
        }
        printf("tid = %d  val = %d\n", threadIdx.x, val);
}

int main() {
        int width = 32;
        int width1 = 4;
        unsigned mask = 0xffffffff, mask1 = 0xf0000000;
        //test_ex1<<<1,width>>>(width, mask);
        test_ex1<<<1,width1>>>(width1, mask1);
        cudaDeviceSynchronize();
        return 0;
}
Actual output of the above code:

tid = 0  val = 55
tid = 1  val = 0
tid = 2  val = 55
tid = 3  val = 0

But the expected output of the above code:

tid = 0  val = 55
tid = 1  val = 55
tid = 2  val = 55
tid = 3  val = 55

Question 2b.3) - In the above code, why are the threads with thread IDs 1 and 3 not receiving val = 55 from lane 0, and instead receiving a garbage value of 0? Can you please help me out with this?

I would be careful about trying to read too much into Listing 13, which presents an example of inappropriate coding. Personally, I wouldn’t try to read more into it than that, or try to make sense of code that is not sensible. However, I suspect the missing piece of info for you is covered in the programming guide. Before trying to understand such things in detail, I would encourage you to read the available descriptions in the programming guide, especially when asking about the behavior of a specific documented intrinsic (and they are pretty much all documented there). Let’s take a look at the description for __syncwarp(). It might not match what you think:

void __syncwarp(unsigned mask=0xffffffff);

will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask) before resuming execution. Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute a corresponding __syncwarp() with the same mask, or the result is undefined.

We must unpack that carefully. Note that the mask is a default parameter. That means if not supplied, it defaults (in this case) to 0xFFFFFFFF. Also note the terminology: " will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask)".

"have executed a __syncwarp()" (a __syncwarp()? not the __syncwarp()?)

Doesn’t that sound a little unusual? It does to me. It means that if __syncwarp() is used in warp-divergent paths (what?!), and the scheduler can do so, it will cause each thread to wait at its __syncwarp() until all threads, even those in the divergent path, have reached a __syncwarp(). That means that yes, the usage of __syncwarp() here guarantees that no thread will reach either line 4 or line 8 before all threads have reached their respective __syncwarp(), which implies that all threads have executed foo(). (This warp-scheduler behavior implies Volta+.)

This is presenting (in my view), a rather unusual aspect of warp-divergent behavior. I cover it in some detail here. Briefly, the warp scheduler in Volta and beyond can and will try to “combine” the sync variants of shuffle ops from divergent paths, if necessary, to try and “satisfy” the member mask. If the member mask is satisfied, then the op completes “as if” the warp is converged at least to the extent required by the shuffle mask. Please read the linked article for additional info. Conceptually, there are similarities here with what I responded to with your previous question.

Did you read the description given for the function and parameter in the programming guide section on shuffle ops?


@Robert_Crovella a big thanks to you. It all makes sense now, and thank you for sharing your Stack Overflow explanation link (the PTX link, perfect); that made everything crystal clear. I had read the documentation in the programming guide, but my code was not working and I tried to interpret the documentation differently (incorrectly) to make sense of my failing code. Thank you for filling in the gaps. Also, the "a __syncwarp()" and not "the __syncwarp()" point... thanks for that.

Also, I have the working code now. Leaving it here; someone in the future might find it useful.

#include <stdio.h>

__global__ void test_ex1(int width1, unsigned mask1 )
{
        int val= 5;

        if (threadIdx.x == 0)
        {
                val = 55;
        }

        if(threadIdx.x %2)
        {
                val = __shfl_sync(mask1, val , 0, width1 );
        }
        else
        {
                val = __shfl_sync(mask1, val , 0, width1 );
        }
        printf("tid = %d  val = %d\n", threadIdx.x, val);
}

int main() {
        int width = 32;
        int width1 = 4;
        unsigned mask = 0xffffffff, mask1 = 0x0000000f;
        //test_ex1<<<1,width>>>(width, mask);
        test_ex1<<<1,width1>>>(width1, mask1);
        cudaDeviceSynchronize();
        return 0;
}
