So, I was reading and experimenting with the WARP level primitives mentioned in the blog CUDA WARP LEVEL PRIMITIVES.
Section 1 :
Under the section Implicit Warp-Synchronous Programming is Unsafe they have this example in Listing 13
v = foo();
if (threadIdx.x % 2) {
__syncwarp(); //line 3
v = __shfl(0); // L3 will get undefined result because lane 0
__syncwarp(); // is not active when L3 is executed. L3 and L6
} else { // will execute divergently.
__syncwarp(); // line 7
v = __shfl(0);
__syncwarp();
}
They further mention → “The __syncwarp() at line 3 and line 7 would ensure foo() is called by all threads in the warp before line 4 or line 8 is executed”
So, the above statement means all 32 threads of the warp will call foo() before line 4 or line 8. It does not sounds correct to me.
My understanding - __syncwarp() at line 3 will not ensure threads with even thread ID call foo() before line 4 but will only ensure threads with odd thread ID in the warp call foo() before line 4 and similarly __syncwarp() at line 4 will not ensure threads with odd thread ID call foo() before line 8 but will only ensure threads with even thread ID in the warp call foo() before line 8. So, all threads in a warp calling foo() before line 4 and line 8 is untrue.
Question - Am i correct in my understanding? If not then what I am missing in my understanding. can you please fill in the gaps?
Section 2 : Synchronized Data Exchange on A100 GPU , compute capability 8.0 ( came post Volta).
So, in the blog CUDA WARP LEVEL PRIMITIVES under Synchronized Data Exchange section they mention → “On Volta and later GPU architectures, the data exchange primitives can be used in thread-divergent branches: branches where some threads in the warp take a different path than the others. Listing 4 shows an example where all the threads in a warp get the value of val from the thread at lane 0. The even- and odd-numbered threads take different branches of an if statement.”
and they provide the following example LISTING 4
// LISTING 4 - they mention FULL_MASK (0xffffffff for 32 threads)
if (threadIdx.x % 2) {
val += __shfl_sync(FULL_MASK, val, 0);
…
}
else {
val += __shfl_sync(FULL_MASK, val, 0);
…
}
My understanding → In the code block, if (threadIdx.x % 2) {… }, threads with odd thread IDs will get value of val from lane 0 and in the else code section threads with even thread IDs will get value of val from lane 0.
Question 2a) : If odd threads receive value from lane 0 in the if code section, then should not the mask be 0x55555555 representing only the threads with odd threads IDs participating in the if part of the code section like in the following code snippet?
if (threadIdx.x % 2) {
val += __shfl_sync(0x55555555, val, 0);
…
}
Also similarly for the else code section should not the mask be 0xAAAAAAAA, representing only the even threads participating like in the following code snippet?
else {
val += __shfl_sync(0xAAAAAAAA, val, 0);
…
}
Description 2b)
In section 7.21.1 syntax for __shfl_sync
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
They further mention → “All of the __shfl_sync() intrinsics take an optional width parameter which alters the behavior of the intrinsic. width must have a value which is a power of two in the range [1, warpSize] (i.e., 1, 2, 4, 8, 16 or 32)”
Question 2b.1)
From the above definition of WIDTH , I only perceive that default WIDTH is less than and equal to 32 and should be a power of 2. I know MASK indicates the participating threads of a warp but what does WIDTH parameter indicate? Does it indicate the total number of participating threads?
Question 2b.2)
So, in the above LISTING 4 in both if and else code section they use the WIDTH parameter value to be 32. But, if the WIDTH parameter indicates the number of participating THREADS should not WIDTH be set to 16 in both if and else statements like in the following code snippet? Because in both if and else code section only 16 threads of a WARP enter
if (threadIdx.x % 2) {
val += __shfl_sync(FULL_MASK, val, 16);
…
}
else {
val += __shfl_sync(FULL_MASK, val, 16);
…
}
Also what happens when the total number of participating threads are ODD, since the WIDTH parameter can not be ODD, has to be even in that scenario do we assign WIDTH = number_of_participating_threads + 1.
Question - I am not sure if my understanding correct, I may be wrong. If I am wrong in my understanding, can you please correct me?
Scenario :
So the above code snippet in LISTING 4 produces expected output if I use (a) default WIDTH = 32 , (b) call the kernel with 32 threads and (c) use the mask 0xffffffff but if I slightly modify it by setting (a) WIDTH parameter to 4 instead of 32 (b)call the kernel with 4 threads instead of 32 threads and (c)set mask to 0xf0000000 representing only the four threads participating threads instead of 0xffffffff, I receive un-expected output.. Following is the code for it.
Please note : if there is no thread divergence (no if else statement), it produces expected output even with mask 0xf0000000, number of threads executing the kernel 4 and WIDTH =4.
#include <stdio.h>
__global__ void test_ex1(int width1, unsigned mask1 )
{
int val= 5;
if (threadIdx.x == 0)
{
val = 55;
}
if(threadIdx.x %2)
{
val = __shfl_sync(mask1, val , 0, width1 );
}
else
{
val = __shfl_sync(mask1, val , 0, width1 );
}
printf("tid = %d val = %d\n", threadIdx.x, val);
}
int main() {
int width = 32;
int width1 = 4;
unsigned mask = 0xffffffff, mask1 = 0xf0000000;
//test_ex1<<<1,width>>>(width, mask);
test_ex1<<<1,width1>>>(width1, mask1);
cudaDeviceSynchronize();
return 0;
}
Real time OUTPUT of the above code:
tid = 0 val = 55
tid = 1 val = 0
tid = 2 val = 55
tid = 3 val = 0
But the EXPECTED OUTPUT of the above code :
tid = 0 val = 55
tid = 1 val = 55
tid = 2 val = 55
tid = 3 val = 55
Question 2b.3) - In the above code snippet why are threads with thread IDs 1 and 2 not receiving val 55 from lane 0 instead receiving a garbage value 0? Can you please help me out with this?