Using CUDA Warp-Level Primitives

Originally published at: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

Figure 1: The Tesla V100 accelerator with Volta GV100 GPU (SXM2 form factor).

NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Many CUDA programs achieve high performance by taking advantage of warp execution. In this blog we show how to use primitives introduced in CUDA 9 to make your…

Hi guys,
there is a small mistake in Listing 6: it should be
int y2 = threadIdx.x % 4;

Robin, thanks for pointing out the mistake. Fixed.

This is a very helpful article. I think the information on shuffles in the CUDA C Programming Guide is a bit too brief, especially regarding the mask.

In Listing 8, I think the initialisation should be
int v = shmem[tid];

Do warp primitives work on the Pascal architecture? I have a GeForce GTX 1050 Ti card and I'm trying to get the __shfl_down_sync function to work, but it just returns 0. Please let me know if it works on Pascal. Thanks!

I have a question on the use of __match_all_sync in Listing 9.
According to the CUDA Programming Guide:
"__match_all_sync
Returns mask if all threads in mask have the same value for value; otherwise 0 is returned. Predicate pred is set to true if all threads in mask have the same value of value; otherwise the predicate is set to false."

So, is it actually what we want there? How about
"__match_any_sync
Returns mask of threads that have same value of value in mask"?

Thanks


The explanation for Listing 4 says "all the threads in a warp get the value of val from the thread at lane 0" -- for this, shouldn't the offset be "-threadIdx.x" rather than 0? When the offset is 0, isn't the code just val+=val?
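For comparison, here is how I would write the explicitly offset-based version (my own snippet, not from the article); is the third argument of __shfl_sync() an offset or an absolute lane index?

    int a = __shfl_sync(0xffffffff, val, 0);        // does 0 here mean "lane 0" or "offset 0"?
    int b = __shfl_down_sync(0xffffffff, val, 1);   // the offset-based variant, for comparison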

Regarding "On the latest Volta (and future) GPUs, you can run library functions that use warp synchronous primitives without worrying whether the function is called in a thread-divergent branch." -- so the compiler ensures that any mask we supply, such as -1, is ANDed with a properly determined mask of ACTIVE THREADS (while avoiding the error shown in Listing 5 -- "__active_mask_with_sync()")? Otherwise, in Listing 4, FULL_MASK (=-1) would need to be replaced by "even mask" and "odd mask".


Do the *_sync() operations (e.g., __shfl_sync()) imply a barrier/memory fence, as __syncwarp() does?

Is there a porting guide that shows the mapping of Kepler warp intrinsics to Volta warp intrinsics, so that legacy applications could adopt it without learning new semantics?
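For example, what I am hoping for is something like the following one-to-one mapping (my own guess at the pattern, not an official table):

    // Guess at the general pattern: each legacy intrinsic gains a _sync suffix
    // plus a leading mask of the lanes expected to participate (e.g. 0xffffffff
    // for a full warp).
    //
    //   __shfl(v, srcLane)         ->  __shfl_sync(mask, v, srcLane)
    //   __shfl_down(v, delta)      ->  __shfl_down_sync(mask, v, delta)
    //   __ballot(pred)             ->  __ballot_sync(mask, pred)
    //   __any(pred) / __all(pred)  ->  __any_sync(mask, pred) / __all_sync(mask, pred)

    // e.g. a warp-wide sum rewritten with the CUDA 9 primitives:
    __device__ int warp_sum(int v) {
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;  // lane 0 ends up with the sum over all 32 lanes
    }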

Thank you so much for this helpful guide.

A question about masks: when you say "N-th" bit, is it from the left or from the right?
so, if I only want the first thread in the warp to participate, would it be 0x0001 or 0x8000?

From the LSB. So 0x0001 is the first thread (lane 0); 0xFFFF is threads 0-15, 0x80000000 is thread 31, and 0xFFFFFFFF is all 32 threads.
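In code (a quick sketch, not from the article):

    unsigned first_lane  = 1u << 0;          // 0x00000001: lane 0 only
    unsigned lanes_0_15  = (1u << 16) - 1;   // 0x0000FFFF: lanes 0-15
    unsigned lane_31     = 1u << 31;         // 0x80000000: lane 31 only
    unsigned full_warp   = 0xffffffffu;      // all 32 lanes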

Thanks!

Awesome functionality and great writing! Heads up that __activemask() is misspelled as __active_mask() a few times.

Fixed! Thanks.

In "Update Legacy Warp-Level Programming", it says "Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang."

but in Listing 4,

if (threadIdx.x % 2) {
    val += __shfl_sync(FULL_MASK, val, 0);
}
else {
    val += __shfl_sync(FULL_MASK, val, 0);
}

and it says "On the latest Volta (and future) GPUs, you can run library functions that use warp synchronous primitives without worrying whether the function is called in a thread-divergent branch." -- and you just used FULL_MASK inside a branch.

I tested __ballot_sync myself using FULL_MASK inside nested control-flow statements; in various cases it produced unexpected outputs or even deadlocked. So I guess it is not safe to use FULL_MASK inside an arbitrary branch, at least for __ballot_sync.

So I wonder: how do I interpret Listing 4?

In Listing 4, regardless of whether the thread ID is even or odd, each thread in the warp will always execute one of the two __shfl_sync() statements. Therefore, FULL_MASK should be used.

The following code may cause a stall.


if (threadIdx.x % 2) {
    val += __shfl_sync(FULL_MASK, val, 0);
}
else {
}
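If only some lanes really do reach the primitive, one option (a sketch, not code from the article) is to build the mask from the branch condition before diverging:

    // All 32 lanes execute the ballot, so FULL_MASK is valid here.
    unsigned odd_mask = __ballot_sync(FULL_MASK, threadIdx.x % 2);
    if (threadIdx.x % 2) {
        // Only odd lanes arrive; odd_mask names exactly those lanes, and the
        // source lane (1, the first odd lane) is itself a participant.
        val += __shfl_sync(odd_mask, val, 1);
    }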

Thanks for clarifying!
Does this mean it is only safe to use FULL_MASK inside non-nested if-else statements, where all threads must execute one or the other path?
For example, the below code would not be safe?
if (some_condition) {
    if (threadIdx.x % 2) {
        val += __shfl_sync(FULL_MASK, val, 0);
    }
    else {
        val += __shfl_sync(FULL_MASK, val, 0);
    }
}
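The only way I can see to make something like this well defined would be to compute the mask from the outer condition first -- a sketch under the assumption that all 32 lanes at least reach the outer if (please correct me if this is wrong):

    unsigned inner_mask = __ballot_sync(FULL_MASK, some_condition);  // all 32 lanes vote
    if (some_condition) {
        int leader = __ffs(inner_mask) - 1;   // lowest participating lane
        if (threadIdx.x % 2) {
            val += __shfl_sync(inner_mask, val, leader);
        }
        else {
            val += __shfl_sync(inner_mask, val, leader);
        }
    }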

I would guess that this is a typo, yes!

Given the documentation and the intended use case, it should be __match_any_sync.
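For example, a minimal sketch of the intended use (my own snippet, not a copy of Listing 9; __match_any_sync requires compute capability 7.0+ and this assumes a single-warp launch):

    __global__ void group_by_key(const int *keys, int *leader) {
        int key = keys[threadIdx.x];
        // Mask of all lanes in the warp holding the same key as this lane.
        unsigned peers = __match_any_sync(0xffffffff, key);
        // Elect the lowest-numbered lane of each group as the group leader.
        leader[threadIdx.x] = __ffs(peers) - 1;
    }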

I don’t understand why __syncwarp() is needed in Listing 8.

The article said:
“The CUDA programming model does not guarantee that all the reads will be performed before all the writes.”

I came up with a situation where a sync is needed, like below:

// Assume only 32 threads in a block.
volatile __shared__ int data[32];
if (threadIdx.x % 2 == 0) {
    // Do something
} else {
    // Do something
}

data[threadIdx.x] = threadIdx.x;
int value = data[(threadIdx.x + 1) % 32];

// Write value to global memory

In the code above, since the hardware may not re-converge after the else, the threads in the same warp may not execute the same instruction, so the final result is undefined.

But what if there is no such situation, i.e. no warp-level divergence? Is it still necessary to add __syncwarp()? Assume the shared memory array is declared volatile.

// Assume only 32 threads in a block.
__global__ void kernel(int *out, int *in) {
    volatile __shared__ int data[64];
    data[threadIdx.x] = in[threadIdx.x];
    // Do I need __syncwarp()?
    data[threadIdx.x] += data[threadIdx.x + 16];
    data[threadIdx.x] += data[threadIdx.x + 8];
    data[threadIdx.x] += data[threadIdx.x + 4];
    data[threadIdx.x] += data[threadIdx.x + 2];
    data[threadIdx.x] += data[threadIdx.x + 1];

    // data[0] should be the sum of in[0..31]
    out[threadIdx.x] = data[0];
}

If it’s still necessary, please tell me why some threads may not read data written by other threads.

CUDA C++ programs following this programming model are guaranteed to work correctly on all future hardware. The compiler will remove the __syncwarp() if it is not needed on a particular target.
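For reference, a sketch in the spirit of Listing 8 (reconstructed, not copied from the article) with the __syncwarp() calls the programming model requires; zero-padding the upper half of the array is my own addition so the out-of-range reads stay well defined:

    // Assume a single warp of 32 threads.
    __global__ void kernel(int *out, const int *in) {
        __shared__ int data[64];
        data[threadIdx.x] = in[threadIdx.x];
        data[threadIdx.x + 32] = 0;           // pad so data[tid + offset] is always defined
        __syncwarp();
        int v = data[threadIdx.x];
        for (int offset = 16; offset > 0; offset /= 2) {
            v += data[threadIdx.x + offset];  // all reads for this step...
            __syncwarp();                     // ...finish before any lane writes
            data[threadIdx.x] = v;
            __syncwarp();                     // writes are visible before the next step's reads
        }
        out[threadIdx.x] = data[0];           // data[0] now holds the sum of in[0..31]
    }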