Using CUDA Warp-Level Primitives

Originally published at: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

Figure 1: The Tesla V100 accelerator with Volta GV100 GPU (SXM2 form factor).

NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Many CUDA programs achieve high performance by taking advantage of warp execution. In this blog we show how to use primitives introduced in CUDA 9 to make your…

Hi guys,
there is a small mistake in Listing 6: it should be
int y2 = threadIdx.x % 4;

Robin, thanks for pointing out the mistake. Fixed.

This is a very helpful article. I think the information on shuffles in the CUDA C Programming Guide is a bit too brief, especially regarding the mask.

In Listing 8, I think the initialisation should be
int v = shmem[tid];

Do warp primitives work on the Pascal architecture? I have a GeForce GTX 1050 Ti card and I'm trying to get the __shfl_down_sync function to work, but it just returns 0. Please let me know if it works on Pascal. Thanks!

I have a question on the use of __match_all_sync in Listing 9.
According to the CUDA Programming Guide:
"__match_all_sync
Returns mask if all threads in mask have the same value for value; otherwise 0 is returned. Predicate pred is set to true if all threads in mask have the same value of value; otherwise the predicate is set to false."

So, is it actually what we want there? How about
"__match_any_sync
Returns mask of threads that have same value of value in mask"?

Thanks


The explanation for Listing 4 says "all the threads in a warp get the value of val from the thread at lane 0" -- for this, shouldn't the offset be "-threadIdx.x" rather than 0? When the offset is 0, isn't the code just val+=val?
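For comparison, here is how I would write the explicitly offset-based version (my own snippet, not from the article); is the third argument of __shfl_sync() an offset or an absolute lane index?

    int a = __shfl_sync(0xffffffff, val, 0);        // does 0 here mean "lane 0" or "offset 0"?
    int b = __shfl_down_sync(0xffffffff, val, 1);   // the offset-based variant, for comparison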

Regarding "On the latest Volta (and future) GPUs, you can run library functions that use warp synchronous primitives without worrying whether the function is called in a thread-divergent branch." -- so the compiler ensures that any mask we supply, such as -1, is ANDed with a properly determined mask of ACTIVE THREADS (while avoiding the error shown in Listing 5 -- "__active_mask_with_sync()")? Otherwise, in Listing 4, FULL_MASK (=-1) would need to be replaced by "even mask" and "odd mask".


Do the *_sync() operations (e.g., __shfl_sync()) imply a barrier/memory fence, as __syncwarp() does?

Is there a porting guide that shows the mapping of Kepler warp intrinsics to Volta warp intrinsics, so that legacy applications could adopt it without learning new semantics?
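For example, what I am hoping for is something like the following one-to-one mapping (my own guess at the pattern, not an official table):

    // Guess at the general pattern: each legacy intrinsic gains a _sync suffix
    // plus a leading mask of the lanes expected to participate (e.g. 0xffffffff
    // for a full warp).
    //
    //   __shfl(v, srcLane)         ->  __shfl_sync(mask, v, srcLane)
    //   __shfl_down(v, delta)      ->  __shfl_down_sync(mask, v, delta)
    //   __ballot(pred)             ->  __ballot_sync(mask, pred)
    //   __any(pred) / __all(pred)  ->  __any_sync(mask, pred) / __all_sync(mask, pred)

    // e.g. a warp-wide sum rewritten with the CUDA 9 primitives:
    __device__ int warp_sum(int v) {
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;  // lane 0 ends up with the sum over all 32 lanes
    }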

Thank you so much for this helpful guide.

A question about masks: when you say "N-th" bit, is it from the left or from the right?
so, if I only want the first thread in the warp to participate, would it be 0x0001 or 0x8000?

From the LSB. So 0x0001 is the first thread (lane 0); 0xFFFF is threads 0-15, 0x80000000 is thread 31, and 0xFFFFFFFF is all 32 threads.
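In code (a quick sketch, not from the article):

    unsigned first_lane  = 1u << 0;          // 0x00000001: lane 0 only
    unsigned lanes_0_15  = (1u << 16) - 1;   // 0x0000FFFF: lanes 0-15
    unsigned lane_31     = 1u << 31;         // 0x80000000: lane 31 only
    unsigned full_warp   = 0xffffffffu;      // all 32 lanes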

Thanks!

Awesome functionality and great writing! Heads up that __activemask() is misspelled as __active_mask() a few times.

Fixed! Thanks.

In "Update Legacy Warp-Level Programming", it says "Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang."

but in Listing 4,

if (threadIdx.x % 2) {
    val += __shfl_sync(FULL_MASK, val, 0);
}
else {
    val += __shfl_sync(FULL_MASK, val, 0);
}

and it says "On the latest Volta (and future) GPUs, you can run library functions that use warp synchronous primitives without worrying whether the function is called in a thread-divergent branch." -- and you just used FULL_MASK inside a branch.

I tested __ballot_sync myself using FULL_MASK inside nested control-flow statements; in various cases it produced unexpected outputs or even deadlocked. So I guess it is not safe to use FULL_MASK inside an arbitrary branch, at least for __ballot_sync.

So I wonder: how do I interpret Listing 4?

In Listing 4, regardless of whether the thread ID is even or odd, each thread in the warp will always execute one of the two __shfl_sync() statements. Therefore, FULL_MASK should be used.

The following code may cause a stall.


if (threadIdx.x % 2) {
    val += __shfl_sync(FULL_MASK, val, 0);
}
else {
}
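If only some lanes really do reach the primitive, one option (a sketch, not code from the article) is to build the mask from the branch condition before diverging:

    // All 32 lanes execute the ballot, so FULL_MASK is valid here.
    unsigned odd_mask = __ballot_sync(FULL_MASK, threadIdx.x % 2);
    if (threadIdx.x % 2) {
        // Only odd lanes arrive; odd_mask names exactly those lanes, and the
        // source lane (1, the first odd lane) is itself a participant.
        val += __shfl_sync(odd_mask, val, 1);
    }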

Thanks for clarifying!
Does this mean it is only safe to use FULL_MASK inside non-nested if-else statements, where all threads must execute one or the other path?
For example, the below code would not be safe?
if (some_condition) {
    if (threadIdx.x % 2) {
        val += __shfl_sync(FULL_MASK, val, 0);
    }
    else {
        val += __shfl_sync(FULL_MASK, val, 0);
    }
}
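The only way I can see to make something like this well defined would be to compute the mask from the outer condition first -- a sketch under the assumption that all 32 lanes at least reach the outer if (please correct me if this is wrong):

    unsigned inner_mask = __ballot_sync(FULL_MASK, some_condition);  // all 32 lanes vote
    if (some_condition) {
        int leader = __ffs(inner_mask) - 1;   // lowest participating lane
        if (threadIdx.x % 2) {
            val += __shfl_sync(inner_mask, val, leader);
        }
        else {
            val += __shfl_sync(inner_mask, val, leader);
        }
    }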

I would guess that this is a typo, yes!

Given the documentation and the intended use case, it should be __match_any_sync.
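For example, a minimal sketch of the intended use (my own snippet, not a copy of Listing 9; __match_any_sync requires compute capability 7.0+ and this assumes a single-warp launch):

    __global__ void group_by_key(const int *keys, int *leader) {
        int key = keys[threadIdx.x];
        // Mask of all lanes in the warp holding the same key as this lane.
        unsigned peers = __match_any_sync(0xffffffff, key);
        // Elect the lowest-numbered lane of each group as the group leader.
        leader[threadIdx.x] = __ffs(peers) - 1;
    }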

I don’t understand why __syncwarp() is needed in Listing 8.

The article said:
“The CUDA programming model does not guarantee that all the reads will be performed before all the writes.”

I came up with a situation where a sync is needed, like below:

// Assume only 32 threads in a block.
volatile __shared__ int data[32];
if (threadIdx.x % 2 == 0) {
    // Do something
} else {
    // Do something
}

data[threadIdx.x] = threadIdx.x;
int value = data[(threadIdx.x + 1) % 32];

// Write value to global memory

In the code above, since the hardware may not re-converge after the else, the threads in the same warp may not execute the same instruction, so the final result is undefined.

But what if there is no such situation, i.e. no warp-level divergence? Is it still necessary to add __syncwarp()? Assume the shared memory array is declared volatile.

// Assume only 32 threads in a block.
__global__ void kernel(int *out, int *in) {
    volatile __shared__ int data[64];
    data[threadIdx.x] = in[threadIdx.x];
    // Do I need __syncwarp()?
    data[threadIdx.x] += data[threadIdx.x + 16];
    data[threadIdx.x] += data[threadIdx.x + 8];
    data[threadIdx.x] += data[threadIdx.x + 4];
    data[threadIdx.x] += data[threadIdx.x + 2];
    data[threadIdx.x] += data[threadIdx.x + 1];

    // data[0] should be the sum of in[0..31]
    out[threadIdx.x] = data[0];
}

If it’s still necessary, please tell me why some threads may not read data written by other threads.

CUDA C++ programs following this programming model are guaranteed to work correctly on all future hardware. The compiler will remove the __syncwarp() if it is not needed on a particular target.
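For reference, a sketch in the spirit of Listing 8 (reconstructed, not copied from the article) with the __syncwarp() calls the programming model requires; zero-padding the upper half of the array is my own addition so the out-of-range reads stay well defined:

    // Assume a single warp of 32 threads.
    __global__ void kernel(int *out, const int *in) {
        __shared__ int data[64];
        data[threadIdx.x] = in[threadIdx.x];
        data[threadIdx.x + 32] = 0;           // pad so data[tid + offset] is always defined
        __syncwarp();
        int v = data[threadIdx.x];
        for (int offset = 16; offset > 0; offset /= 2) {
            v += data[threadIdx.x + offset];  // all reads for this step...
            __syncwarp();                     // ...finish before any lane writes
            data[threadIdx.x] = v;
            __syncwarp();                     // writes are visible before the next step's reads
        }
        out[threadIdx.x] = data[0];           // data[0] now holds the sum of in[0..31]
    }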