Question regarding the mask in __shfl_sync operations

I was just hoping to clarify my understanding of the role of the mask in register shuffle operations. Are masks used for synchronization purposes only, or do they also control which threads participate in a shuffle operation?

Let’s say within a 32 thread warp, I have 4 different masks distributed as follows:
Threads 0-7: 0x000000FF
Threads 8-15: 0x0000FF00
Threads 16-23: 0x00FF0000
Threads 24-31: 0xFF000000

If I call __shfl_up_sync with a delta value of 1, will threads 8, 16, and 24 still receive information from 7, 15, and 23, respectively? Or will they receive no data?

Similarly, if I use __shfl_sync with an index outside of the range of the mask, will the IDs wrap around the way they do when I use the full mask? (e.g., would accessing thread 8 from thread 7 give the value from thread 0?)

My testing (on an older Kepler card) seems to indicate that the mask only controls which threads are synchronized before the shuffle, because I can still read values from threads outside the mask. I just wanted to verify this behavior for the Volta architecture, where I plan to deploy.

Some more background on what I’m interested in: I would like to perform shuffle reductions within parts of a warp (e.g., four dot products of 8 values each within one warp). The role of the mask determines how easy or difficult this is to program. I have also seen the article on cooperative groups; perhaps that would be a better route?


Starting with your last question: yes. If you are going to combine Volta+ hardware with the CUDA 9.x+ software stack, cooperative groups (CG) are the recommended route. They are more flexible and powerful.

Referring to the Programming Guide

Question 1, the mask tells a *_sync intrinsic which threads must be converged before the intrinsic executes. Each calling thread must have its own bit set in the mask, and reading a value from a thread that is not actively participating returns an undefined result. It is the width parameter, not the mask, that divides the warp into independent subsections for the shuffle itself.

Question 2, I don’t quite understand the rationale for 4 different masks. I’m assuming you want to partition a warp of 32 threads into 4 sub-partitions of 8 threads? If so, this can be done easily with CG.
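As a minimal sketch (assuming CUDA 9+ and the cooperative_groups header; the kernel and output names are illustrative), the partitioning looks like this:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Partition the 32-thread warp into four 8-thread tiles.
__global__ void partition_demo(int *out)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<8> tile = cg::tiled_partition<8>(block);

    // thread_rank() is the lane ID *within* the 8-thread tile (0..7),
    // so each sub-group behaves like a tiny warp of its own.
    out[threadIdx.x] = tile.thread_rank();
}
```

Each tile can then shuffle, sync, and reduce independently of its siblings.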

__shfl_up_sync() calculates a source lane ID by subtracting delta from the caller’s lane ID. The value of var held by the resulting lane ID is returned: in effect, var is shifted up the warp by delta lanes. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. The source lane index will not wrap around the value of width, so effectively the lower delta lanes will be unchanged.
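A small sketch of that behavior (names are illustrative): with width = 8 and the full mask, each 8-lane subsection shifts independently and does not wrap, so threads 8, 16, and 24 keep their own values rather than receiving data from threads 7, 15, and 23.

```cuda
// delta = 1, width = 8: the source lane is computed within each
// 8-lane subsection, and the lowest lane of every subsection is
// left unchanged (no carry-over across subsection boundaries).
__global__ void shfl_up_demo(int *out)
{
    int lane = threadIdx.x & 31;
    int var  = lane;
    int up = __shfl_up_sync(0xFFFFFFFFu, var, 1, 8);
    out[lane] = up;  // lanes 0, 8, 16, 24 keep their own value
}
```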

Question 3, If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection).
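In other words (a sketch with illustrative names): with width = 8, a srcLane of 8 reduces to 8 % 8 == 0, so every lane reads lane 0 of its own subsection rather than thread 8 of the warp.

```cuda
// srcLane = 8, width = 8: the index wraps modulo width within the
// caller's subsection, so thread 7 reads thread 0, thread 15 reads
// thread 8, and so on.
__global__ void shfl_idx_demo(int *out)
{
    int lane = threadIdx.x & 31;
    int var  = lane;
    int v = __shfl_sync(0xFFFFFFFFu, var, 8, 8);  // 8 % 8 == 0
    out[lane] = v;
}
```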

Question 4, again this can be accomplished more easily with CG.
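For the four-dot-products use case, a CG sketch might look like the following (assuming one element per thread; the array names a, b, and out are illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Four independent 8-element dot products per warp, one per tile.
__global__ void dot8(const float *a, const float *b, float *out)
{
    cg::thread_block_tile<8> tile =
        cg::tiled_partition<8>(cg::this_thread_block());

    float prod = a[threadIdx.x] * b[threadIdx.x];

    // shfl_down reduction within the 8-thread tile.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        prod += tile.shfl_down(prod, offset);

    // Lane 0 of each tile holds its 8-element dot product.
    if (tile.thread_rank() == 0)
        out[threadIdx.x / 8] = prod;
}
```

The shuffles here stay within the tile automatically, so there is no need to construct per-sub-group masks by hand.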

Thanks for the response. The rationale for the 4 different masks is as you said: I want to partition a warp into 4 parts of 8 threads each. I will look more into cooperative groups for this task.

Thanks again.