Question regarding the mask in __shfl_sync operations

I was just hoping to clarify my understanding of the role of the mask in register shuffle operations. Are masks used for synchronization purposes only, or do they also control which threads participate in a shuffle operation?

Let’s say within a 32 thread warp, I have 4 different masks distributed as follows:
Threads 0-7: 0x000000FF
Threads 8-15: 0x0000FF00
Threads 16-23: 0x00FF0000
Threads 24-31: 0xFF000000

If I call __shfl_up_sync with a delta value of 1, will threads 8, 16, and 24 still receive information from 7, 15, and 23, respectively? Or will they receive no data?

Similarly, if I use __shfl_sync with an index outside of the range of the mask, will the IDs wrap around the way they do when I use the full mask? (e.g., would accessing thread 8 from thread 7 give the value from thread 0?)

My testing (on an older Kepler card) seems to indicate that the mask only controls which threads are synchronized before the shuffle, because I can still read values from threads outside the mask. I just wanted to verify this behavior for the Volta architecture, where I plan to deploy.

Some more background on what I’m interested in: I would like to perform shuffle reductions within parts of a warp (e.g., four dot products of 8 values each within one warp). The role of the mask determines how easy or difficult this is to program. I have also seen the article on cooperative groups; perhaps that would be a better route?


Starting with your last question: yes. If you are going to combine Volta+ hardware with the CUDA 9.x+ software stack, cooperative groups (CG) are the recommended route. They are more flexible and powerful.

Referring to the Programming Guide

Question 1, the mask tells a *_sync intrinsic which threads must be converged before the intrinsic executes. Each calling thread must have its own bit set in the mask, and reading a value from a thread that is not actively participating returns an undefined result. It is the width parameter, not the mask, that divides the warp into independent subsections for the shuffle itself.

Question 2, I don’t quite understand the rationale for 4 different masks. I’m assuming you want to partition a warp of 32 threads into 4 sub-partitions of 8 threads? If so, this can be done easily with CG.
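As a minimal sketch (assuming CUDA 9+ and the cooperative_groups header; the kernel and output names are illustrative), the partitioning looks like this:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Partition the 32-thread warp into four 8-thread tiles.
__global__ void partition_demo(int *out)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<8> tile = cg::tiled_partition<8>(block);

    // thread_rank() is the lane ID *within* the 8-thread tile (0..7),
    // so each sub-group behaves like a tiny warp of its own.
    out[threadIdx.x] = tile.thread_rank();
}
```

Each tile can then shuffle, sync, and reduce independently of its siblings.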

__shfl_up_sync() calculates a source lane ID by subtracting delta from the caller’s lane ID. The value of var held by the resulting lane ID is returned: in effect, var is shifted up the warp by delta lanes. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. The source lane index will not wrap around the value of width, so effectively the lower delta lanes will be unchanged.
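A small sketch of that behavior (names are illustrative): with width = 8 and the full mask, each 8-lane subsection shifts independently and does not wrap, so threads 8, 16, and 24 keep their own values rather than receiving data from threads 7, 15, and 23.

```cuda
// delta = 1, width = 8: the source lane is computed within each
// 8-lane subsection, and the lowest lane of every subsection is
// left unchanged (no carry-over across subsection boundaries).
__global__ void shfl_up_demo(int *out)
{
    int lane = threadIdx.x & 31;
    int var  = lane;
    int up = __shfl_up_sync(0xFFFFFFFFu, var, 1, 8);
    out[lane] = up;  // lanes 0, 8, 16, 24 keep their own value
}
```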

Question 3, If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection).
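In other words (a sketch with illustrative names): with width = 8, a srcLane of 8 reduces to 8 % 8 == 0, so every lane reads lane 0 of its own subsection rather than thread 8 of the warp.

```cuda
// srcLane = 8, width = 8: the index wraps modulo width within the
// caller's subsection, so thread 7 reads thread 0, thread 15 reads
// thread 8, and so on.
__global__ void shfl_idx_demo(int *out)
{
    int lane = threadIdx.x & 31;
    int var  = lane;
    int v = __shfl_sync(0xFFFFFFFFu, var, 8, 8);  // 8 % 8 == 0
    out[lane] = v;
}
```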

Question 4, again this can be accomplished more easily with CG.
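For the four-dot-products use case, a CG sketch might look like the following (assuming one element per thread; the array names a, b, and out are illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Four independent 8-element dot products per warp, one per tile.
__global__ void dot8(const float *a, const float *b, float *out)
{
    cg::thread_block_tile<8> tile =
        cg::tiled_partition<8>(cg::this_thread_block());

    float prod = a[threadIdx.x] * b[threadIdx.x];

    // shfl_down reduction within the 8-thread tile.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        prod += tile.shfl_down(prod, offset);

    // Lane 0 of each tile holds its 8-element dot product.
    if (tile.thread_rank() == 0)
        out[threadIdx.x / 8] = prod;
}
```

The shuffles here stay within the tile automatically, so there is no need to construct per-sub-group masks by hand.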

Thanks for the response. The rationale for the 4 different masks is as you said: I want to partition a warp into 4 parts of 8 threads each. I will look more into cooperative groups for this task.

Thanks again.