I was just hoping to clarify my understanding of the role of the mask in register shuffle operations. Are masks used for synchronization purposes only or do they control which threads participate in a shuffle operation?
Let’s say that within a 32-thread warp, I have 4 different masks distributed as follows:
Threads 0-7: 0x000000FF
Threads 8-15: 0x0000FF00
Threads 16-23: 0x00FF0000
Threads 24-31: 0xFF000000
If I call __shfl_up_sync with a delta value of 1, will threads 8, 16, and 24 still receive information from 7, 15, and 23, respectively? Or will they receive no data?
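To make the scenario concrete, the setup above can be sketched as the small probe below. The kernel name `probe_shfl_up` and the output buffer are hypothetical, chosen only for illustration. Each lane passes the mask covering its own group of 8 and shuffles its lane ID up by 1 with the default width. As I read the CUDA programming guide, a value read from a lane that is not participating in the call is undefined, so whatever lanes 8, 16, and 24 report here may differ between architectures and should not be relied upon.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe kernel: each group of 8 lanes passes its own 8-bit mask.
__global__ void probe_shfl_up(int *out)
{
    const unsigned lane = threadIdx.x & 31u;
    // 0x000000FF for lanes 0-7, 0x0000FF00 for lanes 8-15, and so on.
    const unsigned mask = 0xFFu << (8u * (lane / 8u));
    // Default width (warpSize): lane i asks for the value held by lane i-1.
    // Lane 0 has no lower neighbour and keeps its own value.
    out[lane] = __shfl_up_sync(mask, static_cast<int>(lane), 1);
}

int main()
{
    int *d_out = nullptr;
    int h_out[32];
    cudaMalloc(&d_out, sizeof(h_out));
    probe_shfl_up<<<1, 32>>>(d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i)
        printf("lane %2d received %d\n", i, h_out[i]);
    cudaFree(d_out);
    return 0;
}
```

The lanes of interest are 8, 16, and 24: if they print 7, 15, and 23, the mask did not stop data from crossing the group boundary on that particular device.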
Similarly, if I use __shfl_sync with an index outside the range of the mask, will the IDs wrap around the way they do when I use the full mask? (e.g. accessing thread 8 from thread 7 would give the value from thread 0)
My testing (on an older Kepler card) seems to indicate that they only control which threads are synchronized before the shuffle call, because I can still get values from threads outside the mask. I just wanted to verify this behavior for the Volta architecture, where I plan on deploying.
Some more background on what I’m interested in: I would like to perform shuffle reduction operations within parts of a warp (e.g. performing 4 dot products of 8 values each within a single warp). How the mask behaves determines how easy or difficult this is to program. Additionally, I have seen the article on cooperative groups. Perhaps that would be a better route?
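For the 4 x 8 dot-product case, one option that does not depend on the mask semantics at all is the `width` argument of the shuffle intrinsics. The sketch below assumes all 32 lanes are active and call with the full mask; the kernel name `dot8x4` and the test data are hypothetical. With `width = 8`, the warp is treated as four independent 8-lane segments, and `__shfl_down_sync` never reads across a segment boundary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: four independent 8-element dot products in one warp.
__global__ void dot8x4(const float *a, const float *b, float *out)
{
    const unsigned lane = threadIdx.x & 31u;
    float v = a[lane] * b[lane];

    // width = 8 splits the warp into four segments. All 32 lanes call with
    // the full mask; lanes whose source would fall outside their segment
    // simply get their own value back.
    for (int offset = 4; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFFu, v, offset, 8);

    // Only the first lane of each segment holds the complete sum.
    if ((lane & 7u) == 0)
        out[lane / 8] = v;
}

int main()
{
    float h_a[32], h_b[32], h_out[4];
    for (int i = 0; i < 32; ++i) { h_a[i] = 1.0f; h_b[i] = float(i); }

    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);

    dot8x4<<<1, 32>>>(d_a, d_b, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // For this input the four sums are 28, 92, 156 and 220.
    for (int s = 0; s < 4; ++s)
        printf("segment %d: %.0f\n", s, h_out[s]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```

The cooperative groups route expresses the same partitioning more explicitly: `cooperative_groups::tiled_partition<8>` on the thread block yields 8-thread tiles with their own `shfl_down` member, so no mask or width has to be managed by hand.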