Warp shuffles with width 1

I have an algorithm that dynamically adapts to the size of the problem. I’m filling in a dynamic programming matrix and processing it with either 1, 2, 4, 8, 16, or 32 threads at a time. At the end of each row, I have to exchange data between those threads using warp shuffles. For the case of 1 thread at a time, there’s obviously no data to exchange, since every thread works on independent data. Still, to keep the algorithm as generic as possible, I’d like the warp shuffle I need for the >1-thread cases to also work for the 1-thread case. The CUDA manual is a bit vague on this topic: it says “The width must be a power-of-2 (i.e., 2, 4, 8, 16 or 32)”, which is a bit odd, since 1 is also a power of 2. I’ve tested code like

val = __shfl_sync(-1, val, 0, 1);

and it seems to work just fine, giving the expected result (every thread keeps its own data). Does anyone have more insight into this? If the width=1 case is explicitly allowed, I can avoid a ton of special-casing for this one instance, since the code can otherwise easily be made generic across all the other width cases.
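To make the situation concrete, here is a minimal sketch of the generic pattern I mean (the kernel, its name, and the use of a template parameter are simplifications for this post, not my actual code):

```cuda
#include <cstdio>

// Hypothetical kernel illustrating the generic pattern: WIDTH is the
// number of threads cooperating on one matrix (1, 2, 4, 8, 16 or 32).
template <int WIDTH>
__global__ void row_exchange(int *data)
{
    int val = data[threadIdx.x];

    // At the end of each row, broadcast lane 0 of each WIDTH-sized
    // sub-section to the whole sub-section. For WIDTH == 1 this should
    // be an identity operation (every thread keeps its own value).
    val = __shfl_sync(0xFFFFFFFFu, val, 0, WIDTH);

    data[threadIdx.x] = val;
}

int main()
{
    int h[32];
    for (int i = 0; i < 32; ++i) h[i] = i;

    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    // WIDTH = 1: in my tests, the data comes back unchanged.
    row_exchange<1><<<1, 32>>>(d);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i) printf("%d ", h[i]);
    printf("\n");
    cudaFree(d);
    return 0;
}
```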

Only NVIDIA can provide an authoritative response. One straightforward way to get that would be to file a bug report asking for the documentation to be clarified. That would solve your immediate problem and help other CUDA programmers as well.

IMHO, from a hardware-implementation perspective it makes sense for the width=1 case to work consistently with your observations.

