I have an algorithm that dynamically adapts to the size of the problem. It fills in a dynamic programming matrix, processing the matrix with either 1, 2, 4, 8, 16, or 32 threads at a time. At the end of each row, the threads have to exchange data using warp shuffles. For the 1-thread case there is obviously no data to exchange, since each thread works on independent data. To keep the algorithm as generic as possible, I'd like the warp shuffle I need for the multi-thread cases to also work for the 1-thread case.

The CUDA manual is a bit vague on this topic. It says "The width must be a power-of-2 (i.e., 2, 4, 8, 16 or 32)", which is a bit odd, since 1 is also a power of 2. I've tested code like
val = __shfl_sync(0xFFFFFFFF, val, 0, 1);
and it seems to work just fine, giving the expected result (i.e., every thread keeps its own data). Does anyone have more insight on this? If the width=1 case is explicitly allowed, I can avoid a lot of special-casing for this one instance, since the code can otherwise easily be made generic across all the other widths.