I have an algorithm that dynamically adapts to the size of the problem. It fills in a dynamic programming matrix, processing the matrix with either 1, 2, 4, 8, 16, or 32 threads at a time. At the end of each row, the threads have to exchange data using warp shuffles. For the 1-thread case there is obviously no data to exchange, since each thread works on independent data. To keep the algorithm as generic as possible, I'd like the warp shuffle I need for the multi-thread cases to also work for the 1-thread case.

The CUDA manual is a bit vague on this topic. It says "The width must be a power-of-2 (i.e., 2, 4, 8, 16 or 32)", which is a bit odd, since 1 is also a power of 2. I've tested code like
val = __shfl_sync(0xFFFFFFFF, val, 0, 1);
and it seems to work just fine, giving the expected result (i.e., every thread keeps its own data). Does anyone have more insight on this? If the width=1 case is explicitly allowed, I can avoid a lot of special-casing for this one instance, since the code can otherwise easily be made generic across all the other widths.