How is shfl_sync implemented?

Ziqi · June 2, 2022, 4:30pm

I am curious about the implementation of the set of shfl_sync functions described at the following link: programming guide. For instance, the function T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize), when called, seems to be a case of pass-by-value. But in effect, what the function really checks is the name var and it finds the local variable of the same name from another thread, and returns that. I would like to understand the implementation for this feature of “find by name” in CUDA.

Robert_Crovella · June 2, 2022, 4:37pm

That’s not how it works.

The parameter var represents the quantity (yes, it is pass-by-value) that the thread will make available for other warp-lanes to read. Other warp lanes that index to that lane to get their incoming value will select whatever that thread has placed in its var argument.

A thread gets its own data from other threads via the function return value.

Each thread places a value to be read by other threads in var.
Each thread selects a source lane to read from (srcLane); the exact indexing varies depending on which shfl variant is being used.
Each thread reads the var value from the targeted source lane.
The value read is returned to that thread as the function return value.
Each thread decides what it will publish. No other thread gets to select the value that will be read from that thread. The only value that can be read from that thread is the value that thread published in var.
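The steps above can be sketched as a small kernel. This is only an illustration, not how the hardware implements the instruction: the kernel name, the host helper run_reverse_demo, and the choice of lane * 10 as the published value are all made up for the example, and it assumes a single full warp of 32 threads.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Each lane publishes one value and reads the value published by the
// lane it selects. Assumes the kernel is launched with exactly 32 threads.
__global__ void reverse_kernel(int *out)
{
    int lane = threadIdx.x & 31;      // lane index within the warp
    int published = lane * 10;        // step 1: the value this lane offers
    int src = 31 - lane;              // step 2: the lane this thread reads from
    // Steps 3 and 4: the return value is whatever lane `src` passed as
    // its own second argument. The argument's name plays no role.
    int received = __shfl_sync(0xffffffffu, published, src);
    out[lane] = received;
}

// Hypothetical host helper: runs one warp and returns what each lane received.
std::vector<int> run_reverse_demo()
{
    std::vector<int> host(32, -1);
    int *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, 32 * sizeof(int));
    assert(err == cudaSuccess);
    reverse_kernel<<<1, 32>>>(dev);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    err = cudaMemcpy(host.data(), dev, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dev);
    return host;
}
```

With this setup, lane i receives (31 - i) * 10, which is the value that lane 31 - i published. Replacing published with any other int expression changes what the other lanes receive, regardless of what the variable is called.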

There is no “find by name”.

A thread can pass a value for publishing via var, and also use the same variable for assignment as the return value. The value published will be the value that was contained in that variable prior to the shuffle op. The value that will “end up” in that variable is the value that was returned by the shuffle op.
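A minimal sketch of that last point, again assuming one full warp of 32 threads and using made-up names (rotate_in_place_kernel, run_rotate_in_place): the same variable x is both the published value and the destination of the result, and the shuffle is applied twice so that the second call publishes the value written by the first.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// x is passed as the published value and also receives the result.
// Assumes the kernel is launched with exactly 32 threads.
__global__ void rotate_in_place_kernel(int *after_one, int *after_two)
{
    const unsigned full_mask = 0xffffffffu;
    int lane = threadIdx.x & 31;
    int src = (lane + 1) & 31;           // read from the next lane, wrapping around
    int x = lane;

    // Publishes the old x (== lane); x then holds what lane `src` published.
    x = __shfl_sync(full_mask, x, src);
    after_one[lane] = x;

    // Publishes the updated x; x then holds lane `src`'s updated value.
    x = __shfl_sync(full_mask, x, src);
    after_two[lane] = x;
}

// Hypothetical host helper: returns 64 ints, the first 32 are x after one
// shuffle, the last 32 are x after two shuffles.
std::vector<int> run_rotate_in_place()
{
    std::vector<int> host(64, -1);
    int *dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, 64 * sizeof(int));
    assert(err == cudaSuccess);
    rotate_in_place_kernel<<<1, 32>>>(dev, dev + 32);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    err = cudaMemcpy(host.data(), dev, 64 * sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dev);
    return host;
}
```

After the first shuffle lane i holds (i + 1) % 32, and after the second it holds (i + 2) % 32, which is consistent with each call publishing the value the variable held just before that call.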

Ziqi · June 2, 2022, 4:46pm

Elegant solution and good explanation! Thanks!