C-level Warp Shuffle functions in CUDA 4.2 final Not just for PTX anymore

In case anyone missed it, the warp shuffle functionality of compute capability 3.0 has also been exposed at the CUDA C level. See the latest CUDA C Programming Guide:


Section B.13 gives the details, and has some nice examples, including broadcast and warp-level reduction without shared memory.

Note for future documentation updates: I don’t think the variable “laneID” is defined anywhere. Presumably, it is a built-in variable, like warpSize?

Ah, it must be a special variable in 4.2.

If it’s not or you’re on an older compiler and you don’t want to mask threadIdx, then this will grab it for you:

__device__ __forceinline__ unsigned int laneId()


  unsigned int id;

asm("mov.u32 %0, %%laneid;" : "=r"(id));

return id;


Maybe it’s coming in a future release because right now “laneId” throws an error with nvcc 4.2:

error: identifier "laneId" is undefined

Thanks for bringing this issue with the Programming Guide to our attention. It will be fixed for the next CUDA release.

thanks for the heads-up Seibert! I have so many kernels to optimize/test where this replaces shared memory :)

edit: will be curious to see what the performance will be like…

I finally have a Kepler to play with (Macbook with GT650m) so of course I experimented with the new sm_30 shuffles operator first.
Two small surprises:

  1. The shuffle up and shuffle down instructions do not have index wrap around. This is carefully documented, but surprising that the behavior is for a thread to do nothing (return its own value) if a thread shifts too far. Not a problem at all, but kind of surprising.

  2. The core __shfl(value, sourceLane) operation is much more powerful than I expected, even after reading the docs. You pass in an argument of which lane you want to read from. The example given by the docs is to have all threads read a value stored in a specific thread… kind of a warp broadcast.

What is not obvious is that different threads can specify different source lanes. (The source does not need to be constant across the warp.) Inside the warp, any thread can read from any thread. This means the shuffle is a fully arbitrary register permutation engine! So for example if you want to make a wraparound register shift, you can just specify say ((threadIdx.x + 7)&31) as the source index, and you get a circular shift of 7. Computing the index to read from takes a few ops of course (so shfl_up and shfl_xor will be faster than if you wrote your own with _shfl()), but the sheer flexibility of arbitrary reads is still terrific. If you need to make some permutations often, the indices can of course be precomputed so the repeated call has no extra overhead.

Anyway, thanks to the NV engineers for this new access! I hadn’t read the docs clearly enough to realize that the core _shfl() wasn’t just a single index broadcast, but a general index access, and even as I coded a quick _shfl() test I expected the behavior to be just a broadcast (using lane 0’s index, probably).

I can think of a few interesting applications already! One is to store small arrays of 32 or less in “transposed” registers in the warp. Then a thread can arbitrarily index into the array x[i] by simply querying __shfl(x, i)! This is a quasi-read-only access, but it can be useful.