Shuffle instructions on Kepler: how are they implemented?

From the NVIDIA Kepler Architecture Whitepaper:

Kepler implements Shuffle instructions allowing threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through the shared memory. With Shuffle instructions, threads within a warp can read values from other threads in the warp.

What happens under the hood when a shuffle instruction is invoked? Through which kind of memory do the threads exchange data?

Thank you very much in advance.

No memory is needed at all, just a crossbar switch.
I would guess that the crossbar already used for memory access is reused for this, but I have no knowledge of the internal architecture (nor is that knowledge needed for CUDA programming).

Thanks for your answer. However, I must say that I do not understand your statement: “no memory is needed at all”. I also think that I need a better understanding of the internal GPU architecture (Kepler in this case) to optimize CUDA code. Therefore, let me ask two further questions:

  1. Does the data exchanged by the shuffle instructions reside in SMX registers?
  2. Does Kepler have any dedicated circuitry that interconnects the registers and enables those instructions?

Thanks again.

OK, let me modify my statement then: no memory is needed apart from the SMX registers that hold the data before and after the instruction.
There is no need to route the exchange via additional registers or memory (unlike previous compute capabilities, where shared memory was the only way to move data between threads within the SM).
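
To make the difference concrete at the programming level, here is a minimal sketch that rotates values by one lane within a single warp, first through shared memory and then with a shuffle. The kernel names rotate_via_shared and rotate_via_shuffle are made up for this illustration. It uses __shfl_sync and __syncwarp, which are the forms available since CUDA 9; the Kepler-era intrinsic was __shfl, without the mask argument. It shows only what the programmer sees and says nothing about how the hardware routes the data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Shared-memory route: every lane stores its value, then loads its neighbour's.
// (Hypothetical kernel name, for illustration only.)
__global__ void rotate_via_shared(const int *in, int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x & 31;
    buf[lane] = in[lane];               // separate store ...
    __syncwarp();                       // ... made visible to the rest of the warp ...
    out[lane] = buf[(lane + 1) & 31];   // ... followed by a separate load
}

// Shuffle route: the value stays in a register and is read directly
// from the register of the source lane.
__global__ void rotate_via_shuffle(const int *in, int *out)
{
    int lane = threadIdx.x & 31;
    int v = in[lane];
    // Full mask: all 32 lanes take part; each lane reads v from lane + 1 (mod 32).
    out[lane] = __shfl_sync(0xffffffffu, v, (lane + 1) & 31);
}

int main()
{
    const int n = 32;                   // exactly one warp
    int h_in[n], h_shared[n], h_shfl[n];
    for (int i = 0; i < n; ++i) h_in[i] = 100 + i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    rotate_via_shared<<<1, n>>>(d_in, d_out);
    cudaMemcpy(h_shared, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    rotate_via_shuffle<<<1, n>>>(d_in, d_out);
    cudaMemcpy(h_shfl, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Both kernels should deliver the value that started in lane i + 1.
    bool same = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; i < n; ++i) {
        int expected = h_in[(i + 1) % n];
        if (h_shared[i] != expected || h_shfl[i] != expected) same = false;
    }
    printf("%s\n", same ? "both kernels agree" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return same ? 0 : 1;
}
```

The shuffle version needs no __shared__ buffer and no store/load pair; the only storage involved is the register holding v in each lane.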

If my assumption that the memory crossbar is reused for the shuffle instruction is correct, then additional circuitry is needed to route register accesses through that crossbar. Otherwise, an extra crossbar would be needed.