CUDA Pro Tip: Do The Kepler Shuffle

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-kepler-shuffle/

When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a…
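
For readers new to the intrinsic, here is a minimal sketch of a warp-level sum reduction using `__shfl_down_sync` (the CUDA 9+ form of the warp shuffle instructions the post describes):

```
__device__ float warpReduceSum(float val)
{
    // Each step pulls the value from the lane `offset` positions higher and
    // adds it; after log2(32) = 5 steps, lane 0 holds the whole warp's sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```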

What happens if I warp shuffle in a thread block of size 1024? Are there 32 warps or just 1?

32. Warps on all current and past architectures have 32 threads.
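
As an illustration of the layout (hypothetical kernel, not from the thread):

```
__global__ void whoAmI()
{
    // warpSize is 32, so a 1-D block of 1024 threads contains 1024 / 32 = 32 warps.
    int laneId = threadIdx.x % warpSize;  // position within the warp, 0..31
    int warpId = threadIdx.x / warpSize;  // which warp in the block, 0..31 when blockDim.x == 1024
    // Shuffles exchange data only among the 32 lanes of a single warp;
    // warps never see each other's registers.
    (void)laneId;
    (void)warpId;
}
```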

Thank you. Would it reduce the shuffle instruction bottleneck (32 shuffles per cycle of throughput) if I packed the x, y, z, w variables into a struct and shuffled that in a tight loop, rather than shuffling x, y, z, w one after another? Is the 32-shuffles-per-cycle throughput limited by bandwidth or by instruction issue throughput?
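
For reference, a sketch of the four-component broadcast the question describes (names are illustrative, not from the thread); the SHFL hardware instruction moves one 32-bit register per issue, so packing the components into a struct does not change the number of shuffle instructions generated:

```
__device__ float4 broadcastFrom(float4 v, int srcLane)
{
    // A four-float struct still costs four shuffle instructions (a 64-bit type
    // such as double compiles to two); packing changes how the code reads,
    // not how many shuffles are issued.
    float4 r;
    r.x = __shfl_sync(0xffffffff, v.x, srcLane);
    r.y = __shfl_sync(0xffffffff, v.y, srcLane);
    r.z = __shfl_sync(0xffffffff, v.z, srcLane);
    r.w = __shfl_sync(0xffffffff, v.w, srcLane);
    return r;
}
```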

I'm optimizing an n-body algorithm with about 20 flops per unit of work. If I add x, y, z, m shuffles, it becomes bottlenecked on them. There is a loading phase in which the 32 warps of a 1024-thread block load the same 32 x, y, z, m values; then each warp processes them on its own with 32 broadcast shuffles, using the __shfl_sync(0xffffffff, value, counter) form. A sketch of that pattern is below.
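
A hedged sketch of that loading/broadcast pattern (function and variable names, and the softened-gravity interaction, are illustrative assumptions, not taken from the post):

```
__device__ float3 accumulateTile(float3 myPos, float3 acc,
                                 float x, float y, float z, float m)
{
    // Each lane has already loaded one body (x, y, z, m). The warp then walks
    // over all 32 lanes, broadcasting that lane's body to every thread with
    // __shfl_sync(0xffffffff, value, lane), and accumulates a ~20-flop
    // interaction per broadcast body.
    for (int lane = 0; lane < 32; ++lane) {
        float bx = __shfl_sync(0xffffffff, x, lane);
        float by = __shfl_sync(0xffffffff, y, lane);
        float bz = __shfl_sync(0xffffffff, z, lane);
        float bm = __shfl_sync(0xffffffff, m, lane);

        float dx = bx - myPos.x;
        float dy = by - myPos.y;
        float dz = bz - myPos.z;
        float distSqr  = dx * dx + dy * dy + dz * dz + 1e-6f;  // softening term
        float invDist  = rsqrtf(distSqr);
        float invDist3 = invDist * invDist * invDist;
        float s = bm * invDist3;

        acc.x += dx * s;
        acc.y += dy * s;
        acc.z += dz * s;
    }
    return acc;
}
```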