CUDA Pro Tip: Do The Kepler Shuffle

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-kepler-shuffle/

When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a…
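
For readers new to the intrinsic, here is a minimal sketch of a warp-level sum reduction using `__shfl_down_sync` (the CUDA 9+ form of the warp shuffle instructions the post describes):

```
__device__ float warpReduceSum(float val)
{
    // Each step pulls the value from the lane `offset` positions higher and
    // adds it; after log2(32) = 5 steps, lane 0 holds the whole warp's sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```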

What happens if I warp shuffle in a thread block of size 1024? Are there 32 warps or just 1?

32. Warps on all current and past architectures have 32 threads.
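
As an illustration of the layout (hypothetical kernel, not from the thread):

```
__global__ void whoAmI()
{
    // warpSize is 32, so a 1-D block of 1024 threads contains 1024 / 32 = 32 warps.
    int laneId = threadIdx.x % warpSize;  // position within the warp, 0..31
    int warpId = threadIdx.x / warpSize;  // which warp in the block, 0..31 when blockDim.x == 1024
    // Shuffles exchange data only among the 32 lanes of a single warp;
    // warps never see each other's registers.
    (void)laneId;
    (void)warpId;
}
```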

Thank you. Would it reduce the shuffle instruction bottleneck (32 shuffles per cycle of throughput) if I packed the x, y, z, w variables into a struct and shuffled that in a tight loop, rather than shuffling x, y, z, w one after another? Is the 32-shuffles-per-cycle throughput limited by bandwidth or by instruction issue throughput?
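
For reference, a sketch of the four-component broadcast the question describes (names are illustrative, not from the thread); the SHFL hardware instruction moves one 32-bit register per issue, so packing the components into a struct does not change the number of shuffle instructions generated:

```
__device__ float4 broadcastFrom(float4 v, int srcLane)
{
    // A four-float struct still costs four shuffle instructions (a 64-bit type
    // such as double compiles to two); packing changes how the code reads,
    // not how many shuffles are issued.
    float4 r;
    r.x = __shfl_sync(0xffffffff, v.x, srcLane);
    r.y = __shfl_sync(0xffffffff, v.y, srcLane);
    r.z = __shfl_sync(0xffffffff, v.z, srcLane);
    r.w = __shfl_sync(0xffffffff, v.w, srcLane);
    return r;
}
```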

I'm optimizing an n-body algorithm with about 20 flops per unit of work. If I add x, y, z, m shuffles, it becomes bottlenecked on them. There is a loading phase in which the 32 warps of a 1024-thread block load the same 32 x, y, z, m values; then each warp processes them on its own with 32 broadcast shuffles, using the __shfl_sync(0xffffffff, value, counter) form. A sketch of that pattern is below.
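
A hedged sketch of that loading/broadcast pattern (function and variable names, and the softened-gravity interaction, are illustrative assumptions, not taken from the post):

```
__device__ float3 accumulateTile(float3 myPos, float3 acc,
                                 float x, float y, float z, float m)
{
    // Each lane has already loaded one body (x, y, z, m). The warp then walks
    // over all 32 lanes, broadcasting that lane's body to every thread with
    // __shfl_sync(0xffffffff, value, lane), and accumulates a ~20-flop
    // interaction per broadcast body.
    for (int lane = 0; lane < 32; ++lane) {
        float bx = __shfl_sync(0xffffffff, x, lane);
        float by = __shfl_sync(0xffffffff, y, lane);
        float bz = __shfl_sync(0xffffffff, z, lane);
        float bm = __shfl_sync(0xffffffff, m, lane);

        float dx = bx - myPos.x;
        float dy = by - myPos.y;
        float dz = bz - myPos.z;
        float distSqr  = dx * dx + dy * dy + dz * dz + 1e-6f;  // softening term
        float invDist  = rsqrtf(distSqr);
        float invDist3 = invDist * invDist * invDist;
        float s = bm * invDist3;

        acc.x += dx * s;
        acc.y += dy * s;
        acc.z += dz * s;
    }
    return acc;
}
```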