Register Cache: Caching for Warp-Centric CUDA Programs

jwitsoe · October 12, 2017, 5:26am

Originally published at: Register Cache: Caching for Warp-Centric CUDA Programs | NVIDIA Technical Blog

Figure 1: Execution and memory hierarchy in CUDA GPUs.

In this post we introduce the “register cache”, an optimization technique that develops a virtual caching layer for threads in a single warp. It is a software abstraction implemented on top of the NVIDIA GPU shuffle primitive. This abstraction helps optimize kernels that use shared memory…

anon42921929 · October 13, 2017, 9:48pm

thanks, this was fun!

I've been avoiding shared memory like it's the plague since 2009, and was really stoked when warp shuffling came along :-)

There are a surprising number of kernels where people use shared memory when they need neither it nor warp shuffles; they can just process more elements per thread instead.

//Jimmy

anon89043345 · February 16, 2018, 12:23pm

The use of __activemask() here is wrong and highly unsafe. https://devblogs.nvidia.com... specifically warns not to "just use __activemask() as the mask" for the *_sync operations.

The reasoning: you want to shuffle/publish the value of thread 0 of the warp. For whatever reason this thread is blocked (e.g. it had a cache miss while the others had a cache hit), so it is not included in __activemask() and does not participate in the __shfl_sync. So the result is undefined!

On Pascal this does not matter because we have lock-step execution, but on Volta you might run into this. Please update your examples to use e.g. FULL_MASK and mention this problem so others won't fall into the same trap.
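
To make the point above concrete, here is a minimal sketch of a lane-0 broadcast that passes an explicit full-warp mask instead of __activemask(). It is not taken from the blog post: the names broadcast_lane0 and run_broadcast_check are hypothetical, FULL_MASK is assumed to be 0xffffffff (all 32 lanes), and the kernel assumes it is launched with one full, non-divergent warp so that every lane named in the mask really reaches the shuffle.

```cuda
#include <cuda_runtime.h>

// Assumption: FULL_MASK names all 32 lanes of the warp.
#define FULL_MASK 0xffffffffu

// Hypothetical kernel: every lane reads the register value held by lane 0.
// Must be launched so that all 32 lanes of the warp execute the shuffle.
__global__ void broadcast_lane0(const int *in, int *out)
{
    int lane  = threadIdx.x % warpSize;
    int value = in[threadIdx.x];   // each lane keeps one element in a register

    // Variant criticized above (do not use):
    //   __shfl_sync(__activemask(), value, 0);
    // __activemask() only reports the lanes that happen to be converged at
    // this instruction, so lane 0 may be absent from the mask on Volta.
    int from_lane0 = __shfl_sync(FULL_MASK, value, 0);

    out[threadIdx.x] = from_lane0 + lane;
}

// Hypothetical host-side check: runs one warp and verifies the broadcast.
bool run_broadcast_check()
{
    const int n = 32;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 100 + i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    broadcast_lane0<<<1, n>>>(d_in, d_out);
    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // Lane 0 holds 100, so lane i should have written 100 + i.
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 100 + i) return false;
    return true;
}
```

The mask passed to __shfl_sync states which lanes are required to take part, so the hardware waits for them to arrive; __activemask() only observes which lanes are currently executing together. If only some lanes are meant to participate, the mask should be computed from the program logic (for example before the branch) rather than queried inside it.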

202476410arsmart · January 31, 2024, 9:04am

In pre-Volta GPUs each warp maintained a single program counter (PC), pointing to the next instruction executed by the warp as well as a mask of all the currently active threads in the warp. Independent thread scheduling in Volta GPUs maintains a PC for every thread, enabling separate and independent execution flows of threads in a single warp, which gives more freedom to the GPU scheduler.

Hi, I do not understand what this means. How does “independent thread scheduling” influence the “register cache” technique? Thanks!!!
