Register Cache: Caching for Warp-Centric CUDA Programs

Originally published at: Register Cache: Caching for Warp-Centric CUDA Programs | NVIDIA Technical Blog

Figure 1: Execution and Memory hierarchy in CUDA GPUs. In this post we introduce the “register cache”, an optimization technique that develops a virtual caching layer for threads in a single warp. It is a software abstraction implemented on top of the NVIDIA GPU shuffle primitive. This abstraction helps optimize kernels that use shared memory…

thanks, this was fun!

I've been avoiding shared memory like the plague since 2009, and was really stoked when warp shuffling came along :-)

There are a surprising number of kernels where people use shared memory when they actually need neither it nor warp shuffles; they can just process more elements per thread instead.

//Jimmy
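To illustrate the "more elements per thread" point, here is a minimal sketch, not taken from the original post. It assumes a 3-point stencil with zero padding at the edges; the names `stencil3`, `ELEMS_PER_THREAD` and `run_stencil_check` are made up for the example. Each thread loads a small window of inputs into registers and reuses it for several outputs, so neighbouring values never go through shared memory or shuffles.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tuning parameter: outputs computed by each thread.
constexpr int ELEMS_PER_THREAD = 4;

// out[i] = in[i-1] + in[i] + in[i+1], with out-of-range inputs treated as 0.
// Each thread keeps ELEMS_PER_THREAD + 2 inputs in registers and reuses them.
__global__ void stencil3(const float* in, float* out, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS_PER_THREAD;
    float r[ELEMS_PER_THREAD + 2];

    // Load the window [base - 1, base + ELEMS_PER_THREAD] once.
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD + 2; ++k) {
        int i = base + k - 1;
        r[k] = (i >= 0 && i < n) ? in[i] : 0.0f;
    }

    // Produce several outputs from the values already held in registers.
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k;
        if (i < n) out[i] = r[k] + r[k + 1] + r[k + 2];
    }
}

// Host-side check against a straightforward CPU loop (error checks omitted).
bool run_stencil_check() {
    const int n = 1000;
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 7);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int perBlock = threads * ELEMS_PER_THREAD;
    const int blocks = (n + perBlock - 1) / perBlock;
    stencil3<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        float left  = (i > 0)     ? h_in[i - 1] : 0.0f;
        float right = (i < n - 1) ? h_in[i + 1] : 0.0f;
        if (h_out[i] != left + h_in[i] + right) return false;
    }
    return true;
}
```

Calling `run_stencil_check()` from `main` exercises the kernel. The trade-off is that the two halo values per thread are loaded redundantly by neighbouring threads; whether that beats a shared-memory or shuffle version depends on the kernel and should be measured.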

The use of __activemask() here is wrong and highly unsafe. https://devblogs.nvidia.com... explicitly says not to "just use __activemask() as the mask" for the *_sync operations.

The reasoning: say you want to shuffle/publish the value of thread 0 of the warp. If for whatever reason this thread is stalled (e.g. a cache miss while the others had a cache hit), it is not included in __activemask() and does not participate in the __shfl_sync. So the result is undefined!

On Pascal this does not matter as we have lock-step execution, but on Volta you might run into this. Please update your examples to use e.g. FULL_MASK and mention this problem so others won't fall into the same trap.
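A minimal sketch of the suggested fix, under stated assumptions: the kernel is launched with whole warps, and every thread reaches the shuffle (no early exits). `FULL_MASK` is defined here as `0xffffffff`; `broadcast_lane0` and `run_broadcast_check` are hypothetical names for this example only. Because all 32 lanes are named in the mask, `__shfl_sync` waits for all of them before exchanging values, whereas `__activemask()` would only report the lanes that happen to be converged at that instruction.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu

// Every lane receives the value held by lane 0 of its warp.
__global__ void broadcast_lane0(const int* in, int* out) {
    int lane = threadIdx.x % warpSize;
    int v = in[threadIdx.x];

    // Divergent work before the shuffle: lanes may not reconverge on their own.
    if (lane & 1) v += 100;
    else          v += 200;

    // Naming all 32 lanes makes the shuffle synchronize the whole warp first.
    // Assumption: all 32 threads of the warp execute this line.
    int b = __shfl_sync(FULL_MASK, v, 0);
    out[threadIdx.x] = b;
}

// Host-side check with a single warp (error checks omitted).
bool run_broadcast_check() {
    const int n = 32;
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    broadcast_lane0<<<1, n>>>(d_in, d_out);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // Lane 0 is even, so it holds in[0] + 200 = 200 when the shuffle runs.
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 200) return false;
    }
    return true;
}
```

If only a subset of lanes is meant to take part, the mask should be computed from the program logic (for example with `__ballot_sync` before the branch) and not read back from `__activemask()`.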

In pre-Volta GPUs each warp maintained a single program counter (PC), pointing to the next instruction executed by the warp as well as a mask of all the currently active threads in the warp. Independent thread scheduling in Volta GPUs maintains a PC for every thread, enabling separate and independent execution flows of threads in a single warp, which gives more freedom to the GPU scheduler.

Hi, I do not understand what this means. How does “Independent thread scheduling” influence this “register cache” technique? Thanks!