Register Cache: Caching for Warp-Centric CUDA Programs

Originally published at:

Figure 1: Execution and memory hierarchy in CUDA GPUs.

In this post we introduce the “register cache”, an optimization technique that develops a virtual caching layer for threads in a single warp. It is a software abstraction implemented on top of the NVIDIA GPU shuffle primitive. This abstraction helps optimize kernels that use shared memory…

thanks, this was fun!

I've been avoiding shared memory like it's the plague since 2009, and was really stoked when warp shuffling came along :-)

There are a surprising number of kernels where people use shared memory when they actually need neither it nor warp shuffles; they can just process more elements per thread instead.
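To make the "more elements per thread" idea concrete, here is a minimal sketch, assuming a 3-point sum `out[i] = in[i-1] + in[i] + in[i+1]` with zero padding at the edges. The kernel name `stencil3_per_thread`, the constant `ELEMS_PER_THREAD`, and the host helper `run_stencil_check` are hypothetical names chosen for illustration, not code from the post. Each thread keeps a sliding window in registers, so neighbouring inputs are reused without shared memory or shuffles.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tuning knob: consecutive outputs computed by each thread.
constexpr int ELEMS_PER_THREAD = 4;

// Each thread produces ELEMS_PER_THREAD consecutive outputs of a 3-point sum,
// holding the window (left, mid, right) in registers between iterations.
__global__ void stencil3_per_thread(const float* in, float* out, int n)
{
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS_PER_THREAD;
    if (first >= n) return;

    float left = (first > 0) ? in[first - 1] : 0.0f;  // zero padding on the left edge
    float mid  = in[first];
    #pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = first + k;
        if (i >= n) break;
        float right = (i + 1 < n) ? in[i + 1] : 0.0f;  // zero padding on the right edge
        out[i] = left + mid + right;
        left = mid;   // slide the window: only one new load per output
        mid  = right;
    }
}

// Host-side check against a CPU reference; returns true if all outputs match.
bool run_stencil_check()
{
    const int n = 1000;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int per_block = threads * ELEMS_PER_THREAD;
    const int blocks = (n + per_block - 1) / per_block;
    stencil3_per_thread<<<blocks, threads>>>(d_in, d_out, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        float expected = h_in[i] + (i > 0 ? h_in[i - 1] : 0.0f) + (i + 1 < n ? h_in[i + 1] : 0.0f);
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

Calling `run_stencil_check()` from `main` exercises the kernel. One caveat of this layout: consecutive threads read addresses `ELEMS_PER_THREAD` elements apart, so global loads are less well coalesced than in a one-element-per-thread kernel; whether the register reuse wins overall depends on the kernel and should be measured.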


The use of __activemask() here is wrong and highly unsafe. especially mentions not to "just use __activemask() as the mask" for the *_sync operations.

The reasoning: you want to shuffle/publish the value of thread 0 of the warp. If for whatever reason this thread is stalled (e.g. a cache miss while the others had a cache hit), it is not included in __activemask() and doesn't participate in the __shfl_sync, so the result is undefined!

On Pascal this does not matter as we have lock-step execution, but on Volta, with its independent thread scheduling, you might run into this. Please update your examples to use e.g. FULL_MASK and mention this problem so others won't fall into the same trap.
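As a rough illustration of the suggested fix, here is a minimal sketch of a lane-0 broadcast that passes an explicit full mask to `__shfl_sync` instead of `__activemask()`. `FULL_MASK` is assumed here to be `0xffffffff` (all 32 lanes), and the kernel `broadcast_lane0` and host helper `run_broadcast_check` are hypothetical names for illustration, not the post's code. The sketch also assumes the block size is a multiple of 32 and that every thread of the warp reaches the shuffle; naming a lane in the mask that never executes the call is itself undefined.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumption: all 32 lanes of the warp take part in the shuffle.
constexpr unsigned FULL_MASK = 0xffffffffu;

// Every thread receives the value held by lane 0 of its own warp.
__global__ void broadcast_lane0(const int* in, int* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int value = in[tid];

    // The mask names the lanes that must participate. __shfl_sync waits for
    // all of them, so lane 0 is included even if it arrives later than others.
    // With __activemask(), a late lane 0 could be missing from the mask.
    int from_lane0 = __shfl_sync(FULL_MASK, value, 0);

    out[tid] = from_lane0;
}

// Host-side check: launches two full warps and verifies each broadcast.
bool run_broadcast_check()
{
    const int n = 64;  // two warps, block size is a multiple of 32
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = 100 + i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    broadcast_lane0<<<1, n>>>(d_in, d_out);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int expected = 100 + (i / 32) * 32;  // value held by lane 0 of thread i's warp
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

Calling `run_broadcast_check()` from `main` exercises the kernel. If only some lanes are meant to take part (for example inside a divergent branch), the mask would need to be computed before the branch, e.g. with `__ballot_sync`, rather than queried with `__activemask()` at the call site.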