Originally published at: Register Cache: Caching for Warp-Centric CUDA Programs | NVIDIA Technical Blog
Figure 1: Execution and Memory hierarchy in CUDA GPUs.
In this post we introduce the “register cache”, an optimization technique that develops a virtual caching layer for threads in a single warp. It is a software abstraction implemented on top of the NVIDIA GPU shuffle primitive. This abstraction helps optimize kernels that use shared memory…
thanks, this was fun!
I've been avoiding shared memory like it's the plague since 2009, and was really stoked when warp shuffling came along :-)
There are a surprising number of kernels where people use shared memory when they actually need neither it nor warp shuffles; they can just process more elements per thread instead.
//Jimmy
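One way to read the "more elements per thread" suggestion, as a minimal sketch: a 3-point stencil in which each thread produces several consecutive outputs and carries its neighbours along in registers, so neither shared memory nor shuffles are needed. The names `stencil3_per_thread`, `ELEMS` and `count_stencil_mismatches` are hypothetical, the zero boundary is an assumption, and the simple per-thread indexing ignores memory coalescing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ELEMS 4  // outputs computed by each thread (illustrative choice)

// out[i] = in[i-1] + in[i] + in[i+1], with out-of-range inputs treated as 0.
__global__ void stencil3_per_thread(const float *in, float *out, int n)
{
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS;
    if (first >= n) return;
    // The sliding window lives in registers; each input is loaded once per thread.
    float left = (first > 0) ? in[first - 1] : 0.0f;
    float mid  = in[first];
    for (int k = 0; k < ELEMS; ++k) {
        int i = first + k;
        if (i >= n) break;
        float right = (i + 1 < n) ? in[i + 1] : 0.0f;
        out[i] = left + mid + right;
        left = mid;   // slide the window by one element
        mid  = right;
    }
}

// Host helper: runs the kernel and compares against a CPU reference.
// Returns the number of mismatches, or -1 on a CUDA error.
int count_stencil_mismatches()
{
    const int n = 10;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) { h_in[i] = float(i + 1); h_out[i] = -1.0f; }

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    stencil3_per_thread<<<1, 32>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float l = (i > 0) ? h_in[i - 1] : 0.0f;
        float r = (i + 1 < n) ? h_in[i + 1] : 0.0f;
        if (h_out[i] != l + h_in[i] + r) ++bad;  // small integers, exact in float
    }
    return bad;
}

int main()
{
    printf("mismatches: %d\n", count_stencil_mismatches());
    return 0;
}
```

Each thread here issues `ELEMS + 2` loads for `ELEMS` outputs instead of three loads per output; whether that beats a shared-memory or shuffle version depends on the kernel and should be measured.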
The use of __activemask() here is wrong and highly unsafe. https://devblogs.nvidia.com... specifically says not to "just use __activemask() as the mask" for the *_sync operations.
The reasoning: suppose you want to shuffle/publish the value of thread 0 of the warp. If that thread is stalled for whatever reason (e.g. a cache miss while the others had a cache hit), it is not included in __activemask() and does not participate in the __shfl_sync. So the result is undefined!
On Pascal this does not matter as we have lock-step execution, but on Volta you might run into this. Please update your examples to use e.g. FULL_MASK and mention this problem so others won't fall into the same trap.
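To make the point concrete, here is a minimal sketch (not the article's code) of publishing lane 0's value with an explicit full mask. The kernel `broadcast_lane0` and the host helper `count_broadcast_mismatches` are hypothetical names, and the sketch assumes the block size is a multiple of 32 so that all 32 lanes of every warp reach the shuffle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu

// Every lane receives the value held by lane 0 of its own warp.
__global__ void broadcast_lane0(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // No early return before the shuffle: all 32 lanes named in the mask must call it.
    int v = (tid < n) ? in[tid] : 0;
    // The mask states which lanes are expected. The hardware waits for all of them,
    // so lane 0 takes part even if it arrives at this instruction late.
    int from_lane0 = __shfl_sync(FULL_MASK, v, 0);
    // With __activemask() the mask would only list lanes that happen to be converged
    // here, which on Volta and later may not include lane 0.
    if (tid < n) out[tid] = from_lane0;
}

// Host helper: runs two full warps and checks each lane against its warp's lane 0.
// Returns the number of mismatches, or -1 on a CUDA error.
int count_broadcast_mismatches()
{
    const int n = 64;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) { h_in[i] = 10 * i + 7; h_out[i] = -1; }

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    broadcast_lane0<<<1, n>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[(i / 32) * 32]) ++bad;  // expected: value of the warp's lane 0
    return bad;
}

int main()
{
    printf("mismatches: %d\n", count_broadcast_mismatches());
    return 0;
}
```

If only part of a warp should take part, the mask has to be computed from the program's own logic (for example with `__ballot_sync` before the branch) rather than sampled with `__activemask()` at the shuffle.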
In pre-Volta GPUs each warp maintained a single program counter (PC), pointing to the next instruction executed by the warp as well as a mask of all the currently active threads in the warp. Independent thread scheduling in Volta GPUs maintains a PC for every thread, enabling separate and independent execution flows of threads in a single warp, which gives more freedom to the GPU scheduler.
Hi, I do not understand what this means. How does “Independent thread scheduling” influence this “register cache” technique? Thanks!!!