I’ve been trying to understand the why and when of constant memory, and I stumbled upon some threads about the constant cache. I want to understand how the constant cache works, but it’s hard to find any information on it.
What I’ve been able to find is this entry in the docs saying that there is a constant cache unit per SM and that it is separate from the L1 cache. But the other sources, namely the Nsight Compute Profiling Guide and the architecture PDF, don’t mention it at all and don’t include it in any of their diagrams.
Is there any information on constant cache? More specifically:
Is it separate from the L1 cache? That is, does using constant memory in place of global memory free up space in the L1 for global data?
Constant memory should be used if all threads within a warp read the same data, for example coefficients. A minimal sketch of that pattern is shown below.
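Here is a minimal sketch (kernel and symbol names are made up for illustration): every thread evaluates a polynomial on its own input, but per loop iteration all threads in the warp read the same coefficient, which is exactly the broadcast pattern the constant cache serves well.

```
#include <cuda_runtime.h>

#define N_COEFFS 8
__constant__ float coeffs[N_COEFFS];  // uploaded once from the host with
                                      // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));

__global__ void poly(const float* __restrict__ x, float* __restrict__ y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Horner's scheme: the index k is uniform across the warp, so each
    // coeffs[k] read is a single broadcast from the constant cache.
    float acc = coeffs[N_COEFFS - 1];
    for (int k = N_COEFFS - 2; k >= 0; --k)
        acc = acc * x[i] + coeffs[k];
    y[i] = acc;
}
```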
Figure 3.1 (page 20) of the paper Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking (https://arxiv.org/pdf/1804.06826) shows a nice diagram.
PTX offers access to around 640 KB of constant memory (the exact amount depends on the version), comprising 10 banks of 64 KB each. But the constant cache itself is much smaller.
Before Volta there were actually 18 banks; since Volta there are 26. The additional ones are used for internal purposes, e.g. mathematical constants.
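From CUDA C++ those banks are not individually visible: user `__constant__` data is capped at 64 KB per program, i.e. roughly one bank, and the device reports this as `totalConstMem`. A small sketch of the cap (the array name is illustrative, and whether the full 64 KB is usable alongside other constant data may vary):

```
#include <cuda_runtime.h>
#include <cstdio>

__constant__ float table[16384];  // 16384 * 4 B = 64 KB, the whole user budget
// Declaring noticeably more __constant__ data would fail with an
// "uses too much constant data" error at build time.

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("totalConstMem = %zu bytes\n", prop.totalConstMem);  // typically 65536

    static float host[16384] = {0};
    cudaMemcpyToSymbol(table, host, sizeof(host));  // upload before any kernel launch
    return 0;
}
```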
Besides freeing up L1 space, it is an additional path for feeding operands. L1 bandwidth is limited, and some memory-heavy algorithms saturate it. Yes, you can use registers, or you can use immediate values encoded in the instructions, but constant data, unlike those, can be dynamically indexed. That makes it usable for lookup tables or inside loops without duplicating code (see the sketch after the next paragraph).
Shared memory bandwidth is also limited (and shared memory can likewise be dynamically indexed).
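To make the dynamic-indexing point concrete, here is a sketch (names like `gauss_w` and `integrate` are made up) where the table index is a runtime value, so it could not be an immediate, yet stays uniform across the warp so each access is still a broadcast. The usual caveat applies: if the index diverged within a warp, the accesses would serialize.

```
#include <cuda_runtime.h>

__constant__ float gauss_w[32];  // quadrature weights, set via cudaMemcpyToSymbol
__constant__ float gauss_x[32];  // quadrature nodes

__global__ void integrate(const float* __restrict__ a,
                          const float* __restrict__ b,
                          float* __restrict__ out, int order, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float lo = a[i], hi = b[i], sum = 0.0f;
    // 'order' is only known at run time: the loop bound and the index k
    // are dynamic, which immediates (and registers) cannot express
    // without unrolling and duplicating code.
    for (int k = 0; k < order; ++k) {
        float t = 0.5f * (hi - lo) * gauss_x[k] + 0.5f * (hi + lo);
        sum += gauss_w[k] * (t * t);  // integrand t^2 as a stand-in
    }
    out[i] = 0.5f * (hi - lo) * sum;
}
```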
There are also some ways to load global device memory through the constant caches and their data path.
All the links provided are great, thanks a lot rs277 and Curefab. The Volta paper is interesting; I’ve been looking for similar resources on Ada and Ampere and couldn’t find any. Does it still apply to those architectures? Also, they mention an L1 and an L1.5 constant cache, but I haven’t been able to find any details on the L1.5. Is it mentioned anywhere in the official documentation?
At least the 8 KiB figure is also stated in the CUDA C++ Programming Guide, in Table 21, as “Cache working set per SM for constant memory”. Or do they mean the L1 instruction cache, which is 8 KiB in the Dissecting PDF? The 2 KiB constant L1 there is given per SM; if it were per SM partition instead, 4 × 2 KiB would add up to the 8 KiB.