Constant cache details

I’ve been trying to understand the why and when of constant memory, and I stumbled upon some threads about the constant cache. I want to understand how the constant cache works, but it’s hard to find any information on it.

What I’ve been able to find is this entry in the docs saying that there is a constant cache unit per SM and that it’s separate from the L1 cache. But the other sources, namely the Nsight Compute profiling guide and the architecture PDF, don’t mention it at all and don’t include it in any of the diagrams.

Is there any information on constant cache? More specifically:

  • Is it separate from the L1 cache? That is, does using constant memory in place of global memory free up space in the L1 for global data?
  • How big is the constant cache?
  • On a miss, does it fall back to the L2 or go straight to constant memory (device memory)?

Constant memory should be used if all threads within a warp read the same data, for example coefficients.
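A minimal sketch of that pattern (the coefficient values and names are made up for illustration):

```
#include <cuda_runtime.h>

// Hypothetical polynomial coefficients: every thread in a warp reads
// the same element in each iteration, which is exactly the broadcast
// pattern the constant cache is designed for.
__constant__ float d_coeffs[8];

__global__ void eval_poly(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i], acc = 0.0f;
    for (int k = 7; k >= 0; --k)      // k is uniform across the warp,
        acc = acc * x + d_coeffs[k];  // so each load is a broadcast
    out[i] = acc;
}

int main()
{
    float h_coeffs[8] = {1, 2, 3, 4, 4, 3, 2, 1};
    // Constant memory is written from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));
    // ... allocate in/out buffers and launch eval_poly as usual ...
    return 0;
}
```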

Figure 3.1 (page 20) of the Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking paper https://arxiv.org/pdf/1804.06826 shows a nice diagram.

PTX offers access to (depending on the version) around 640 KB of constant memory, comprising 10 banks of 64 KB each. But the constant cache is much smaller.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#constant-state-space

Before Volta there were actually 18 banks; since Volta there are 26. The additional ones are used for internal purposes, e.g. mathematical constants.
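You can see the bank mechanism at work even without using __constant__ yourself: kernel parameters, for instance, are passed through a constant bank. A toy example to inspect (the exact bank number and offset below are just what I’ve typically seen; they vary by architecture and toolkit version):

```
// Compile with e.g. `nvcc -arch=sm_80 -cubin scale.cu` and then run
// `cuobjdump -sass scale.cubin`.
__global__ void scale(float *p, float s)
{
    // In the generated SASS, the parameters p and s show up as
    // constant-bank operands such as c[0x0][0x168].
    p[threadIdx.x] *= s;
}
```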

Besides freeing up L1 space, it is an additional path for getting operand data. L1 bandwidth is limited, and some memory-heavy algorithms use all of it up. Yes, you can use registers, or immediate values embedded in the instructions, but unlike those, constant data can be dynamically indexed. That makes it usable for different tables or within loops without duplicating code (see the sketch below).
Shared memory bandwidth is also limited (shared memory can also be dynamically indexed).
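To make the dynamic-indexing point concrete, here is a minimal sketch (the table name, size, and masking are made up):

```
__constant__ float lut[256]; // hypothetical table, filled via cudaMemcpyToSymbol

__global__ void gather(const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // The index is only known at run time, so it cannot be folded into
    // an immediate operand; constant memory still handles it.
    // Caveat: if idx[i] differs across the threads of a warp, the
    // constant-cache broadcast is lost and the access is serialized
    // per unique address.
    out[i] = lut[idx[i] & 255];
}
```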

There are also some ways to load global device memory through the constant cache and its datapath.

Since SM 6.1 it’s 8 KB (“Cache working set per SM for constant memory”).
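Just to keep the two numbers apart: the 64 KB figure you usually see is the user-visible __constant__ space reported by the runtime, while the 8 KB working set is a property of the cache itself and, as far as I can tell, not queryable through the runtime API:

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // totalConstMem is the user-visible __constant__ space (64 KB on
    // current GPUs); the 8 KB "cache working set per SM" is a cache
    // property and is not exposed as a device attribute.
    printf("Total constant memory: %zu bytes\n", prop.totalConstMem);
    return 0;
}
```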

Page 28, section 3.4 of the Volta document Curefab linked above gives the details.

This blog post may be of interest, as it describes a method of efficiently passing constants:

All the links provided are great, thanks a lot rs277 and Curefab. The Volta paper is interesting; I’ve been looking for similar resources on Ada and Ampere and couldn’t find any. Does it still apply to those architectures? Also, it mentions an L1 and an L1.5 constant cache, but I’ve not been able to find any details on the L1.5. Is it mentioned anywhere in the official documentation?

Also, is there any way to check metrics for the constant cache? ncu-ui doesn’t seem to provide any.

At least the 8 KiB figure is also stated in the CUDA C++ Programming Guide, Table 21, as “Cache working set per SM for constant memory”. Or do they mean the L1 instruction cache, which is 8 KiB in the Dissecting PDF? The 2 KiB constant L1 there is per SM; if it were per SM partition instead, 4 × 2 KiB would give the 8 KiB.

For values up to Ampere, you could look into “Capturing the Memory Topology of GPUs”, a thesis from TUM: https://mediatum.ub.tum.de/doc/1689994/1689994.pdf

The constant L1 is always around 2 KiB; the L1.5 is between 32 KiB and 64 KiB.


You may want to ask questions like that on the Nsight Compute forum.

From memory, there are few constant-specific metrics, but Greg covers some constant-cache info here.
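If it helps, one way to see what’s available is to dump the metric catalog and search it; I believe the constant-cache unit shows up under the IDC (indexed constant cache) name in some metric names, but treat the grep pattern as a guess:

```
ncu --query-metrics | grep -i idc
```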