I’ve been trying to understand the why and when of constant memory, and I stumbled upon some threads about the constant cache. I want to understand how the constant cache works, but it’s hard to find any information on it.
What I’ve been able to find is this entry in the docs saying that there is a constant cache unit per SM and that it is separate from the L1 cache. But the other sources, namely the Nsight Compute Profiling Guide and the architecture PDF, don’t mention it at all and don’t include it in any of their diagrams.
Is there any information on constant cache? More specifically:
Is it separate from the L1 cache? That is, does using constant memory in place of global memory free up space in the L1 for global data?
Constant memory should be used if all threads within a warp read the same data, for example coefficients. A minimal sketch of that pattern is shown below.
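Here is a minimal sketch (kernel and symbol names are made up for illustration): every thread evaluates a polynomial on its own input, but per loop iteration all threads in the warp read the same coefficient, which is exactly the broadcast pattern the constant cache serves well.

```
#include <cuda_runtime.h>

#define N_COEFFS 8
__constant__ float coeffs[N_COEFFS];  // uploaded once from the host with
                                      // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));

__global__ void poly(const float* __restrict__ x, float* __restrict__ y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Horner's scheme: the index k is uniform across the warp, so each
    // coeffs[k] read is a single broadcast from the constant cache.
    float acc = coeffs[N_COEFFS - 1];
    for (int k = N_COEFFS - 2; k >= 0; --k)
        acc = acc * x[i] + coeffs[k];
    y[i] = acc;
}
```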
Figure 3.1 (page 20) of the paper Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking (https://arxiv.org/pdf/1804.06826) shows a nice diagram.
PTX offers access to around 640 KB of constant memory (the exact amount depends on the version), comprising 10 banks of 64 KB each. But the constant cache itself is much smaller.
Before Volta there were actually 18 banks; since Volta there are 26. The additional ones are used for internal purposes, e.g. mathematical constants.
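From CUDA C++ those banks are not individually visible: user `__constant__` data is capped at 64 KB per program, i.e. roughly one bank, and the device reports this as `totalConstMem`. A small sketch of the cap (the array name is illustrative, and whether the full 64 KB is usable alongside other constant data may vary):

```
#include <cuda_runtime.h>
#include <cstdio>

__constant__ float table[16384];  // 16384 * 4 B = 64 KB, the whole user budget
// Declaring noticeably more __constant__ data would fail with an
// "uses too much constant data" error at build time.

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("totalConstMem = %zu bytes\n", prop.totalConstMem);  // typically 65536

    static float host[16384] = {0};
    cudaMemcpyToSymbol(table, host, sizeof(host));  // upload before any kernel launch
    return 0;
}
```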
Besides freeing up L1 space, it is an additional path for feeding operands. L1 bandwidth is limited, and some memory-heavy algorithms saturate it. Yes, you can use registers, or you can use immediate values encoded in the instructions, but constant data, unlike those, can be dynamically indexed. That makes it usable for lookup tables or inside loops without duplicating code (see the sketch after the next paragraph).
Shared memory bandwidth is also limited (and shared memory can likewise be dynamically indexed).
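To make the dynamic-indexing point concrete, here is a sketch (names like `gauss_w` and `integrate` are made up) where the table index is a runtime value, so it could not be an immediate, yet stays uniform across the warp so each access is still a broadcast. The usual caveat applies: if the index diverged within a warp, the accesses would serialize.

```
#include <cuda_runtime.h>

__constant__ float gauss_w[32];  // quadrature weights, set via cudaMemcpyToSymbol
__constant__ float gauss_x[32];  // quadrature nodes

__global__ void integrate(const float* __restrict__ a,
                          const float* __restrict__ b,
                          float* __restrict__ out, int order, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float lo = a[i], hi = b[i], sum = 0.0f;
    // 'order' is only known at run time: the loop bound and the index k
    // are dynamic, which immediates (and registers) cannot express
    // without unrolling and duplicating code.
    for (int k = 0; k < order; ++k) {
        float t = 0.5f * (hi - lo) * gauss_x[k] + 0.5f * (hi + lo);
        sum += gauss_w[k] * (t * t);  // integrand t^2 as a stand-in
    }
    out[i] = 0.5f * (hi - lo) * sum;
}
```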
There are also some ways to load global device memory through the constant caches and their data path.
All the links provided are great, thanks a lot rs277 and Curefab. The Volta paper is interesting; I’ve been looking for similar resources on Ada and Ampere and couldn’t find any. Does it still apply to those architectures? Also, they mention an L1 and an L1.5 constant cache, but I haven’t been able to find any details on the L1.5. Is it mentioned anywhere in the official documentation?
At least the 8 KiB figure is also stated in the CUDA C++ Programming Guide, in Table 21, as “Cache working set per SM for constant memory”. Or do they mean the L1 instruction cache, which is 8 KiB in the Dissecting PDF? The 2 KiB constant L1 there is given per SM; if it were per SM partition instead, 4 × 2 KiB would add up to the 8 KiB.