Do 7.x devices have a readonly constant cache?

Back in Kepler, there was a description of a readonly cache:

GK110 adds the ability for read-only data in global memory to be loaded through the same cache used by the texture pipeline via a standard pointer without the need to bind a texture beforehand and without the sizing limitations of standard textures. Since this is a separate cache with a separate memory pipe and with relaxed memory coalescing rules, use of this feature can benefit the performance of bandwidth-limited kernels.

However, then Maxwell came along and completely re-worked how caches are handled.

Maxwell combines the functionality of the L1 and texture caches into a single unit.

Such being the case, I assumed that there was no longer a distinct read only constant cache. While I’m sure readonly data is still cached, I expect it’s done in the same cache as all other device memory accesses. Marking data as readonly might change the cache hints used to access it, but there is no longer a separate cache for it.

That’s what I assumed, but that’s not what the 7.x docs say:

An SM has:

  • a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
  • a unified data cache and shared memory with a total size of 128 KB (Volta) or 96 KB (Turing).

According to this, SMs have both a readonly cache and a unified cache. But I don’t believe it. It seems like a poor design to create a unified cache, but then partition some of it back out again to use solely for readonly data.

That specific text about a read-only constant cache has been copied over into the description of every architecture from Kepler (where it made sense) to 8.x (where I don’t think it does). This looks more like a case of “We don’t really know how this works, so let’s just leave this text alone.” My attempts to have the NVIDIA doc people review this line for accuracy were… unsuccessful.

And maybe that’s because I’m just wrong and this really is a thing? I don’t believe I am, but if so, this would be kind of exciting. Better use of cache can really boost performance, and if readonly memory has additional cache I’m not using, I want to know.

So if there is such a thing, can someone describe the specs for it? How big is it? Is it on-chip (like L1) or more like L2? Is it ‘stealing’ space from normal L1/L2 cache? What’s the performance like? Links?

LoL.

The constant memory space is associated with the decoration __constant__. That functionality (i.e. cache) is present in each SM in every CUDA capable GPU that I am aware of.

For devices of compute capability 7.x, referring to the global memory section in the programming guide, it says that global memory accesses behave the same as 5.x devices:

Global memory behaves the same way as in devices of compute capability 5.x (See Global Memory).

Referring to the 5.x global memory section:

Data that is read-only for the entire lifetime of the kernel can also be cached in the unified L1/texture cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.

Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is, by default, not cached in the unified L1/texture cache, but caching may be enabled using the following mechanisms:

  • Perform the read using inline assembly with the appropriate modifier as described in the PTX reference manual;
  • Compile with the -Xptxas -dlcm=ca compilation flag, in which case all reads are cached, except reads that are performed using inline assembly with a modifier that disables caching;
  • Compile with the -Xptxas -fscm=ca compilation flag, in which case all reads are cached, including reads that are performed using inline assembly regardless of the modifier used.

When caching is enabled using one of the three mechanisms listed above, devices of compute capability 5.2 will cache global memory reads in the unified L1/texture cache for all kernel launches except for the kernel launches for which thread blocks consume too much of the SM’s register file. These exceptions are reported by the profiler.
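As a rough illustration of the first mechanism in that list, the sketch below wraps a global load in inline PTX with the `.ca` (cache at all levels) operator. The names `load_ca`, `copy_ca` and `copy_ca_mismatches` are hypothetical, and the code assumes a 64-bit build and that the pointer refers to global memory. Whether the load actually hits in L1 on a given device is not something this code can show.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: load one float using the PTX .ca cache operator.
// Assumes p points into global memory (64-bit addressing).
__device__ __forceinline__ float load_ca(const float* p) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void copy_ca(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = load_ca(in + i);
}

// Hypothetical host driver: returns how many elements were copied incorrectly.
int copy_ca_mismatches(int n) {
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    copy_ca<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (h_out[i] != h_in[i]);
    return bad;
}
```

The second mechanism needs no source change: per the quoted text, it is the `-Xptxas -dlcm=ca` flag on the compile line.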

So my takeaway is as follows.

  1. Data that is read-only can be cached in the unified L1/Tex cache which is a per-SM resource, and the size is usually indicated in the appropriate architecture whitepaper document. Read-only may be something that is: A. discovered by the compiler or B. discovered by the compiler with the assistance of __restrict__ and const decoration or C. explicitly indicated via the __ldg() intrinsic.
  2. Other data is not cached by default in L1/Tex.
  3. You can get other data to be cached using one of the indicated methods.
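To make options B and C from point 1 concrete, here is a minimal sketch. The kernel and helper names are made up for illustration. Whether the compiler really emits a read-only load for the `const __restrict__` version has to be checked in the generated SASS; the code only shows the source-level spelling.

```cuda
#include <cuda_runtime.h>
#include <vector>

// (B) const + __restrict__: tells the compiler `in` is read-only and not
// aliased, which makes it more likely to pick the read-only load path.
__global__ void scale_restrict(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// (C) __ldg(): requests the read-only load explicitly.
__global__ void scale_ldg(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * __ldg(&in[i]);
}

// Hypothetical host driver: runs both kernels and returns the number of
// elements where either result differs from the expected 2 * in[i].
int scale_mismatches(int n) {
    std::vector<float> h_in(n), h_a(n, -1.0f), h_b(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + 255) / 256;
    scale_restrict<<<blocks, 256>>>(d_in, d_a, n);
    scale_ldg<<<blocks, 256>>>(d_in, d_b, n);

    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float expect = 2.0f * h_in[i];
        bad += (h_a[i] != expect) + (h_b[i] != expect);
    }
    return bad;
}
```

Both kernels compute the same thing; the only difference is how the read-only property is communicated.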

So it appears the “read only cache” that was “new” and introduced with Kepler 3.5 has merged and become the same as the L1/Tex. I use “merged” casually here. For all I know, this “read only cache” may always have been just an adaptation of the Tex cache to make it friendlier and more easily usable by CUDA C++ programmers, without having to jump through all the texture hoops.

The __constant__ space is separate from all this discussion. The constant memory space is logically distinct from the global memory space.
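For contrast with the global-memory paths above, this is roughly what using the constant space looks like. The array `coeffs`, the kernel `poly` and the helper `eval_poly_on_gpu` are illustrative names only. Every thread reads the same `coeffs[k]`, so the access is uniform across the warp.

```cuda
#include <cuda_runtime.h>

// Lives in the constant memory space, not in global memory.
__constant__ float coeffs[4];

// Evaluates c3*x^3 + c2*x^2 + c1*x + c0 per element (Horner form).
__global__ void poly(const float* __restrict__ x, float* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // All threads of a warp read the same coeffs[k]: a uniform access.
        y[i] = ((coeffs[3] * v + coeffs[2]) * v + coeffs[1]) * v + coeffs[0];
    }
}

// Hypothetical host driver: evaluates the polynomial at a single point.
float eval_poly_on_gpu(float x0) {
    const float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    // Constant memory is written from the host through the symbol API.
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, sizeof(float));
    cudaMalloc(&d_y, sizeof(float));
    cudaMemcpy(d_x, &x0, sizeof(float), cudaMemcpyHostToDevice);

    poly<<<1, 32>>>(d_x, d_y, 1);

    float y0 = -1.0f;
    cudaMemcpy(&y0, d_y, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y0;
}
```

Note that the data gets there via `cudaMemcpyToSymbol`, not via an ordinary device pointer, which is one visible sign that this is a different memory space.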

While it is possible that the documentation contains outdated information, once you file a bug with NVIDIA regarding documentation, the people who maintain it certainly have access to the people who designed the hardware to make sure public information accurately reflects the hardware. So, did you file a bug? If so, what was the response from NVIDIA? If it resulted in closing the issue as “not a bug”, that means the documentation still accurately reflects the architecture.

Note that publicly available information may not reflect all of the microarchitectural detail, and NVIDIA in particular has historically remained tight-lipped about many such details. Keeping precise implementation details under wraps unfortunately is a necessary part of the hyper-competitive nature of high-tech industries.

If you absolutely want to know more, you could always design targeted microbenchmarks to unveil some details, as other people have done. I don’t see that knowing the details is necessary for successful CUDA programming; it would be more something to satisfy curiosity.
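One common shape for such a microbenchmark is a single-thread pointer chase, sketched below for the constant space. Everything here is an assumption made for illustration: the names `chain`, `chase` and `cycles_per_load`, the 32 KB array size, and the idea that a jump in cycles per load marks a cache boundary. The timing includes loop overhead, and the compiler is free to schedule instructions around the `clock64()` calls, so treat the numbers as relative and inspect the SASS before drawing conclusions.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kMaxInts = 8192;      // 32 KB, chosen to stay under the 64 KB constant space
__constant__ int chain[kMaxInts];   // chain[i] holds the index of the next element

// One thread follows the chain; each load depends on the previous one,
// so the loads cannot overlap and the loop time approximates load latency.
__global__ void chase(int steps, int* sink, long long* cycles) {
    int idx = 0;
    for (int s = 0; s < steps; ++s) idx = chain[idx];   // warm-up pass
    long long t0 = clock64();
    for (int s = 0; s < steps; ++s) idx = chain[idx];   // timed pass
    long long t1 = clock64();
    *sink = idx;            // keeps the loads from being optimized away
    *cycles = t1 - t0;
}

// Hypothetical host driver: average cycles per dependent load when walking
// `footprint_ints` ints with a stride of `stride_ints` ints.
double cycles_per_load(int footprint_ints, int stride_ints, int steps) {
    std::vector<int> h_chain(footprint_ints);
    for (int i = 0; i < footprint_ints; ++i)
        h_chain[i] = (i + stride_ints) % footprint_ints;
    cudaMemcpyToSymbol(chain, h_chain.data(), footprint_ints * sizeof(int));

    int* d_sink = nullptr;
    long long* d_cycles = nullptr;
    cudaMalloc(&d_sink, sizeof(int));
    cudaMalloc(&d_cycles, sizeof(long long));

    chase<<<1, 1>>>(steps, d_sink, d_cycles);

    long long h_cycles = 0;
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(d_sink);
    cudaFree(d_cycles);
    return static_cast<double>(h_cycles) / steps;
}
```

Calling `cycles_per_load` with a footprint that doubles from, say, 256 ints up to `kMaxInts` and a fixed stride, then looking for steps in the result, is the usual way to guess at cache sizes. The same pattern with a global array read through `__ldg()` would probe the L1/Tex path instead.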

From Professional CUDA C Programming chapter 5, “The read-only cache is separate and distinct from the constant cache. Data loaded through the constant cache must be relatively small and must be accessed uniformly for good performance (all threads of a warp should access the same location at any given time), whereas data loaded through the read-only cache can be much larger and can be accessed in a non-uniform pattern.”

It seems like, at least on Kepler, the constant cache (accessed through constant memory) and the read-only cache (accessed through the texture pipeline via __ldg) use different logic and have different performance characteristics.