Back in Kepler, there was a description of a readonly cache:
GK110 adds the ability for read-only data in global memory to be loaded through the same cache used by the texture pipeline via a standard pointer without the need to bind a texture beforehand and without the sizing limitations of standard textures. Since this is a separate cache with a separate memory pipe and with relaxed memory coalescing rules, use of this feature can benefit the performance of bandwidth-limited kernels.
However, then Maxwell came along and completely re-worked how caches are handled.
Maxwell combines the functionality of the L1 and texture caches into a single unit.
Such being the case, I assumed that there was no longer a distinct read-only constant cache. While I'm sure readonly data is still cached, I expect it's done in the same cache as all other device memory accesses. Marking data as readonly might change the cache hints used to access it, but there is no longer a separate cache for it.
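For concreteness, here is a minimal sketch of the two ways I know of to mark a global-memory load as read-only in CUDA: qualifying the pointer with `const __restrict__`, and calling the `__ldg()` intrinsic (compute capability 3.5 and later). The kernel names and sizes are made up for illustration, and error checking is left out. Which cache these loads actually land in on a given architecture is exactly what I am unsure about.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel. `const ... __restrict__` promises the compiler that
// `in` is read-only and not aliased, which allows (but does not force) it
// to emit read-only loads.
__global__ void scale_restrict(const float* __restrict__ in,
                               float* __restrict__ out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Same computation, but the read-only load is requested explicitly.
__global__ void scale_ldg(const float* in, float* out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * __ldg(&in[i]);
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid = (n + block - 1) / block;
    scale_restrict<<<grid, block>>>(in, out, 2.0f, n);
    scale_ldg<<<grid, block>>>(in, out, 2.0f, n);

    // cudaMemcpy waits for the kernels above to finish before copying.
    cudaMemcpy(host.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", host[0]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Dumping the SASS for both kernels with `cuobjdump -sass` should show whether the compiler emits a different load instruction for the read-only case than for an ordinary global load.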
That's what I assumed, but that's not what the 7.x docs say:
An SM has:
- a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
- a unified data cache and shared memory with a total size of 128 KB (Volta) or 96 KB (Turing).
According to this, SMs have both a readonly cache and a unified cache. But I don't believe it. It seems like a poor design to create a unified cache, but then partition some of it back out again to use solely for readonly data.
That specific text about a read-only constant cache has been copied over into the description of every architecture since Kepler (where it made sense) to 8.x (where I don't think it does). This looks more like a case of "We don't really know how this works, so let's just leave this text alone." My attempts to have the NVIDIA doc people review this line for accuracy were... unsuccessful.
And maybe that's because I'm just wrong and this really is a thing? I don't believe I am, but if so, this would be kind of exciting. Better use of cache can really boost performance, and if readonly memory has additional cache I'm not using, I want to know.
So if there is such a thing, can someone describe the specs for it? How big is it? Is it on-chip (like L1) or more like L2? Is it "stealing" space from normal L1/L2 cache? What's the performance like? Links?
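For reference, this is roughly what I can get out of the runtime today, as a small sketch using `cudaGetDeviceProperties` on device 0 (the device index is an assumption). As far as I can tell, none of these fields reports the size of a read-only or constant cache directly; they only give the constant memory size, the L2 size, and the shared memory per SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    // Device 0 is assumed here; adjust for multi-GPU systems.
    if (cudaGetDeviceProperties(&p, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    printf("%s (compute capability %d.%d)\n", p.name, p.major, p.minor);
    printf("constant memory:      %zu bytes\n", p.totalConstMem);
    printf("L2 cache:             %d bytes\n", p.l2CacheSize);
    printf("shared memory per SM: %zu bytes\n", p.sharedMemPerMultiprocessor);
    return 0;
}
```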