I have some questions on multiprocessor architecture that came up while reading the ‘cuda_c_programming_guide.pdf’. Here is the part about compute capability 6.x:
Where is the read-only constant cache? I can’t find it in the GP104 SM diagram (see below).
What is the size of this read-only constant cache for each multiprocessor? Is it configurable?
Does the ‘L1/texture cache for reads from global memory’ mean data goes directly from global memory to the L1/texture cache, or from global memory to the L2 cache and then from L2 to the L1/texture cache? How do the two paths compare in efficiency?
For Kepler, the fixed-size L1 cache is used to cache accesses to local memory, including register spills; for Maxwell and Pascal, however, the L2 cache shared by all multiprocessors is used instead. Given that, how is the number of blocks assigned to one multiprocessor determined?
To my knowledge, details of the constant cache have not been made public by NVIDIA. There are various papers that attempt to reverse engineer GPU microarchitectures, for example this one:
Jia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” arXiv preprint arXiv:1804.06826 (2018).
I took a quick look and if I read the relevant table correctly, on Pascal there is a 2KB constant cache per SM, backed by a lower-level cache (per TPC?) of 64 KB for GP100 and 32 KB for GP104. Given the overall size of constant memory (one 64KB user-visible bank plus two or three miscellaneous banks), I am doubtful of the L2 claims.
One blog post makes the following claim about access to local memory (I have no way of independently verifying or refuting this information):
Another thing to notice is that unlike Maxwell but similar to Kepler, Pascal caches thread-local memory in the L1 cache. This can mitigate the cost of register spills compared to Maxwell.
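To see where register spills end up in practice, here is a minimal hypothetical kernel sketch (names are made up) whose dynamically indexed local array typically forces local-memory traffic; the compiler's statistics show it at build time regardless of which cache level ultimately serves it:

```cuda
// Hypothetical kernel: the dynamically indexed array below usually defeats
// register allocation, so it lands in (thread-)local memory.
// Compile with `nvcc -arch=sm_61 -Xptxas -v spill.cu` and look for
// "bytes stack frame", "bytes spill stores/loads" in the ptxas output.
__global__ void spill_demo(const int *idx, float *out, int n)
{
    float scratch[64];                // candidate for local memory
    for (int i = 0; i < 64; ++i)
        scratch[i] = i * 0.5f;
    int j = idx[threadIdx.x] % 64;    // index unknown at compile time,
    if (threadIdx.x < n)              // so scratch[] cannot stay in registers
        out[threadIdx.x] = scratch[j];
}
```

Which cache level (L1 or L2) services that local-memory traffic is exactly the per-architecture question under discussion here.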
It’s in the SM. You may not find it on the SM diagram because not all SM diagrams cover all functional aspects of the SM.
You’ve already figured out the constant cache is 8 kB per SM. It’s not configurable (I’m not sure what you would configure about it, anyway).
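For context, the read-only constant cache serves accesses to `__constant__` memory (the 64 KB user-visible bank), among other things. A minimal sketch of how that memory is used (all names here are illustrative, not from the thread):

```cuda
// Sketch: __constant__ data lives in the 64 KB constant bank and is served
// through the per-SM read-only constant cache.
__constant__ float coeffs[16];

__global__ void apply(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[i % 16] * in[i];
}

// Host side, before launching:
//   cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(coeffs));
```

The constant cache performs best when all threads in a warp read the same address, since the value can be broadcast to the whole warp.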
It’s probably best to start by understanding the memory hierarchy. Here (“older”) and here (“newer”) are 2 examples. Referring to the 2nd example (“Memory Chart”), the pathways being referenced are numbered 1 and 2. These correspond to requests from kernel code pertaining to the logical “global” space. For GPUs with L1 enabled (or a unified L1/Tex cache), these would first attempt to “hit” in the L1. Upon a miss, they would attempt to “hit” in the L2. Not the other way around.
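You can query some of the hierarchy's concrete sizes on your own GPU at runtime; a small host-side sketch (device 0 assumed):

```cuda
// Host-side sketch: print the shared L2 size and per-SM resources that
// frame the memory hierarchy being discussed.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("L2 cache:           %d bytes\n",  prop.l2CacheSize);
    printf("Shared mem per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Registers per SM:   %d\n",        prop.regsPerMultiprocessor);
    printf("Const memory total: %zu bytes\n", prop.totalConstMem);
    return 0;
}
```

Note that per-SM L1 and constant-cache sizes are not exposed through `cudaDeviceProp`; those come from the architecture whitepapers and tuning guides.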
Blocks are assigned to multiprocessors by the block scheduler (CWD). The block assignment order is not specified. As long as there are blocks remaining to be scheduled, the CWD will schedule a block on any SM that has sufficient unused resources available to support an additional block. This process continues until the blocks are exhausted. This is closely related to the concept of “occupancy”. If you study up on occupancy, perhaps by using the CUDA occupancy calculator that ships with the CUDA toolkit, you can learn how many blocks can be simultaneously resident on an SM.
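Besides the spreadsheet calculator, the runtime can answer the same question programmatically for a specific kernel; a sketch (the kernel is a placeholder):

```cuda
// Sketch: ask the runtime how many blocks of a given kernel can be
// simultaneously resident per SM, given its register/shared-memory usage.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *p) { if (p) p[threadIdx.x] += 1.0f; }

int main()
{
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, /*blockSize=*/256, /*dynamicSmem=*/0);
    printf("Resident blocks per SM at 256 threads/block: %d\n", blocks_per_sm);
    return 0;
}
```

The result is the minimum over the per-SM limits on threads, registers, shared memory, and the architectural maximum block count.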
On how local memory is cached, I am confused now, since there are three kinds of description that, as I understand them, are inconsistent. Please let me know whether I have misunderstood them:
First is, as I wrote, that Pascal (compute capability 6.x) caches local memory in the L2 cache, quoting from the programming guide:
I have no deeper insights into the GPU memory hierarchy. I agree that some of these bits of the information that you found sprinkled throughout the official documentation seem to be inconsistent. This may have come about due to micro-architectural changes over the years that were only partially reflected in documentation updates.
The best way forward may be to file a bug report with NVIDIA, asking them to review, clarify (separately for each architecture, if need be), and correct this information.
On the 3rd question, I checked the Memory Chart. Do you mean that whether it is local or global memory, the processing unit will always search in the L1 cache (as part of the unified cache shown in this chart) first, and then the L2 cache if necessary?
Yes, pretty much, subject to microarchitectural variation. Some microarchitectures (sm_35 comes to mind) don’t have L1 enabled by default, for global loads. Also, for some microarchitectures, the L1 and Tex cache are separate. That newer diagram depicts newer architectures, where they are generally unified. The other, older diagram depicts other, older architectures.
To get additional insight into microarchitectural caching differences (there are plenty), you may wish to successively read the tuning guides for Kepler through Ampere.