Multiprocessor architecture

Hi all,

I have some questions on multiprocessor architecture that came up while reading ‘cuda_c_programming_guide.pdf’. Here is the part about compute capability 6.x:

So my questions are:

  1. Where is the read-only constant cache? I can’t find it in the GP104 SM diagram (see below).
  2. What is the size of this read-only constant cache for each multiprocessor? Is it configurable?
  3. Does ‘L1/texture cache for reads from global memory’ mean directly from global memory to the L1/texture cache, or from global memory to the L2 cache and then from the L2 cache to the L1/texture cache? How do the two compare in efficiency?
  4. For Kepler, the fixed-size L1 cache is used to cache accesses to local memory, including register spills; for Maxwell and Pascal, however, the L2 cache, which is shared by all multiprocessors, is used instead. So how is the number of blocks assigned to one multiprocessor determined?

best regards

To my knowledge, details of the constant cache have not been made public by NVIDIA. There are various papers that attempt to reverse engineer GPU microarchitecture, for example this one:

Jia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” arXiv preprint arXiv:1804.06826 (2018).

I took a quick look and if I read the relevant table correctly, on Pascal there is a 2KB constant cache per SM, backed by a lower-level cache (per TPC?) of 64 KB for GP100 and 32 KB for GP104. Given the overall size of constant memory (one 64KB user-visible bank plus two or three miscellaneous banks), I am doubtful of the L2 claims.
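For context, the user-visible 64 KB bank mentioned above is the one `__constant__` variables are placed in. A minimal sketch of how it is used (array name and sizes are made up for illustration):

```cuda
#include <cuda_runtime.h>

// Hypothetical 4 KB coefficient table living in the 64 KB user-visible
// constant bank. Reads that hit the per-SM constant cache are broadcast
// efficiently when all threads in a warp read the same address.
__constant__ float coeffs[1024];

__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[0];  // uniform access: one cached broadcast per warp
}

// Host side: the bank is populated before launch, e.g.
//   cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(float) * 1024);
```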

One blog post makes the following claim about access to local memory (I have no way of independently verifying or refuting this information):

Another thing to notice is that unlike Maxwell but similar to Kepler, Pascal caches thread-local memory in the L1 cache. This can mitigate the cost of register spills compared to Maxwell.
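Whichever cache level backs it, register spills land in local memory, and you can observe this at compile time. A hypothetical kernel that forces local-memory traffic (the dynamic index prevents the compiler from keeping the array in registers):

```cuda
// Sketch: compiling with "nvcc -arch=sm_61 --ptxas-options=-v" prints the
// per-thread local memory usage and lines like
// "N bytes spill stores, M bytes spill loads" when registers spill.
__global__ void localMemDemo(float *out, int k)
{
    float buf[64];                        // per-thread array
    for (int i = 0; i < 64; ++i)
        buf[i] = threadIdx.x + i;
    // The dynamic index k forces buf into local memory, so the access
    // goes through whatever cache level the architecture uses for it.
    out[threadIdx.x] = buf[k % 64];
}
```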

Looking at the Programming Guide, all compute capabilities >= 6.1 have 8 KB of constant cache per SM.

I stand corrected! The relevant item in the table is:

Cache working set per SM for constant memory

I had naively searched for all instances of “constant cache”.

  1. It’s in the SM. You may not find it in the SM diagram because not all SM diagrams cover all functional aspects of the SM.
  2. You’ve already figured out that the constant cache is 8 KB per SM. It’s not configurable (I’m not sure what you would configure about it, anyway).
  3. It’s probably best to start by understanding the memory hierarchy. here (“older”) and here (“newer”) are 2 examples. Referring to the 2nd example (“Memory Chart”), the pathways being referenced are numbered 1 and 2. These correspond to requests from kernel code pertaining to the logical “global” space. For GPUs with L1 enabled (or a unified L1/Tex cache), these would first attempt to “hit” in the L1. Upon a miss, they would attempt to “hit” in the L2. Not the other way around.
  4. Blocks are assigned to multiprocessors by the block scheduler (CWD). The block assignment order is not specified. As long as there are blocks remaining to be scheduled, the CWD will schedule them on any SM that has sufficient unused resources available to support an additional block. This process continues until the blocks are exhausted. This is closely related to the concept of “occupancy”. If you study up on occupancy, perhaps by using the CUDA occupancy calculator that ships with the CUDA toolkit, you can learn how many blocks can be simultaneously resident on an SM.
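The same occupancy math is also exposed programmatically through the runtime API. A minimal sketch (the kernel and launch configuration are made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only for the occupancy query.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 0.0f;
}

int main()
{
    int numBlocks = 0;
    int blockSize = 256;       // threads per block
    size_t dynamicSmem = 0;    // dynamic shared memory per block

    // Reports how many blocks of myKernel, at this configuration, can be
    // simultaneously resident on one SM of the current device.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, dynamicSmem);
    printf("Max resident blocks per SM: %d\n", numBlocks);
    return 0;
}
```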

Thank you for sharing! @njuffa

On how local memory is cached, I am confused now, since there are three types of description that, as I understand them, are inconsistent. Please let me know whether I have misunderstood them:

First, as I wrote, Pascal (compute capability 6.x) caches local memory in the L2 cache, sourced from the programming guide:

Second, as you said, ‘Pascal caches thread-local memory in the L1 cache’, which I also found in the Pascal tuning guide:

Third, a statement that ‘local memory is not cached’, which I just found today in the PTX ISA guide:

Thank you for the correction! @rs277

I have no deeper insights into the GPU memory hierarchy. I agree that some of the bits of information you found sprinkled throughout the official documentation seem inconsistent. This may have come about due to microarchitectural changes over the years that were only partially reflected in documentation updates.

The best way forward may be to file a bug report with NVIDIA, asking them to review, clarify (separately for each architecture, if need be), and correct this information.

Thank you for your help! @Robert_Crovella

On the 3rd question, I checked the Memory Chart. Do you mean that for both local and global memory, the processing unit will always look in the L1 cache first (as part of the unified cache shown in this chart) and then in the L2 cache if necessary?

On the 4th question, I realize that I didn’t articulate it well and my thinking was wrong. Thank you for correcting me.

I agree. It’s probably because updates to the different documents are not synchronized. @njuffa

Yes, pretty much, subject to microarchitectural variation. Some microarchitectures (sm_35 comes to mind) don’t have L1 enabled by default for global loads. Also, for some microarchitectures, the L1 and Tex cache are separate. The newer diagram depicts newer architectures, where they are generally unified. The other, older diagram depicts other, older architectures.
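Where L1 caching of global loads is optional, the default can also be influenced at compile time or per load. A sketch under the assumption of an sm_35-class target (the kernel is made up; the flags and `__ldg()` intrinsic are real nvcc/CUDA features):

```cuda
// Compile-time control of the global-load caching policy:
//
//   nvcc -Xptxas -dlcm=ca foo.cu   // cache global loads in L1 and L2
//   nvcc -Xptxas -dlcm=cg foo.cu   // cache global loads in L2 only
//
// Per-load control: __ldg() (sm_35 and later) routes a read-only load
// through the read-only/texture cache path.
__global__ void copyReadOnly(float *out, const float * __restrict__ in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);  // read-only data cache load
}
```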

To get additional insight into microarchitectural caching differences (there are plenty), you may wish to successively read the tuning guides for Kepler through Ampere.

for example, kepler

OK, thank you for your advice! @Robert_Crovella