I have some questions on multiprocessor architecture that came up while reading the ‘cuda_c_programming_guide.pdf’. Here is the part about compute capability 6.x:
Where is the read-only constant cache? I can’t find it in the GP104 SM diagram (see below).
What is the size of this read-only constant cache for each multiprocessor? Is it configurable?
Does the ‘L1/texture cache for reads from global memory’ mean data goes directly from global memory to the L1/texture cache, or from global memory to the L2 cache and then from L2 to the L1/texture cache? How do the two paths compare in efficiency?
For Kepler, the fixed-size L1 cache is used to cache accesses to local memory, including register spills; for Maxwell and Pascal, however, the L2 cache shared by all multiprocessors is used instead. Given that, how is the number of blocks assigned to one multiprocessor determined?
To my knowledge, details of the constant cache have not been made public by NVIDIA. There are various papers that attempt to reverse engineer GPU microarchitectures, for example this one:
Jia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” arXiv preprint arXiv:1804.06826 (2018).
I took a quick look and if I read the relevant table correctly, on Pascal there is a 2KB constant cache per SM, backed by a lower-level cache (per TPC?) of 64 KB for GP100 and 32 KB for GP104. Given the overall size of constant memory (one 64KB user-visible bank plus two or three miscellaneous banks), I am doubtful of the L2 claims.
One blog post makes the following claim about access to local memory (I have no way of independently verifying or refuting this information):
Another thing to notice is that unlike Maxwell but similar to Kepler, Pascal caches thread-local memory in the L1 cache. This can mitigate the cost of register spills compared to Maxwell.
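To see where register spills end up in practice, here is a minimal hypothetical kernel sketch (names are made up) whose dynamically indexed local array typically forces local-memory traffic; the compiler's statistics show it at build time regardless of which cache level ultimately serves it:

```cuda
// Hypothetical kernel: the dynamically indexed array below usually defeats
// register allocation, so it lands in (thread-)local memory.
// Compile with `nvcc -arch=sm_61 -Xptxas -v spill.cu` and look for
// "bytes stack frame", "bytes spill stores/loads" in the ptxas output.
__global__ void spill_demo(const int *idx, float *out, int n)
{
    float scratch[64];                // candidate for local memory
    for (int i = 0; i < 64; ++i)
        scratch[i] = i * 0.5f;
    int j = idx[threadIdx.x] % 64;    // index unknown at compile time,
    if (threadIdx.x < n)              // so scratch[] cannot stay in registers
        out[threadIdx.x] = scratch[j];
}
```

Which cache level (L1 or L2) services that local-memory traffic is exactly the per-architecture question under discussion here.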
It’s in the SM. You may not find it on the SM diagram because not all SM diagrams cover all functional aspects of the SM.
You’ve already figured out the constant cache is 8 kB per SM. It’s not configurable (I’m not sure what you would configure about it, anyway).
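For context, the read-only constant cache serves accesses to `__constant__` memory (the 64 KB user-visible bank), among other things. A minimal sketch of how that memory is used (all names here are illustrative, not from the thread):

```cuda
// Sketch: __constant__ data lives in the 64 KB constant bank and is served
// through the per-SM read-only constant cache.
__constant__ float coeffs[16];

__global__ void apply(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[i % 16] * in[i];
}

// Host side, before launching:
//   cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(coeffs));
```

The constant cache performs best when all threads in a warp read the same address, since the value can be broadcast to the whole warp.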
It’s probably best to start by understanding the memory hierarchy. Here (“older”) and here (“newer”) are 2 examples. Referring to the 2nd example (“Memory Chart”), the pathways being referenced are numbered 1 and 2. These correspond to requests from kernel code pertaining to the logical “global” space. For GPUs with L1 enabled (or a unified L1/Tex cache), these would first attempt to “hit” in the L1. Upon a miss, they would attempt to “hit” in the L2. Not the other way around.
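You can query some of the hierarchy's concrete sizes on your own GPU at runtime; a small host-side sketch (device 0 assumed):

```cuda
// Host-side sketch: print the shared L2 size and per-SM resources that
// frame the memory hierarchy being discussed.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("L2 cache:           %d bytes\n",  prop.l2CacheSize);
    printf("Shared mem per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Registers per SM:   %d\n",        prop.regsPerMultiprocessor);
    printf("Const memory total: %zu bytes\n", prop.totalConstMem);
    return 0;
}
```

Note that per-SM L1 and constant-cache sizes are not exposed through `cudaDeviceProp`; those come from the architecture whitepapers and tuning guides.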
Blocks are assigned to multiprocessors by the block scheduler (CWD). The block assignment order is not specified. As long as there are blocks remaining to be scheduled, the CWD will schedule a block on any SM that has sufficient unused resources available to support an additional block. This process continues until the blocks are exhausted. This is closely related to the concept of “occupancy”. If you study up on occupancy, perhaps by using the CUDA occupancy calculator that ships with the CUDA toolkit, you can learn how many blocks can be simultaneously resident on an SM.
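Besides the spreadsheet calculator, the runtime can answer the same question programmatically for a specific kernel; a sketch (the kernel is a placeholder):

```cuda
// Sketch: ask the runtime how many blocks of a given kernel can be
// simultaneously resident per SM, given its register/shared-memory usage.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *p) { if (p) p[threadIdx.x] += 1.0f; }

int main()
{
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, /*blockSize=*/256, /*dynamicSmem=*/0);
    printf("Resident blocks per SM at 256 threads/block: %d\n", blocks_per_sm);
    return 0;
}
```

The result is the minimum over the per-SM limits on threads, registers, shared memory, and the architectural maximum block count.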
On how local memory is cached, I am confused now, since there are three kinds of description that, as I understand them, are inconsistent. Please let me know whether I have misunderstood them:
First is, as I wrote, that Pascal (compute capability 6.x) caches local memory in the L2 cache, quoting from the programming guide:
I have no deeper insights into the GPU memory hierarchy. I agree that some of these bits of the information that you found sprinkled throughout the official documentation seem to be inconsistent. This may have come about due to micro-architectural changes over the years that were only partially reflected in documentation updates.
The best way forward may be to file a bug report with NVIDIA, asking them to review, clarify (separately for each architecture, if need be), and correct this information.
On the 3rd question, I checked the Memory Chart. Do you mean that whether it is local or global memory, the processing unit will always search in the L1 cache (as part of the unified cache shown in this chart) first, and then the L2 cache if necessary?
Yes, pretty much, subject to microarchitectural variation. Some microarchitectures (sm_35 comes to mind) don’t have L1 enabled by default, for global loads. Also, for some microarchitectures, the L1 and Tex cache are separate. That newer diagram depicts newer architectures, where they are generally unified. The other, older diagram depicts other, older architectures.
To get additional insight into microarchitectural caching differences (there are plenty), you may wish to successively read the tuning guides for Kepler through Ampere.