On a V100 or A100 GPU, when a CUDA core misses data in the registers, where does it look first - shared memory or the L1 cache?

I assume that, in the context of a data fetch for a CUDA core, registers are the fastest, followed by shared memory, then the L1 cache, then the L2 cache, with global memory being the slowest.

I assume that in a GPU, data moves through the following hierarchy:

GLOBAL MEMORY → L2 cache → L1 cache → Shared memory → Registers → CUDA CORE

Question 1: If a CUDA core does not find the requested data in the registers, where does it look next - in shared memory or in the L1 cache? ChatGPT says that on an A100 or V100 GPU it will look in the L1 cache, and only if the data is not found there will it look in shared memory. That does not sound correct to me: since shared memory is faster than the L1 cache, my reasoning is that once a CUDA core cannot find the requested data in registers, it should look in the next-fastest memory (shared memory) before looking in the L1 cache. Could you please let me know whether I am correct, and if not, what I am missing?

Question 2: Suppose that in my CUDA kernel I do not declare any shared memory arrays and instead use only device arrays. I visualize data movement in the GPU following the same hierarchy:

GLOBAL MEMORY → L2 cache → L1 cache → Shared memory → Registers → CUDA CORE

Question - Scenario a): When the CUDA core requests data, will that data be stored in shared memory even though no shared memory arrays are declared and only device arrays are used? I ask because I visualize the data movement hierarchy with shared memory sitting between the L1 cache and the registers. If my understanding is not right, could you please let me know what I am missing?

Question - Scenario b): In the scenario where no shared memory array is declared and only device arrays are used, suppose the data does not get stored in shared memory and is only cached in L1. If that is true, is shared memory effectively switched off when no shared arrays are declared, with data moving directly from the L1 cache to the registers?

Most of this is handled by the compiler. CUDA cores do not “look for data”. They operate on registers, always (yes, I know there are exceptions. I don’t view them as essential material for this discussion). The CUDA GPU is a load-store architecture (yes, I know there are exceptions.) Therefore all instructions are either moving data to and from registers, or operating on data in registers. That is true for any instruction involving a CUDA core.

Shared memory is in the memory hierarchy to be sure, but data does not flow through shared memory “naturally” on its way from global memory to a register, the way it naturally flows through the caches on such a journey (please, let’s skip the exceptions for now, OK?). Caches get populated “automatically”. Shared memory has to be populated explicitly, that is, you have to write source code to populate shared memory. You don’t have to write (unusual, exceptional, explicit) source code to populate the caches.
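As a hedged sketch of that distinction (the kernel names and the tile size of 256 are illustrative choices of mine, not anything from this thread): the first kernel does an ordinary global load, which the L1/L2 caches service automatically; the second stages data through shared memory, which happens only because the code explicitly says so.

```
__global__ void plain_load(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];              // ordinary load: global -> (L2, L1) -> register; shared memory not involved
    out[i] = v * 2.0f;
}

__global__ void staged_load(const float *in, float *out)
{
    __shared__ float tile[256];   // explicitly declared shared storage
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];    // explicit copy written by the programmer: global -> register -> shared
    __syncthreads();              // make the tile visible to the whole block
    float v = tile[threadIdx.x];  // shared -> register
    out[i] = v * 2.0f;
}
```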

In that respect the GPU behaves much like a CPU: a global load inspects the caches for a hit or miss, and in the event of a miss the request goes on to main (global) memory.

That (global-space data being looked for in, or flowing through, shared memory) is never true, currently, of any CUDA GPU. Shared memory transactions are explicitly identifiable in source code, and they will never involve traffic to the global space. Likewise, global space traffic will never touch shared memory, unless it is one of the unusual asynchronous load instructions, recently introduced in CUDA, that load shared memory directly from the global space. Even in these cases, the activity is explicitly identifiable at the source code level. It does not happen automatically, the way traffic to/from the caches happens “automatically” for global traffic (ordinarily).
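Those asynchronous loads are exposed, for example, through cooperative_groups::memcpy_async (CUDA 11 and later). Here is a hedged sketch, with a kernel name and tile size of my own choosing; even in this case, the global-to-shared transfer happens only because the source code explicitly asks for it.

```
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void async_staged(const float *in, float *out)
{
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // explicit, source-level request: copy one tile from the global space
    // into shared memory, without staging through registers
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);              // wait for the asynchronous copy to complete

    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```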

No, not correct. In my view I have already covered this. The correct pattern would be closer to:

GLOBAL MEMORY → L2 cache → L1 cache → Registers → CUDA CORE

No. Already covered. Data only touches shared memory when you as a programmer write explicit source code to make that happen. And when you do, that data never follows the path:

GLOBAL MEMORY → L2 cache → L1 cache → Shared memory

Unless you explicitly write the code to cause that to happen.

Unless you explicitly move data to or from shared memory, it never touches shared memory. Ordinary loads or stores to/from the global space will not flow through shared memory, and data requested from the global space cannot be found, located, or looked for, in shared memory. In CUDA, the shared space and the global space are logically distinct.

For reference, a GPU memory hierarchy diagram looks like this


Hi Robert, thank you so much for such a good explanation here and also on Stack Overflow. That GPU memory hierarchy diagram also helped me visualize the flow of data correctly. I can see the diagram on Stack Overflow, but when I click the Nsight document link it says “PAGE NOT FOUND”. Are there other links where I can find the Nsight documentation?

Also, the occupancy calculator spreadsheet is nowhere to be found. Is there an alternative to it?

I’ve updated that posting. A similar chart is here.

If you really want an occupancy calculator spreadsheet, you can download and install an older CUDA toolkit and get it from the installation directory. I don’t wish to facilitate it any more than that, because the tool is no longer maintained (that is why it’s not readily available), and anyone using it now may be taking on some risk. Instead, you are expected to use the occupancy API, or similar facilities in the profilers.
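For reference, a minimal sketch of the occupancy API approach (the kernel is a placeholder of my own; cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaGetDeviceProperties are documented CUDA runtime calls):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { /* placeholder kernel */ }

int main()
{
    int blockSize = 256;          // threads per block you intend to launch
    size_t dynSmemBytes = 0;      // dynamic shared memory per block

    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, my_kernel, blockSize, dynSmemBytes);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // occupancy = resident warps per SM / maximum warps per SM
    float occupancy = (maxBlocksPerSM * blockSize / 32.0f) /
                      (prop.maxThreadsPerMultiProcessor / 32.0f);
    printf("max active blocks per SM: %d, occupancy: %.0f%%\n",
           maxBlocksPerSM, occupancy * 100.0f);
    return 0;
}
```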
