When considering profiler metrics pertaining to the GPU memory hierarchy, it’s useful to have a good mental picture of what that hierarchy looks like. Here is one example:
Looking at that diagram, we see there are at least two paths by which requests can be made to the L2: one coming from the L1, the other coming from the read-only (RO) cache mechanism. Note that the cache partitioning at this level (i.e. at the L1 level) may vary by GPU type.
Therefore, an l2_l1 metric is concerned with the requests coming from the L1. It does not take into account (i.e. count requests from) other paths that may be making requests of the L2. Likewise, other L2 metrics may be looking at other paths into the L2, or at all requests targeting the L2 together.
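For example, on a Kepler-class GPU the nvprof profiler exposes metrics along these lines. This is just a minimal sketch: the metric names below (l2_l1_read_hit_rate, l2_l1_read_transactions) are from that era's metric set, and the file/kernel names are made up, so confirm the names against nvprof --query-metrics on your device.

```
// vecadd.cu - minimal kernel to profile (illustrative sketch)
// Build:   nvcc -o vecadd vecadd.cu
// Profile: nvprof --metrics l2_l1_read_hit_rate,l2_l1_read_transactions ./vecadd
//          (Kepler-era metric names; verify with: nvprof --query-metrics)
#include <cstdio>

__global__ void vecadd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // the global loads these metrics would count
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    vecadd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    printf("done\n");
    return 0;
}
```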
Data is not explicitly cached at kernel launch. Data may already be in the cache from previous activity, but a kernel launch itself does not trigger population of caches with data.
When kernel code makes a request for data in the global logical space, and the L1 is enabled for global loads (this varies by GPU type), the request is first made to the L1. If the data is resident in the L1 (a "hit"), the request is "serviced" from the L1, and no further "downstream" activity takes place. If the data is not resident in the L1 (a "miss"), the L1 generates a request to the L2 for the data. There we have a similar hit/miss possibility, with similar behavior: if the data is resident in the L2, the L2 services the L1's request; if the data is not resident in the L2, the L2 requests the data from GPU DRAM.
When the data eventually makes it back to the L1, then L1 cache lines are populated with that data, and the data is also returned to the code that requested it.
(If the L1 is not enabled for global activity, then requests for global data go directly to the L2.)
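On GPUs whose L1 does cache global loads (e.g. Fermi), which path a global load takes can also be selected at compile time via the documented -dlcm ptxas option. A small sketch (the file and kernel names here are made up):

```
// cache_path.cu - compile-time selection of the global-load caching path,
// applicable on GPUs whose L1 caches global loads (e.g. Fermi):
//
//   nvcc -Xptxas -dlcm=ca cache_path.cu   // cache global loads in L1 and L2 (default)
//   nvcc -Xptxas -dlcm=cg cache_path.cu   // cache global loads in L2 only
__global__ void touch(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // with -dlcm=cg this load bypasses the L1
}
```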
The GPU is a load/store architecture (mostly; I guess someone will argue with me about this), so the way your kernel code "requests" global data is via a global load instruction, which usually takes one of two forms at the machine code level: LD or LDG. These instructions request that data be placed in a particular GPU register. When the data is "returned" by the L1 to "your code", it means that this register has become populated with the "correct" value, and the GPU warp scheduler is also (perhaps indirectly) informed of this fact.
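To see those instruction forms, here is a minimal sketch (assuming a cc 3.5+ GPU for the __ldg() intrinsic); you can inspect the SASS the compiler actually emitted with cuobjdump -sass:

```
// loads.cu - the two machine-level flavors of a global load.
// Build:       nvcc -arch=sm_35 -cubin loads.cu
// Disassemble: cuobjdump -sass loads.cubin
__global__ void loads(const float * __restrict__ in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];          // generic global load: LD or LDG, depending on
                                  // architecture and what the compiler can prove
        float b = __ldg(&in[i]);  // explicit read-only (LDG) load, cc 3.5+
        out[i] = a + b;           // a and b sit in registers once the loads are
                                  // "returned"; the store then writes them out
    }
}
```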
L1 is a per-SM resource. There is a separate L1 cache for every SM. Code executing in SM 0 might “hit” in L1, whereas similar code (e.g. a different threadblock of the same kernel) could be executing on SM 1, and it might request the same data, but “miss” in the L1 and therefore have to go to the L2 to get the same data.
The L2 is a device-wide resource. All SMs have access to the same L2 and the same L2 data.
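Here is a sketch of where that distinction shows up (kernel name hypothetical): every block reads the same small table, so each SM's L1 takes its own first-touch misses, while the shared L2 can service the misses from later-arriving SMs without another trip to DRAM (the actual behavior depends on timing, eviction, and cache configuration):

```
// sum_table.cu - every block reads the same 64 floats.
__global__ void sum_table(const float *table, float *out)
{
    float s = 0.0f;
    // L1 is per-SM: each SM must populate its own copy of these lines.
    // L2 is device-wide: once any SM has pulled the lines through, misses
    // from other SMs can be serviced by the L2 rather than by DRAM.
    for (int j = 0; j < 64; j++)
        s += table[j];
    out[blockIdx.x * blockDim.x + threadIdx.x] = s;
}
```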
(Note that Fermi does not have a RO cache mechanism, but hierarchically it has the Texture cache in the same place, which is also a read-only cache system, with “its own” connection to L2.)