Difference between L2 read/write transactions and L2_L1 read/write transactions?

Hello everyone,

I was trying to understand those metrics on my NVIDIA Tesla M2075, which is based on the Fermi architecture.

But I am confused about these two metrics. The documentation says that for compute capability 2.0, global memory is cached in L1 and L2, and that this happens automatically.

How does this happen?

I don’t understand how this happens, which is why I can’t differentiate between them.

Please correct me if I am wrong: is the data cached when we move it from the CPU to the GPU, or when we launch the kernel?

My guess is that it is first cached in the L2, since the L2 is larger and is shared among SMs. Does that mean data is cached in the L1 only when threads within a warp ask for it, and that is why the L1 reads from and writes to the L2?

Thanks in advance for your responses.

When considering profiler metrics pertaining to the GPU memory hierarchy, it’s useful to have a good mental picture of what that hierarchy looks like. Here is one example:


Looking at that diagram, we see there are at least two paths by which requests can reach the L2: one coming from the L1, the other coming from the read-only (RO) cache mechanism. Note that the cache partitioning at this level (i.e. at the L1 level) may vary by GPU type.

Therefore, an l2_l1 metric is concerned with the requests coming from the L1. It does not take into account (i.e. count requests from) other paths that may be making requests of the L2. Likewise, other L2 metrics may be looking at other paths into the L2, or at all requests targeting the L2 together.
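To make the distinction concrete, here is a toy bookkeeping sketch in C++. It is not profiler code, and the type and member names are made up for this illustration. It only shows the idea that a counter restricted to the L1 path sees a subset of what an all-paths L2 counter sees.

```cpp
#include <cassert>
#include <cstdint>

// Toy bookkeeping only. These names are hypothetical; they are not
// profiler metric names and not part of any NVIDIA API.
enum class L2Path { FromL1, FromReadOnly };

struct L2RequestCounters {
    std::uint64_t from_l1 = 0;         // requests that arrived via an L1
    std::uint64_t from_read_only = 0;  // requests that arrived via the RO/texture path

    // Record one request arriving at the L2 on the given path.
    void record(L2Path path) {
        if (path == L2Path::FromL1) {
            ++from_l1;
        } else {
            ++from_read_only;
        }
    }

    // What an "l2_l1"-style counter would report: the L1 path only.
    std::uint64_t l1_path_only() const { return from_l1; }

    // What a counter covering every path into the L2 would report.
    std::uint64_t all_paths() const { return from_l1 + from_read_only; }

    // Fraction of all recorded L2 requests that came through the L1 path.
    double l1_fraction() const {
        assert(all_paths() > 0 && "no requests recorded yet");
        return static_cast<double>(from_l1) / static_cast<double>(all_paths());
    }
};
```

If two requests are recorded on the L1 path and one on the read-only path, `l1_path_only()` reports 2 while `all_paths()` reports 3.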

Data is not explicitly cached at kernel launch. Data may already be in the cache from previous activity, but a kernel launch itself does not trigger population of caches with data.

When kernel code makes a request for data in the global logical space, if the L1 is enabled for global loads (this varies by GPU type) then first the request will be made to the L1. If the data is resident in the L1 (a “hit”), then the request is “serviced” from the L1, and no further “downstream” activity takes place. If the data is not resident in the L1 (a “miss”), then the L1 will generate a request to the L2 for the data. We have a similar hit/miss possibility, and similar behavior. If the data is resident in the L2, the L2 will service the L1 request. If the data is not resident in the L2, the L2 will request the data from GPU DRAM.

When the data eventually makes it back to the L1, then L1 cache lines are populated with that data, and the data is also returned to the code that requested it.

(If the L1 is not enabled for global activity, then the requests for global data go directly to the L2.)
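The request flow described above can be sketched as a toy C++ model. This is a simplification under stated assumptions: caches are unbounded sets of line addresses with no eviction, the L2 is assumed to keep a copy of anything it fetches from DRAM, and every name is hypothetical. It is meant to show the order of the hit/miss checks, not real hardware behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Toy model only: real caches have finite capacity, eviction policies and
// line/sector granularity. All names here are hypothetical.
enum class ServicedBy { L1, L2, Dram };

struct ToyHierarchy {
    bool l1_enabled_for_global = true;     // varies by GPU type
    std::unordered_set<std::uint64_t> l1;  // line addresses resident in the L1
    std::unordered_set<std::uint64_t> l2;  // line addresses resident in the L2
    std::uint64_t l1_to_l2_requests = 0;   // requests the L1 issued after a miss
    std::uint64_t direct_l2_requests = 0;  // requests that went straight to the L2
    std::uint64_t dram_requests = 0;       // requests the L2 issued to DRAM

    ServicedBy load(std::uint64_t line) {
        if (l1_enabled_for_global) {
            if (l1.count(line) != 0) {
                return ServicedBy::L1;     // hit: no downstream activity
            }
            ++l1_to_l2_requests;           // miss: the L1 asks the L2
        } else {
            ++direct_l2_requests;          // L1 not used for global loads
        }

        ServicedBy result = ServicedBy::L2;
        if (l2.count(line) == 0) {
            ++dram_requests;               // L2 miss: fetch from GPU DRAM
            l2.insert(line);               // assumption: the L2 keeps the line
            result = ServicedBy::Dram;
        }
        if (l1_enabled_for_global) {
            l1.insert(line);               // L1 is populated on the way back
        }
        assert(l2.count(line) != 0);       // toy invariant: serviced lines sit in the L2
        return result;
    }
};
```

In this model, loading the same line twice with the L1 enabled goes to DRAM the first time and is serviced by the L1 the second time, so only one request ever reaches the L2. With `l1_enabled_for_global` set to false, both loads reach the L2.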

The GPU is a load/store architecture (mostly, I guess someone will argue with me about this), so the way your kernel code “requests” global data is via a global load instruction, which usually takes one of two forms at the machine code level: LD or LDG. These instructions request that data be placed in a particular GPU register. When the data is “returned” by the L1 to “your code”, it means that this register becomes populated with the “correct” value and the GPU warp scheduler is also (perhaps indirectly) informed of this fact.

L1 is a per-SM resource. There is a separate L1 cache for every SM. Code executing in SM 0 might “hit” in L1, whereas similar code (e.g. a different threadblock of the same kernel) could be executing on SM 1, and it might request the same data, but “miss” in the L1 and therefore have to go to the L2 to get the same data.

The L2 is a device-wide resource. All SMs have access to the same L2 and the same L2 data.
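The per-SM versus device-wide distinction can also be shown with a small toy model in C++. Again this is only a sketch: the caches are unbounded sets with no eviction, the L2 is assumed to retain what it fetches, and `ToyDevice` and its members are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy model only; all names are hypothetical.
enum class Source { OwnL1, SharedL2, Dram };

struct ToyDevice {
    std::vector<std::unordered_set<std::uint64_t>> l1_per_sm;  // one L1 per SM
    std::unordered_set<std::uint64_t> l2;                      // shared by all SMs

    explicit ToyDevice(std::size_t sm_count) : l1_per_sm(sm_count) {}

    // Load one line on behalf of code running on SM `sm`.
    Source load(std::size_t sm, std::uint64_t line) {
        assert(sm < l1_per_sm.size());
        auto& l1 = l1_per_sm[sm];
        if (l1.count(line) != 0) {
            return Source::OwnL1;          // hit in this SM's own L1
        }
        const bool l2_hit = l2.count(line) != 0;
        l2.insert(line);                   // assumption: the L2 keeps the line
        l1.insert(line);                   // only this SM's L1 is populated
        return l2_hit ? Source::SharedL2 : Source::Dram;
    }
};
```

With two SMs, a first load of a line on SM 0 goes to DRAM and a repeat on SM 0 hits its L1. The same line requested on SM 1 misses in SM 1's L1 but is found in the shared L2.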

(Note that Fermi does not have a RO cache mechanism, but hierarchically it has the Texture cache in the same place, which is also a read-only cache system, with “its own” connection to L2.)

Thanks for your response, it’s very helpful.

Yet I still can’t find the answer to how this happens, supposing it’s my first kernel launch. When we move data from the CPU to GPU global memory, is it cached directly in L2?

I ask this because I am seeing something strange when I run my kernel several times (100 iterations): the first iteration takes longer than the others. This is for matrix multiplication.

Iteration 1: 7.69 us
Iterations 2 to 98: ~7.19 us

Can we say that the 0.50 us difference is due to the first data request, and that once the data is available in L1 the other iterations take less time? This assumes the lifetime of data in L1 and L2 is tied to the lifetime of the host allocation.

It is worth mentioning that this difference shows up when the data size is not very big.

Thanks in advance,

I wouldn’t be able to say exactly why your first kernel launch takes a bit longer (less than 10%) than subsequent launches. The only thing I can say is that it is behavior I have seen elsewhere; it is not unique to that kernel. My guess would be that caching plays a role.

Since kernel launch overhead is usually on the order of ~5 us (perhaps much longer), explaining differences of 0.5 us can be quite challenging. I usually don’t try to explain small performance differences in very small programs. A kernel that executes for 7 us is a very short kernel.
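To put the numbers in perspective, here is the arithmetic as a small C++ helper (`percent_of` is a name made up for this example). It uses the timings from the question (7.69 us and 7.19 us) and treats the ~5 us launch overhead as a rough figure, not a measured one.

```cpp
#include <cassert>

// Returns `part` as a percentage of `whole`.
double percent_of(double part, double whole) {
    assert(whole > 0.0);
    return part / whole * 100.0;
}

// Timings quoted in the question, in microseconds.
const double first_iteration_us = 7.69;
const double steady_state_us = 7.19;

// Rough launch overhead figure; an order-of-magnitude assumption only.
const double launch_overhead_us = 5.0;

// Extra time taken by the first iteration: about 0.5 us.
const double extra_us = first_iteration_us - steady_state_us;
```

Here `percent_of(extra_us, steady_state_us)` comes out at roughly 7%, and `percent_of(extra_us, launch_overhead_us)` at roughly 10%. The gap is small compared with both the kernel duration and the launch overhead, which is why it is hard to pin on any one cause.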

Regarding your question about whether the results of a cudaMemcpy to global memory end up in the L2 cache, my recollection is that they do. However, it should not be too difficult to confirm this with microbenchmarking.

My mental model of the GPU is that all traffic to/from GPU DRAM flows through the L2. However, I don’t know that the behavior of cudaMemcpy in this respect is documented anywhere.