L1 and L2 cache hit rate

voilouvoila · April 5, 2014, 4:20am

I have a question about the profiler metrics that I get for the simple vector-vector addition kernel below,

__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    if (id < n)
        c[id] = a[id] + b[id];
}

The profiler tells me the following,

Kernel: vecAdd(double*, double*, double*, int)
      1                  gld_throughput          Global Load Throughput  102.92GB/s
      1            dram_read_throughput   Device Memory Read Throughput  103.54GB/s
      1              l2_read_throughput           L2 Throughput (Reads)  103.67GB/s
      1        l1_cache_global_hit_rate              L1 Global Hit Rate       0.00%
      1             l2_l1_read_hit_rate          L2 Hit Rate (L1 Reads)       0.00%

From the output, I surmise that the kernel is not finding any of the necessary data in the L1 or L2 caches, which is why it goes to DRAM. This explains why gld_throughput is approximately equal to dram_read_throughput.

My question is: why is the kernel not finding any data in the caches? I have not disabled L1 caching, so I don’t see why the L1 and L2 hit rates are 0.

And further, why is l2_read_throughput nonzero if the L2 hit rate is 0?

njuffa · April 5, 2014, 4:31am

This is a streaming kernel, where every piece of data is touched exactly once. Since there is no data re-use, there are no cache hits. The L2 read throughput matches the global load throughput because all load data is read through the L2, even though it is not found there and has to be fetched from DRAM.

voilouvoila · April 5, 2014, 4:56am

Very clear, thanks a lot!

Skybuck · April 5, 2014, 5:51am

Just because data is touched once doesn’t necessarily mean there would be no cache hits.

Any chip could read in more data than requested, so the next data access could still hit the cache.

So not seeing any cache hits doesn’t make sense.

njuffa · April 5, 2014, 6:25am

Agreed, lack of data re-use for each individual data item does not necessarily mean there could be no cache hits. An initial access to a cache line causes the entire line to be fetched, even if the access that triggered the miss uses only part of the data in that line. A subsequent access to a different location in the previously unused portion of that cache line could then hit the cache.

However, in this code the data is read in contiguous streams following the “base + tid” access pattern. This causes each cache line to be fetched and consumed in its entirety on initial access, meaning we do not have multiple accesses to the fetched line, and thus no cache hits.

voilouvoila · April 5, 2014, 2:51pm

Story checks out:

__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    int id = blockIdx.x*blockDim.x+threadIdx.x;
 
    // Make sure we do not go out of bounds
    if (id < n-1) {
        c[id] = a[id]  + b[id];
        c[id] = a[id+1];
    }
}

gives,

Kernel: vecAdd(double*, double*, double*, int)
          1            dram_read_throughput   Device Memory Read Throughput  95.880GB/s
          1                  gld_throughput          Global Load Throughput  189.86GB/s
          1        l1_cache_global_hit_rate              L1 Global Hit Rate      37.76%
          1             l2_l1_read_hit_rate          L2 Hit Rate (L1 Reads)      19.71%

Thanks for the clarification, njuffa.

voilouvoila · April 5, 2014, 3:13pm

On a related note, I’ve been having trouble finding clear documentation online about what exactly the GPU does with data with respect to the caches. I’ve found a lot of NVIDIA presentations, but I’d feel more comfortable with something akin to a book. If anyone can recommend anything, I’d be grateful.

Skybuck · April 6, 2014, 12:44pm

It’s still somewhat strange. Let’s say CUDA core 1 performs a memory lookup.

CUDA cores 2, 3, 4, 5, 6 and so forth also benefit from the memory lookup of core 1…

It seems that’s what’s happening here…

However, CUDA cores 2, 3, 4, 5, 6 didn’t really request that memory yet…

It was already done by CUDA core 1…

So one could argue that core 2 and the rest would have had a cache hit?

iamkaka · February 3, 2016, 4:38am

By default, code compiled for Kepler does not cache global memory loads in L1. Hence there are no L1 cache hits.
