L1 and L2 cache hit rate

I have a question about the profiler metrics that I get for the simple vector-vector addition kernel below,

__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Global thread index
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    // Guard against out-of-bounds access in the last partial block
    if (id < n)
        c[id] = a[id] + b[id];
}
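(For reference, a minimal host-side harness along these lines can drive the kernel; the array size and launch configuration below are illustrative assumptions on my part, not necessarily what produced the numbers that follow.)

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* vecAdd kernel as defined above goes here */

int main(void)
{
    int n = 1 << 24;                            /* 16M elements (illustrative) */
    size_t bytes = (size_t)n * sizeof(double);

    double *h_a = (double*)malloc(bytes);
    double *h_b = (double*)malloc(bytes);
    double *h_c = (double*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0; h_b[i] = 2.0; }

    double *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize = 256;                        /* threads per block */
    int gridSize  = (n + blockSize - 1) / blockSize;  /* round up */
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);              /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}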

The profiler tells me the following,

Kernel: vecAdd(double*, double*, double*, int)
      1                  gld_throughput          Global Load Throughput  102.92GB/s
      1            dram_read_throughput   Device Memory Read Throughput  103.54GB/s
      1              l2_read_throughput           L2 Throughput (Reads)  103.67GB/s
      1        l1_cache_global_hit_rate              L1 Global Hit Rate       0.00%
      1             l2_l1_read_hit_rate          L2 Hit Rate (L1 Reads)       0.00%
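(Metrics like these can be collected with nvprof’s --metrics switch, passing the metric names shown in the left column, e.g.

    nvprof --metrics gld_throughput,dram_read_throughput,l2_read_throughput,l1_cache_global_hit_rate,l2_l1_read_hit_rate ./vecAdd

where ./vecAdd stands in for the actual binary name.)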

From the output, I can surmise that the kernel is not finding any of the necessary data in the L1 or L2 caches, which is why it has to go to DRAM. This explains why gld_throughput is approximately equal to dram_read_throughput.

My question is: why is the kernel not finding any data in the caches? I have not disabled L1 caching, so I don’t see why the L1 and L2 hit rates are 0%.

And further, why is the l2_read_throughput nonzero if the L2 hit rate is 0?

This is a streaming kernel, where every piece of data is touched exactly once. Since there is no data re-use, there are no cache hits. As for the L2 throughput being equal to the global load throughput: it is because the load data is read through the L2 (but not found there).

Very clear, thanks a lot!

Just because data is touched once doesn’t necessarily mean there would be no cache hits.

Any chip could read in more data than requested, so the next access could still be a cache hit.

So seeing no cache hits at all doesn’t make sense to me.

Agreed, lack of data re-use for each individual data item does not necessarily mean there can be no cache hits. An initial access to a cache line causes the entire line to be fetched, even if only some of the data in that line is used by the access that triggered the miss and the resulting fetch. A subsequent access to a different location in the previously unused portion of the line can then hit the cache.

However, in this code the data is read in contiguous streams following the “base + tid” access pattern, so each warp’s accesses coalesce into transactions that fetch and consume each cache line in its entirety on the initial access. That means there are no further accesses to a fetched line, and thus no cache hits.
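To put rough numbers on that (assuming the 128-byte cache lines of these GPUs and 8-byte doubles, which I believe is the relevant configuration here):

    32 threads/warp × 8 bytes/double = 256 bytes = 2 full 128-byte lines

so each warp’s coalesced load of a[] (and likewise b[]) fills exactly two lines and consumes them completely in the same transactions; no later access returns to them.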

Story checks out:

__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds (n-1 because of the a[id+1] read)
    if (id < n-1) {
        c[id] = a[id] + b[id];
        // Second load is shifted by one element: for most threads, a[id+1]
        // falls in a cache line already fetched by the loads above, so this
        // access can hit in the cache.
        c[id] = a[id+1];
    }
}

gives,

Kernel: vecAdd(double*, double*, double*, int)
          1            dram_read_throughput   Device Memory Read Throughput  95.880GB/s
          1                  gld_throughput          Global Load Throughput  189.86GB/s
          1        l1_cache_global_hit_rate              L1 Global Hit Rate      37.76%
          1             l2_l1_read_hit_rate          L2 Hit Rate (L1 Reads)      19.71%

Thanks for the clarification, njuffa.

On a related note, I’ve been having trouble finding clear documentation online about what exactly the GPU does with data with respect to the caches. I’ve found a lot of NVIDIA presentations, but I’d feel more comfortable with something akin to a book… If anyone can recommend anything, I’d be grateful.

It’s still somewhat strange. Let’s say CUDA core 1 performs a memory lookup.

CUDA cores 2, 3, 4, 5, 6 and so forth also benefit from the memory lookup of core 1…

It seems that’s what’s happening here…

However, CUDA cores 2, 3, 4, 5, 6 didn’t actually request that memory yet…

It was already fetched by core 1…

So one could argue that cores 2 and onward would have had a cache hit?

The default compile option on Kepler is that global memory loads are not cached in L1. Hence there are no L1 cache hits.
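If that default is the issue on a given chip, caching of global loads in L1 can be requested at compile time through the PTX assembler’s load-cache-modifier option (a sketch; note that on Kepler only some chips honor this for global loads, so it is not guaranteed to change the hit rate):

    nvcc -Xptxas -dlcm=ca -o vecAdd vecAdd.cu

Here ca means “cache at all levels (L1 and L2)”; the Kepler default for global loads corresponds to cg, “cache in L2 only”.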