I have a question about the profiler metrics that I get for the simple vector-vector addition kernel below,
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
int id = blockIdx.x*blockDim.x+threadIdx.x;
if (id < n)
c[id] = a[id] + b[id];
}
The profiler tells me the following,
Kernel: vecAdd(double*, double*, double*, int)
1 gld_throughput Global Load Throughput 102.92GB/s
1 dram_read_throughput Device Memory Read Throughput 103.54GB/s
1 l2_read_throughput L2 Throughput (Reads) 103.67GB/s
1 l1_cache_global_hit_rate L1 Global Hit Rate 0.00%
1 l2_l1_read_hit_rate L2 Hit Rate (L1 Reads) 0.00%
From the output, I can surmise that the kernel is not finding any of the necessary data in the L1, or L2 caches, which is why it is going to look in DRAM. This explains why gld_throughput is approximately equal to dram_read_throughput.
My questions is why is the kernel not finding any data in the caches? I have not disabled L1 caching…I don’t see why the L1 and L2 hit rates are 0 ??
and further, why is the l2_read_throughput nonzero, if the L2 hit rate is 0?
This is a streaming kernel, where every piece of data is touched exactly once. Since there is no data re-use, there are no cache hits. As for the L2 throughput being equal to the global load throughput it is because the load data is read through the L2 (but not found there).
Agreed, lack of data re-use for each indiviual data item does not necessarily mean there could be no cache hits. An initial access to a cache line would cause the entire line to be fetched, even if only some of the data in the cache line is used by the initial access that triggered the cache miss with following fetch. A subsequent access to a different location in the previously unused portion of that cache line could then hit the cache.
However in this code the data is read in contiguous streams following the “base + tid” access pattern. This causes each cache line to be fetched and consumed in its entirety on initial access, meaning we do not have multiple accesses to the fetched line, and thus no cache hits.
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
int id = blockIdx.x*blockDim.x+threadIdx.x;
// Make sure we do not go out of bounds
if (id < n-1) {
c[id] = a[id] + b[id];
c[id] = a[id+1];
}
}
gives,
Kernel: vecAdd(double*, double*, double*, int)
1 dram_read_throughput Device Memory Read Throughput 95.880GB/s
1 gld_throughput Global Load Throughput 189.86GB/s
1 l1_cache_global_hit_rate L1 Global Hit Rate 37.76%
1 l2_l1_read_hit_rate L2 Hit Rate (L1 Reads) 19.71%
On a related note, I’ve been having trouble finding clear documentation online about what exactly the GPU is doing to the data with respect to the caches. I’ve found a lot of NVIDIA presentations, but I’d feel more comfortable with something akin to a book…If anyone can recommend anything, I’d be grateful.