Say that threads access memory in a coalesced manner, e.g., in an ideal memcpy kernel implementation.
What are the expected L1 and L2 cache hit rates for this kernel?
For 32-bit accesses and a cache-line size of 128 bytes, would the hit rate simply be (1024-32)/1024 = 31/32 ≈ 97%, since 32 of every 1024 bits accessed lead to a miss? And would the hit rate be 30/32 ≈ 94% if the accesses are not aligned with cache-line boundaries?
For the 32-bit aligned case, and assuming the data are not already in the cache, I would expect the hit rate to be approximately zero. For Pascal or newer GPUs, we can consider things on a sector-by-sector (32-byte) basis. To a first-order approximation, even if there is prefetching, I would not expect the prefetching to be highly useful or meaningful for an “ideal memcpy kernel”.
Perhaps you will prove me wrong? Did you try it? Default usage of Nsight Compute makes this fairly easy, because by default it will invalidate the caches prior to profiling a kernel (you can modify this behavior, but it’s not needed for this test case).
Example:
# cat t323.cu
template <typename T>
__global__ void copyk(T *d, const int sz){
  // grid-stride loop: copy the upper half of the buffer into the lower half
  for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < sz; i += gridDim.x*blockDim.x)
    d[i] = d[i+sz];
}

using mt = int;
const int sz = 1048576*32;
const int nTPB = 512;

int main(){
  mt *d;
  cudaMalloc(&d, 2*sz*sizeof(mt));  // one allocation: upper half is the source, lower half the destination
  copyk<<<58*3, nTPB>>>(d, sz);
  cudaDeviceSynchronize();
}
# nvcc -o t323 t323.cu -arch=sm_89
# ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit,lts__t_sectors_srcunit_tex_op_read_lookup_hit ./t323
==PROF== Connected to process 207144 (/root/bobc/t323)
==PROF== Profiling "void copyk<int>(T1 *, int)" - 0: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 207144
[207144] t323@127.0.0.1
void copyk<int>(T1 *, int) (174, 1, 1)x(512, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics
--------------------------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
--------------------------------------------------------- ----------- ------------
l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.avg sector 0
l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.max sector 0
l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.min sector 0
l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.sum sector 0
lts__t_sectors_srcunit_tex_op_read_lookup_hit.avg sector 0
lts__t_sectors_srcunit_tex_op_read_lookup_hit.max sector 0
lts__t_sectors_srcunit_tex_op_read_lookup_hit.min sector 0
lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum sector 0
--------------------------------------------------------- ----------- ------------
#
Thank you for the clear example code. You are correct!
Every (read) access is a cache miss because a sector is only 128 bytes (32 ints) in L1 and 32 bytes (8 ints) in L2, so only for smaller data types such as char will the hit rate be above zero.
I wouldn’t necessarily expect the hit rate to be above zero for retrieving char quantities either.
A warp consists of 32 threads. If those 32 threads are each requesting a byte of data, and the 32 bytes for the warp are adjacent (i.e. coalesced) then at most two sectors will be requested. If the 32 bytes are aligned, then only one sector will be requested/retrieved per warp. In that situation, I personally would have no expectation that the observed hit rate would be higher than zero, if the data is not already in the cache.
We could imagine that things like prefetching might be in play. However:
- I know of no guarantees provided for prefetching.
- Prefetching implies some sort of temporal characteristic. For a bulk load (lots of data being loaded “all at once” by many threads), I have no reason to believe that any prefetching mechanism would actually deliver the data to the cache before the warp that needs it actually requests it.
In Pascal and newer GPUs, a sector is 32 bytes, regardless of whether we have L1 or L2 in view. You can discover this in the profiler documentation, and it is also referred to here.
In addition, as a wise man wrote on this forum (and also described in Robert’s first link in the previous reply) ;-):
In modern GPUs (say, Pascal and newer) both the L1 and L2 cache can be populated sector-by-sector. The minimum granularity is 1 sector or 32 bytes. The cache line tag, however, applies to 4 sectors (in each case) that comprise the 128-byte cache line. You can adjust L2 cache granularity.
I see.
The profiler documentation defines a sector as:
[An] aligned 32 byte-chunk of memory in a cache line or device memory. An L1 or L2 cache line is four sectors, i.e. 128 bytes.
And from your second link:
[…] On Pascal the data access unit is 32B regardless of whether global loads are cached in L1.
This simplifies things. All accesses are in 32B chunks, and I suppose the figures here help to illustrate how a sector miss leads to a request higher up the memory hierarchy. But what is a cache line, then…
I get an L1 hit rate of 35-40% for the char case according to the metric l1tex__t_sector_hit_rate.pct (which is what the Nsight Compute GUI reports as the L1/TEX Hit Rate). However, the L2 sector hit rate lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum is still zero. Does this mean the “L1/TEX Hit Rate” is not representative of the performance, since it is actually the sector hit rate that matters in practice?
I tried to set cudaLimitMaxL2FetchGranularity to 128, but this changed neither the performance nor the metrics. I’m not sure what the granularity here refers to or if it has any impact on my RTX 4060 (laptop) GPU. Maybe this is more relevant for reused data and not for my use case.
Do cache lines actually matter in modern architectures, or are all accesses performed w.r.t. sectors?
A cache line is a group of sectors served by a single tag entry. But data retrieval from DRAM is done on a sector-by-sector basis. If 4 adjacent sectors served by a single tag are needed/requested, they will go into a single cache line.
The .level::prefetch_size qualifier is a hint to fetch additional data of the specified size into the respective cache level. The sub-qualifier prefetch_size can be set to either of 64B, 128B, 256B thereby allowing the prefetch size to be 64 Bytes, 128 Bytes or 256 Bytes respectively.
The .level::prefetch_size qualifier is treated as a performance hint only.
As it is a hint, the actual transferred memory could differ.
(And perhaps the L1 cache subsystem sees that more sectors within the same 128 bytes are requested in the pipeline and combines those requests, leading to an L1 hit rate > 0%?)
Perhaps you could compare the amount of data transferred from global memory to L2, from L2 to L1, and from L1 to your kernel, against the theoretical amount of data read by your program?
Then it seems cache lines do not matter in terms of performance, since fetches are done on a sector-by-sector basis; that would make the “L1/TEX Hit Rate” metric in Nsight Compute misleading if it only gives the hit rate in terms of cache lines and not sectors.
Edit:
Never mind, I got a non-zero “L1/TEX Hit Rate” because it included the write accesses and not only read accesses. It appears this metric does measure the hit rate in terms of sectors.
Good idea. I tried using this L2 cache prefetch hint. It did not have a very large impact on the duration of the copyk kernel, but for char data, it improved the read-access L2 hit rate from 0% to 11%. For int data there was no difference.