Say that threads access memory in a coalesced manner, e.g., in an ideal memcpy kernel implementation.
What are the expected L1 and L2 cache hit rates for this kernel?
For 32-bit accesses and a cache-line size of 128 bytes, would the hit-rate just be (1024-32)/1024 = 31/32 = 97% as 32 of every 1024 bits accessed lead to a miss? And will the hit rate be 30/32 = 94% if the accesses are not aligned with cache line boundaries?
For the 32-bit aligned case, and assuming the data are not already in the cache, I would expect that the hit rate is approximately zero. For pascal or newer GPUs, we can consider things on a sector (32 bytes) by sector basis. To a first order approximation, even if there is prefetching, I would not expect the prefetching to be highly useful or meaningful for an “ideal memcpy kernel”.
Perhaps you will prove me wrong? Did you try it? Default usage of nsight compute makes this fairly easy, because by default it will invalidate the cache prior to a kernel profile (you can modify this behavior, but its not needed for this test case.)
Example:
# cat t323.cu
template <typename T>
__global__ void copyk(T *d, const int sz){
for (int i = blockIdx.x*blockDim.x+threadIdx.x; i < sz; i+=gridDim.x*blockDim.x) d[i] = d[i+sz];
}
using mt = int;
const int sz = 1048576*32;
const int nTPB = 512;
int main(){
mt *d;
cudaMalloc(&d, 2*sz*sizeof(mt));
copyk<<<58*3, nTPB>>>(d, sz);
cudaDeviceSynchronize();
}
# nvcc -o t323 t323.cu -arch=sm_89
# ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit,lts__t_sectors_srcunit_tex_op_read_lookup_hit ./t323
==PROF== Connected to process 207144 (/root/bobc/t323)
==PROF== Profiling "void copyk<int>(T1 *, int)" - 0: 0%....50%....100% - 1 pass
==PROF== Disconnected from process 207144
[207144] t323@127.0.0.1
void copyk<int>(T1 *, int) (174, 1, 1)x(512, 1, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics
--------------------------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
--------------------------------------------------------- ----------- ------------
l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.avg sector 0
l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.max sector 0
l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.min sector 0
l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.sum sector 0
lts__t_sectors_srcunit_tex_op_read_lookup_hit.avg sector 0
lts__t_sectors_srcunit_tex_op_read_lookup_hit.max sector 0
lts__t_sectors_srcunit_tex_op_read_lookup_hit.min sector 0
lts__t_sectors_srcunit_tex_op_read_lookup_hit.sum sector 0
--------------------------------------------------------- ----------- ------------
#