I am trying to understand some metrics about sysmem:
Sysmem access through the L2 cache:
1. lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit,
2. lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss,
Atomic operations on sysmem:
3. lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit,
4. lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss,
I ran two tests and have two questions:
For metrics 1 and 2, I used this kernel with launch dimensions (1, 32). Both input and output are allocated with cudaHostAlloc:
__global__ void LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel(float *input, float *output, float *temp) {
    output[threadIdx.x] = input[threadIdx.x];
}
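A minimal host-side sketch of this kind of test, for anyone who wants to reproduce it. Everything here beyond cudaHostAlloc and the 32-thread launch is an assumption: the cudaHostAllocMapped flag, the reading of (1, 32) as one block of 32 threads, the hypothetical names CopyKernel and RunSysmemCopyTest, and the dropped unused temp parameter.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in for the copy kernel above (unused temp parameter dropped).
__global__ void CopyKernel(const float *input, float *output) {
    output[threadIdx.x] = input[threadIdx.x];
}

// Copies 32 floats between two pinned host buffers from the GPU.
// Returns true if every CUDA call succeeded and the data arrived intact.
bool RunSysmemCopyTest() {
    constexpr int kThreads = 32;
    constexpr std::size_t kBytes = kThreads * sizeof(float);  // 128 bytes

    float *h_in = nullptr;
    float *h_out = nullptr;
    // Assumption: mapped pinned memory, so the kernel accesses sysmem directly.
    if (cudaHostAlloc((void **)&h_in, kBytes, cudaHostAllocMapped) != cudaSuccess) {
        return false;
    }
    if (cudaHostAlloc((void **)&h_out, kBytes, cudaHostAllocMapped) != cudaSuccess) {
        cudaFreeHost(h_in);
        return false;
    }
    for (int i = 0; i < kThreads; ++i) {
        h_in[i] = static_cast<float>(i);
        h_out[i] = -1.0f;
    }

    // Device-side aliases of the pinned host buffers.
    float *d_in = nullptr;
    float *d_out = nullptr;
    bool ok = cudaHostGetDevicePointer((void **)&d_in, h_in, 0) == cudaSuccess &&
              cudaHostGetDevicePointer((void **)&d_out, h_out, 0) == cudaSuccess;

    if (ok) {
        // Assumption: "(1, 32)" means 1 block of 32 threads.
        CopyKernel<<<1, kThreads>>>(d_in, d_out);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    for (int i = 0; ok && i < kThreads; ++i) {
        ok = (h_out[i] == h_in[i]);
    }

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return ok;
}
```

Calling RunSysmemCopyTest() from main and profiling that process should exercise the sysmem aperture metrics listed above, provided the buffers really are accessed as mapped host memory.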
Result:
Question 1:
In this test, only 32 threads read input and then write to output. Reading input should cause 4 sector misses in the L2 cache. The output buffer is not cached in L2 before the write either, so I expected the write to cause another 4 sector misses, but the results show 4 sector hits instead. So my question is: why are the results 4 sector misses and 4 sector hits instead of 8 sector misses?
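The expected count of 4 sectors per access comes from simple arithmetic, sketched below. It assumes a 32-byte L2 sector and a buffer whose base address is sector-aligned (which should hold for cudaHostAlloc memory); the helper name SectorsTouched is hypothetical.

```cuda
#include <cstddef>

// Assumption: the L2 cache tracks data in 32-byte sectors.
constexpr std::size_t kSectorBytes = 32;

// Number of sectors covered by `threads` consecutive elements of `elem_bytes`
// bytes each, starting `offset` bytes past a sector-aligned base address.
constexpr std::size_t SectorsTouched(std::size_t threads,
                                     std::size_t elem_bytes,
                                     std::size_t offset = 0) {
    return (threads == 0 || elem_bytes == 0)
               ? 0
               : (offset + threads * elem_bytes - 1) / kSectorBytes -
                     offset / kSectorBytes + 1;
}

// 32 threads x 4-byte float = 128 bytes = 4 sectors, for the read and again for the write.
static_assert(SectorsTouched(32, sizeof(float)) == 4,
              "one warp of floats should span 4 sectors");
```

With 4 sectors for the read and 4 for the write, 8 sector lookups in total are expected, which matches the sum of the reported hits and misses.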
For metrics 3 and 4, I used this kernel with launch dimensions (1, 32). The input is allocated with cudaHostAlloc:
__global__ void LtsTSectorsSrcnodeGpcApertureDeviceOpAtomDotAluLookupHitMissKernel(unsigned int *input) {
    atomicAdd(&input[0], 10);
}
Result:
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit result: 0
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss result: 32
Question 2:
Why do atomic operations on sysmem always miss in the L2 cache and never hit?