I am trying to understand the ncu metrics lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit and lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss, so I ran some tests on a 3080:
test function (launches only 32 threads):
test1 kernel:
test1 result:
test2 kernel:
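__global__ void LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel(float *input, float *output, float *temp) {
    // Each thread scales its own element in place, so the load and the store target the same address.
    input[threadIdx.x] = input[threadIdx.x] * 1.5f;
}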
test2 result:
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit_miss test
==PROF== Connected to process 13248 (/home/yongjian/CUDA_Note/build/bin/test_lts_sector_GPC)
==PROF== Profiling "LtsTSectorsSrcnodeGpcAperture…" - 1: 0%…50%…100% - 1 pass
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit_miss test finished
==PROF== Disconnected from process 13248
[13248] test_lts_sector_GPC@127.0.0.1
LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel(float *, float *, float *), 2021-Dec-28 11:10:14, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.avg              sector                                    0,10
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.max              sector                                       4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.min              sector                                       0
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.sum              sector                                       4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.avg             sector                                    0,10
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.max             sector                                       4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.min             sector                                       0
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.sum             sector                                       4
---------------------------------------------------------------------- --------------- ------------------------------
My understanding:
In test2, the kernel reads 32 floats (32 × 4 bytes = 128 bytes, i.e. four 32-byte sectors, all load misses), multiplies each by 1.5, and writes the result back to the same location. Because the data is already cached in L2, the stores should hit, giving 4 sector store hits. The ncu results match this expectation.
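The sector count in that reasoning is plain arithmetic. Here is a minimal host-side sketch, assuming a 32-byte L2 sector and a buffer that starts on a 32-byte boundary (plain C++ that also compiles under nvcc; `sectors_touched` is a hypothetical helper, not part of CUDA or ncu):

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: one L2 sector is 32 bytes.
constexpr std::size_t kSectorBytes = 32;

// Number of distinct sectors covered by the byte range [addr, addr + bytes).
std::size_t sectors_touched(std::uintptr_t addr, std::size_t bytes) {
    if (bytes == 0) return 0;
    const std::uintptr_t first = addr / kSectorBytes;
    const std::uintptr_t last = (addr + bytes - 1) / kSectorBytes;
    return static_cast<std::size_t>(last - first + 1);
}
```

For one warp of 32 threads each touching one consecutive float from an aligned base, `sectors_touched(0, 32 * sizeof(float))` evaluates to 4. The same range shifted off alignment, for example starting at byte 4, would cover 5 sectors.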
In test1, however, the kernel only reads 32 floats (4 sector load misses) and then writes them to a different location, yet the result is also 4 sector misses and 4 sector hits. This really confuses me: when the kernel stores the 32 floats, the writes should miss, so what does the L2 cache do in that case? Why does lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit still increase by 4 sectors?
Can anyone help me? Thank you very much!
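To make my expectation explicit, here is a toy host-side model of the policy I am assuming: every lookup of a non-resident sector is a miss and allocates it, and every lookup of a resident sector is a hit. This is only a sketch of my mental model, not a description of what the real L2 does. `ToyL2`, `same_buffer`, `different_buffers` and the byte offsets are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Assumption: 32-byte sectors. This is a toy model, NOT the actual hardware policy.
constexpr std::size_t kToySectorBytes = 32;

struct ToyL2 {
    std::set<std::uintptr_t> resident;  // indices of sectors currently cached
    std::size_t hits = 0;
    std::size_t misses = 0;

    // Look up every sector in [addr, addr + bytes); allocate the sector on a miss.
    void access(std::uintptr_t addr, std::size_t bytes) {
        if (bytes == 0) return;
        const std::uintptr_t last = (addr + bytes - 1) / kToySectorBytes;
        for (std::uintptr_t s = addr / kToySectorBytes; s <= last; ++s) {
            if (resident.insert(s).second) ++misses;  // newly inserted: was not resident
            else ++hits;                              // already resident
        }
    }
};

// test2 pattern: load 128 bytes, then store to the same 128 bytes.
ToyL2 same_buffer() {
    ToyL2 l2;
    l2.access(0, 128);
    l2.access(0, 128);
    return l2;
}

// test1 pattern as I describe it: load one buffer, store to a different one.
// The offset 4096 is an arbitrary, hypothetical address for the second buffer.
ToyL2 different_buffers() {
    ToyL2 l2;
    l2.access(0, 128);
    l2.access(4096, 128);
    return l2;
}
```

Under this model `same_buffer()` ends with 4 misses and 4 hits, which matches test2, while `different_buffers()` ends with 8 misses and 0 hits, which is not what ncu reports for test1. That mismatch is the question.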