Ampere GPU L2 cache write miss policy

I am trying to understand the ncu metrics lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit and lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss, so I ran some tests on a 3080.
Test function (launches only 32 threads)


test1 kernel

test1 result

test2 kernel
__global__ void LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel(float *input, float *output, float *temp) {
    // Each thread loads one float, scales it, and stores it back to the same address.
    input[threadIdx.x] = input[threadIdx.x] * 1.5f;
}
test2 result
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit_miss test
==PROF== Connected to process 13248 (/home/yongjian/CUDA_Note/build/bin/test_lts_sector_GPC)
==PROF== Profiling "LtsTSectorsSrcnodeGpcAperture..." - 1: 0%...50%...100% - 1 pass
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit_miss test finished
==PROF== Disconnected from process 13248
[13248] test_lts_sector_GPC@127.0.0.1
LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel(float *, float *, float *), 2021-Dec-28 11:10:14, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.avg                       sector                           0,10
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.max                       sector                              4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.min                       sector                              0
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.sum                       sector                              4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.avg                      sector                           0,10
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.max                      sector                              4
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.min                      sector                              0
lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.sum                      sector                              4
---------------------------------------------------------------------- --------------- ------------------------------

my point of view
In test2, the kernel reads 32 floats (4 sectors, all load misses), multiplies each by 1.5, then writes the result back to the same location. Because the data is already cached in L2, the stores should hit, giving 4 sector store hits. The ncu results match my guess.
But in test1, the kernel just reads 32 floats (4 sectors, all load misses) and then writes them to a different location, yet the result is also 4 sector misses and 4 sector hits. This really confuses me: when the kernel stores the 32 floats, the writes should miss. What does the L2 cache do on a write miss? Why does lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit still increase by 4 sectors?
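To make the comparison concrete, here is a minimal sketch of the two access patterns described above. It is not the original test code: the test1 kernel and the host function are missing from this post, so the kernel names CopyToOtherBufferKernel and ScaleInPlaceKernel, the helper SectorsForBytes, and the use of mapped pinned host memory (cudaHostAlloc with cudaHostAllocMapped, chosen so that the accesses would go through the sysmem aperture) are all assumptions made for illustration.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// L2 sector size in bytes (32 floats * 4 bytes = 128 bytes = 4 sectors).
constexpr size_t kSectorBytes = 32;

// Number of sectors covered by a contiguous, sector-aligned range of `bytes`.
constexpr size_t SectorsForBytes(size_t bytes) {
    return (bytes + kSectorBytes - 1) / kSectorBytes;
}

// Hypothetical stand-in for the missing test1 kernel:
// load from one buffer, store to a different buffer.
__global__ void CopyToOtherBufferKernel(const float *input, float *output) {
    output[threadIdx.x] = input[threadIdx.x];
}

// Same access pattern as test2: load, scale, store back in place.
__global__ void ScaleInPlaceKernel(float *input) {
    input[threadIdx.x] = input[threadIdx.x] * 1.5f;
}

int main() {
    constexpr int kThreads = 32;
    constexpr size_t kBytes = kThreads * sizeof(float);
    printf("%zu bytes per buffer, %zu sectors\n", kBytes, SectorsForBytes(kBytes));

    // Assumption: mapped pinned host memory, so the GPU reaches it as system memory.
    float *hostIn = nullptr, *hostOut = nullptr;
    float *devIn = nullptr, *devOut = nullptr;
    if (cudaHostAlloc(&hostIn, kBytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc(&hostOut, kBytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostGetDevicePointer(&devIn, hostIn, 0) != cudaSuccess ||
        cudaHostGetDevicePointer(&devOut, hostOut, 0) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (int i = 0; i < kThreads; ++i) {
        hostIn[i] = static_cast<float>(i);
        hostOut[i] = 0.0f;
    }

    // One block of 32 threads per kernel, as in the test function.
    CopyToOtherBufferKernel<<<1, kThreads>>>(devIn, devOut);
    ScaleInPlaceKernel<<<1, kThreads>>>(devIn);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFreeHost(hostIn);
    cudaFreeHost(hostOut);
    return 0;
}
```

Profiling each kernel separately with ncu and the two metrics above would let the store behavior of the two patterns be compared directly. The sketch itself says nothing about what the counters will show.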
Can anyone help me? Thank you very much!

can anyone help me?

You might get better responses asking ncu questions on the ncu forum.

OK, thanks for the reply.