Metrics about sysmem access with L2 cache

I am trying to understand some metrics about sysmem.

Sysmem access through the L2 cache:

  1. lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit
  2. lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss

Atomic operations on sysmem:

  3. lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit
  4. lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss

I ran two tests and have two questions.

For metrics 1 and 2, I used this kernel with launch dimensions (1, 32), that is, one block of 32 threads. Both input and output are allocated with cudaHostAlloc:

__global__ void LtsTSectorsSrcnodeGpcApertureSysmemHitMissThreadNum32Kernel(float *input, float *output, float *temp) {
    // Each thread copies one float from input to output; temp is unused.
    output[threadIdx.x] = input[threadIdx.x];
}

Result:

Question 1:
In this test, only 32 threads read input and then write to output. The read should cause 4 sector misses in the L2 cache, since 32 threads x 4 bytes = 128 bytes = 4 sectors of 32 bytes. The output buffer is not cached in L2 either, so I expected the writes to cause another 4 sector misses. Instead, the results show 4 sector misses and 4 sector hits. Why is it not 8 sector misses?
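
A minimal host-side sketch for reproducing this first test follows. The short kernel name `CopyKernel`, the `cudaHostAllocMapped` flag, the explicit `cudaHostGetDevicePointer` calls, and the helper `SectorsTouched` are illustrative assumptions, not the exact code used above; the unused `temp` parameter is dropped for brevity. The sector arithmetic assumes 32-byte L2 sectors and sector-aligned buffers.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

constexpr size_t kL2SectorBytes = 32;  // assumed L2 sector size in bytes
constexpr int kThreads = 32;           // launch dimensions (1, 32)

// Number of sectors covered by a contiguous, sector-aligned range of `bytes`.
constexpr size_t SectorsTouched(size_t bytes) {
    return (bytes + kL2SectorBytes - 1) / kL2SectorBytes;
}

// Hypothetical short name; same body as the kernel in the question.
__global__ void CopyKernel(const float *input, float *output) {
    output[threadIdx.x] = input[threadIdx.x];
}

int main() {
    const size_t bytes = kThreads * sizeof(float);  // 32 * 4 = 128 bytes

    // Assumption: pinned, mapped host memory so the GPU accesses sysmem directly.
    float *hIn = nullptr, *hOut = nullptr;
    if (cudaHostAlloc((void **)&hIn, bytes, cudaHostAllocMapped) != cudaSuccess ||
        cudaHostAlloc((void **)&hOut, bytes, cudaHostAllocMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }
    for (int i = 0; i < kThreads; ++i) {
        hIn[i] = (float)i;
        hOut[i] = 0.0f;
    }

    // Device-side aliases of the two host buffers.
    float *dIn = nullptr, *dOut = nullptr;
    cudaHostGetDevicePointer((void **)&dIn, hIn, 0);
    cudaHostGetDevicePointer((void **)&dOut, hOut, 0);

    CopyKernel<<<1, kThreads>>>(dIn, dOut);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // 128 bytes / 32 bytes per sector = 4 sectors for each of input and output.
    printf("expected sectors per array: %zu\n", SectorsTouched(bytes));

    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return 0;
}
```

Profiling the resulting binary with Nsight Compute, for example `ncu --metrics lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_hit.sum,lts__t_sectors_srcnode_gpc_aperture_sysmem_lookup_miss.sum ./a.out`, should then report metrics 1 and 2 for this kernel. The `.sum` rollup suffix and the binary name are assumptions here.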

For metrics 3 and 4, I used this kernel with launch dimensions (1, 32). The input is allocated with cudaHostAlloc:

__global__ void LtsTSectorsSrcnodeGpcApertureDeviceOpAtomDotAluLookupHitMissKernel(unsigned int *input) {
    // All 32 threads atomically add 10 to the same word in sysmem.
    atomicAdd(&input[0], 10);
}

Result:

lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit result: 0
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss result: 32

Question 2:
Why do atomic operations on sysmem always miss in the L2 cache, with no hits at all?
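
A minimal host-side sketch for reproducing the atomic test follows. The short kernel name `AtomicKernel`, the `cudaHostAllocMapped` flag, and the constants `kThreads` and `kAddend` are illustrative assumptions, not the exact code used above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 32;          // launch dimensions (1, 32)
constexpr unsigned int kAddend = 10;  // value added by each thread

// Hypothetical short name; same body as the kernel in the question.
__global__ void AtomicKernel(unsigned int *input) {
    atomicAdd(&input[0], kAddend);
}

int main() {
    // Assumption: pinned, mapped host memory so the atomics target sysmem.
    unsigned int *hCounter = nullptr;
    if (cudaHostAlloc((void **)&hCounter, sizeof(unsigned int),
                      cudaHostAllocMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }
    *hCounter = 0;

    // Device-side alias of the host counter.
    unsigned int *dCounter = nullptr;
    cudaHostGetDevicePointer((void **)&dCounter, hCounter, 0);

    AtomicKernel<<<1, kThreads>>>(dCounter);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Sanity check: 32 threads each add 10, so the counter should read 320.
    printf("counter = %u (expected %u)\n", *hCounter, kThreads * kAddend);

    cudaFreeHost(hCounter);
    return 0;
}
```

The 32 atomic operations issued by this launch line up with the 32 sectors reported as misses above, one per thread.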