Why do atomic operations on sysmem always miss?

If I run the following command on a system with Tesla V100:

$ ncu --query-metrics | grep lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit

I get the following description:

lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit       # of LTS sectors from node GPC accessing system memory (sysmem) for atomic ALU (non-CAS) that hit

Referring to the memory hierarchy diagram that is available in nsight compute:

[Nsight Compute memory hierarchy diagram]

we see that the access to sysmem flows “through” the L2.

I think you may get a better/more authoritative answer on the ncu forum, but my guess is as follows. The L2 cache on the device acts as a “proxy” for most global space accesses that would be backed by device memory. Therefore, in resolving an atomic in the L2, it is first necessary to determine whether the atomic target is already present in the L2 cache (a “hit”) or not (a “miss”).
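To check where your atomics actually land, you could collect both the hit and the miss variants of the metric and compare. A hypothetical invocation (the `lookup_miss` counterpart and the `.sum` rollup suffix follow the usual Nsight Compute metric naming conventions; verify the exact names with `ncu --query-metrics` on your setup):

```shell
# Collect both hit and miss counters for sysmem atomic ALU ops (names assumed
# from ncu's metric naming scheme; confirm with `ncu --query-metrics`).
ncu --metrics \
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit.sum,\
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss.sum \
./my_app
```

If the reasoning above is right, you would expect the hit counter to stay at zero while the miss counter accounts for all of the sysmem atomic traffic.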

System memory (i.e. host memory that is accessible from the device because it is pinned/page-locked) is typically not cached in L2 by default. I don’t think this is well documented, but it’s possible to ascertain with a relatively simple microbenchmarking test. As a result, I would expect a global space access that targets sysmem to “never hit”, i.e. always “miss”, in the L2.
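A rough sketch of such a microbenchmark (all names here are my own, not from the original discussion): read the same pinned host buffer twice from a kernel. If sysmem reads were cached in L2, the second pass would show L2 hits; if the hit counters stay at zero across both passes, sysmem is evidently not being cached there.

```cuda
// Microbenchmark sketch (illustrative, not authoritative). Profile with e.g.:
//   ncu --metrics lts__t_sectors_aperture_sysmem_op_read_lookup_hit.sum ./a.out
// (metric name assumed from ncu's naming scheme; check --query-metrics)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_kernel(const int *buf, size_t n, unsigned long long *out) {
    unsigned long long acc = 0;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        acc += buf[i];
    atomicAdd(out, acc);  // combine per-thread partial sums
}

int main() {
    const size_t n = 1 << 22;            // 16 MB of int data
    int *h_buf;                          // pinned (page-locked) host allocation
    cudaHostAlloc(&h_buf, n * sizeof(int), cudaHostAllocMapped);
    for (size_t i = 0; i < n; i++) h_buf[i] = 1;

    int *d_view;                         // device pointer aliasing the host buffer
    cudaHostGetDevicePointer(&d_view, h_buf, 0);

    unsigned long long *out;
    cudaHostAlloc(&out, sizeof(*out), cudaHostAllocMapped);
    *out = 0;
    unsigned long long *d_out;
    cudaHostGetDevicePointer(&d_out, out, 0);

    sum_kernel<<<1, 256>>>(d_view, n, d_out);  // pass 1: compulsory misses
    sum_kernel<<<1, 256>>>(d_view, n, d_out);  // pass 2: still misses if sysmem is uncached in L2
    cudaDeviceSynchronize();
    printf("sum = %llu\n", *out);
    return 0;
}
```

A timing comparison against an identically sized `cudaMalloc` buffer would show the same thing indirectly: the device-memory version speeds up once the data is L2-resident, while the sysmem version does not.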

BTW, if you think carefully about what a sysmem atomic implies, you might not want it to ever “hit” in the L2: the point of an atomic on host memory is that the result be coherent with host-side observers, and resolving it against a cached copy in the device L2 could leave the host seeing stale data.

Also, regarding your question here, note that that case is a bit different: you’re not using atomics there. Based on my own testing, sysmem accesses are not cached in L2, but they may be cached in L1. Atomics generally “bypass” the L1 and get “resolved” in the L2. But for “ordinary” accesses to sysmem, I believe it is possible to have “hits” (in the L1).
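For completeness, here is a sketch of the sysmem-atomic case itself (names are illustrative): a kernel atomically increments a counter living in pinned host memory. Per the discussion above, these atomics bypass L1 and are resolved on the L2/sysmem path, so one would expect the corresponding `lookup_hit` metric to stay at zero while the miss counter absorbs all the traffic.

```cuda
// Sysmem atomic sketch (illustrative). The counter lives in pinned host
// memory; the kernel's atomicAdd is an atomic ALU (non-CAS) op targeting sysmem.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bump(unsigned int *ctr) {
    atomicAdd(ctr, 1u);                 // resolved in L2, bypassing L1
}

int main() {
    unsigned int *h_ctr;                // pinned, device-mapped host counter
    cudaHostAlloc(&h_ctr, sizeof(unsigned int), cudaHostAllocMapped);
    *h_ctr = 0;

    unsigned int *d_ctr;
    cudaHostGetDevicePointer(&d_ctr, h_ctr, 0);

    bump<<<64, 128>>>(d_ctr);
    cudaDeviceSynchronize();
    printf("counter = %u\n", *h_ctr);   // 64 * 128 = 8192
    return 0;
}
```

Profiling this with the atomic hit/miss metric pair should reproduce the behavior in the question: all misses, no hits.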