Why do atomic operations on sysmem always miss?

If I run the following command on a system with Tesla V100:

$ ncu --query-metrics | grep lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit

I get the following description:

lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit       # of LTS sectors from node GPC accessing system memory (sysmem) for atomic ALU (non-CAS) that hit

Referring to the memory hierarchy diagram that is available in nsight compute:

[Nsight Compute memory hierarchy diagram]

we see that the access to sysmem flows “through” the L2.

I think you may get a better/more authoritative answer on the ncu forum, but my guess is as follows. The L2 cache on the device acts as a “proxy” for most global space accesses that would be backed by device memory. Therefore, in resolving an atomic in the L2, it is first necessary to determine whether the atomic target is already present in the L2 cache (a “hit”) or not (a “miss”).
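To check where your atomics actually land, you could collect both the hit and the miss variants of the metric and compare. A hypothetical invocation (the `lookup_miss` counterpart and the `.sum` rollup suffix follow the usual Nsight Compute metric naming conventions; verify the exact names with `ncu --query-metrics` on your setup):

```shell
# Collect both hit and miss counters for sysmem atomic ALU ops (names assumed
# from ncu's metric naming scheme; confirm with `ncu --query-metrics`).
ncu --metrics \
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_hit.sum,\
lts__t_sectors_srcnode_gpc_aperture_sysmem_op_atom_dot_alu_lookup_miss.sum \
./my_app
```

If the reasoning above is right, you would expect the hit counter to stay at zero while the miss counter accounts for all of the sysmem atomic traffic.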

System memory (i.e. host memory that is accessible from the device because it is pinned/page-locked) is typically not cached in L2 by default. I don’t think this is well documented, but it’s possible to ascertain with a relatively simple microbenchmarking test. As a result, I would expect a global space access that targets sysmem to “never hit”, i.e. always “miss”, in the L2.
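A rough sketch of such a microbenchmark (all names here are my own, not from the original discussion): read the same pinned host buffer twice from a kernel. If sysmem reads were cached in L2, the second pass would show L2 hits; if the hit counters stay at zero across both passes, sysmem is evidently not being cached there.

```cuda
// Microbenchmark sketch (illustrative, not authoritative). Profile with e.g.:
//   ncu --metrics lts__t_sectors_aperture_sysmem_op_read_lookup_hit.sum ./a.out
// (metric name assumed from ncu's naming scheme; check --query-metrics)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_kernel(const int *buf, size_t n, unsigned long long *out) {
    unsigned long long acc = 0;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        acc += buf[i];
    atomicAdd(out, acc);  // combine per-thread partial sums
}

int main() {
    const size_t n = 1 << 22;            // 16 MB of int data
    int *h_buf;                          // pinned (page-locked) host allocation
    cudaHostAlloc(&h_buf, n * sizeof(int), cudaHostAllocMapped);
    for (size_t i = 0; i < n; i++) h_buf[i] = 1;

    int *d_view;                         // device pointer aliasing the host buffer
    cudaHostGetDevicePointer(&d_view, h_buf, 0);

    unsigned long long *out;
    cudaHostAlloc(&out, sizeof(*out), cudaHostAllocMapped);
    *out = 0;
    unsigned long long *d_out;
    cudaHostGetDevicePointer(&d_out, out, 0);

    sum_kernel<<<1, 256>>>(d_view, n, d_out);  // pass 1: compulsory misses
    sum_kernel<<<1, 256>>>(d_view, n, d_out);  // pass 2: still misses if sysmem is uncached in L2
    cudaDeviceSynchronize();
    printf("sum = %llu\n", *out);
    return 0;
}
```

A timing comparison against an identically sized `cudaMalloc` buffer would show the same thing indirectly: the device-memory version speeds up once the data is L2-resident, while the sysmem version does not.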

BTW, if you think carefully about what a sysmem atomic implies, you might not want it to ever “hit” in the L2: the point of an atomic on host memory is that the result be coherent with host-side observers, and resolving it against a cached copy in the device L2 could leave the host seeing stale data.

Also, regarding your question here, note that that case is a bit different: you’re not using atomics there. Based on my own testing, sysmem accesses are not cached in L2, but they may be cached in L1. Atomics generally “bypass” the L1 and get “resolved” in the L2. But for “ordinary” accesses to sysmem, I believe it is possible to have “hits” (in the L1).
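For completeness, here is a sketch of the sysmem-atomic case itself (names are illustrative): a kernel atomically increments a counter living in pinned host memory. Per the discussion above, these atomics bypass L1 and are resolved on the L2/sysmem path, so one would expect the corresponding `lookup_hit` metric to stay at zero while the miss counter absorbs all the traffic.

```cuda
// Sysmem atomic sketch (illustrative). The counter lives in pinned host
// memory; the kernel's atomicAdd is an atomic ALU (non-CAS) op targeting sysmem.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bump(unsigned int *ctr) {
    atomicAdd(ctr, 1u);                 // resolved in L2, bypassing L1
}

int main() {
    unsigned int *h_ctr;                // pinned, device-mapped host counter
    cudaHostAlloc(&h_ctr, sizeof(unsigned int), cudaHostAllocMapped);
    *h_ctr = 0;

    unsigned int *d_ctr;
    cudaHostGetDevicePointer(&d_ctr, h_ctr, 0);

    bump<<<64, 128>>>(d_ctr);
    cudaDeviceSynchronize();
    printf("counter = %u\n", *h_ctr);   // 64 * 128 = 8192
    return 0;
}
```

Profiling this with the atomic hit/miss metric pair should reproduce the behavior in the question: all misses, no hits.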