Description of CUDA atomicity requirements

ajay_nayak · October 31, 2023, 4:13am

I was going through the NVIDIA PTX documentation and came across mmio operations (8.4.1). What got me curious are a few statements in the description.

CUDA atomicity requirements at the specified scope:

1. Writes are always performed and are never combined within the scope specified.
2. Reads are always performed, and are not forwarded, prefetched, combined, or allowed to hit any cache within the scope specified.

This also points to libcu++. However, there is no mention of atomicity requirements here.

I wanted to know whether the two statements about writes and reads also hold for atomic operations (atom.global) or atomicAdd operations.
For example, can it be inferred from the above statements that device-scoped atomic operations cannot hit in the L1 cache and instead bypass it?
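
To make the question concrete, here is a minimal sketch of the two forms of device-scoped atomic being asked about: the `atomicAdd` intrinsic (which typically lowers to an `atom.global` or `red.global` PTX instruction) and the libcu++ `cuda::atomic_ref` with an explicit device scope. The kernel names and the `run_both` host helper are hypothetical, introduced only for illustration, and the sketch assumes a GPU and toolkit recent enough to support libcu++ atomics.

```cuda
#include <cuda/atomic>      // libcu++: cuda::atomic_ref, cuda::thread_scope_device
#include <cuda_runtime.h>

// Classic CUDA intrinsic: a relaxed, device-scoped read-modify-write.
__global__ void add_with_intrinsic(int *counter)
{
    atomicAdd(counter, 1);
}

// libcu++ equivalent with the scope and memory order spelled out explicitly.
__global__ void add_with_atomic_ref(int *counter)
{
    cuda::atomic_ref<int, cuda::thread_scope_device> ref(*counter);
    ref.fetch_add(1, cuda::std::memory_order_relaxed);
}

// Hypothetical host helper: runs both kernels on one counter and returns
// the final value, or -1 if any CUDA call fails.
int run_both(int blocks, int threads)
{
    int *counter = nullptr;
    int result = -1;
    if (cudaMalloc(&counter, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(counter, 0, sizeof(int)) == cudaSuccess) {
        add_with_intrinsic<<<blocks, threads>>>(counter);
        add_with_atomic_ref<<<blocks, threads>>>(counter);
        // cudaMemcpy on the default stream waits for both kernels to finish.
        if (cudaMemcpy(&result, counter, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1;
    }
    cudaFree(counter);
    return result;
}
```

Calling `run_both(4, 64)` should return 512, since every thread adds one in each of the two kernels. The sketch only shows what the operations look like; it says nothing about which cache level services them.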

Robert_Crovella · November 3, 2023, 6:58pm

Not really an answer to all your questions. And not directly responsive to what you have quoted out of the PTX guide.

But I have generally heard and been led to believe that CUDA global atomics (device-scoped) get “resolved” in the L2. This isn’t super-well documented by NVIDIA AFAIK; however, with some careful searching it’s possible to find various references to the idea, such as here and here. So, regarding the idea that device-scoped atomic operations cannot hit in the L1 cache and bypass it:

I would generally agree with that statement. AFAIK the L1 is not involved in the resolution of global atomics, except possibly for some kind of invalidation of L1 contents. I cannot support that idea, but acknowledge it might be the case. It also might not be the case, so I personally would not write code that loads data into the L1 and also does global atomics on it. However, a simple test suggests to me that L1 contents may be invalidated on a corresponding global atomic.
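
A test of this kind might look like the following sketch. It is not the test actually used in this thread, only one plausible shape for it: a single thread reads a word with an L1-caching load (`__ldca`), performs a device-scoped `atomicAdd` on the same address, and reads it again the same way. The names `ProbeResult`, `l1_probe`, and `run_l1_probe` are hypothetical.

```cuda
#include <cuda_runtime.h>

struct ProbeResult {
    int before;       // value seen by the first L1-caching load
    int after;        // value seen by the second L1-caching load
    int final_value;  // value in global memory after the kernel
};

// One thread: cached load, device-scoped atomic, cached load again.
__global__ void l1_probe(int *word, ProbeResult *out)
{
    out->before = __ldca(word);  // ld.global.ca: cache at all levels, incl. L1
    atomicAdd(word, 1);          // device-scoped atomic on the same address
    out->after = __ldca(word);   // a stale L1 line would return the old value
}

// Hypothetical host helper: returns false if any CUDA call fails.
bool run_l1_probe(ProbeResult *host_out)
{
    int *word = nullptr;
    ProbeResult *dev_out = nullptr;
    bool ok = cudaMalloc(&word, sizeof(int)) == cudaSuccess
           && cudaMalloc(&dev_out, sizeof(ProbeResult)) == cudaSuccess
           && cudaMemset(word, 0, sizeof(int)) == cudaSuccess;
    if (ok) {
        l1_probe<<<1, 1>>>(word, dev_out);
        // Each cudaMemcpy waits for the kernel on the default stream.
        ok = cudaMemcpy(host_out, dev_out, sizeof(ProbeResult),
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(&host_out->final_value, word, sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(word);     // cudaFree(nullptr) is a harmless no-op
    cudaFree(dev_out);
    return ok;
}
```

If `after` comes back as 1, the second load did not return a stale L1 copy; if it comes back as 0, it did. Either outcome only describes one thread on one GPU and one compiler version, and it is worth inspecting the generated SASS to confirm that the loads and the atomic were emitted in the order written before drawing any conclusion.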
