I was going through the NVIDIA PTX documentation and came across mmio operations (8.4.1). A few statements in the description made me curious.
CUDA atomicity requirements at the specified scope:
1. Writes are always performed and are never combined within the scope specified.
2. Reads are always performed, and are not forwarded, prefetched, combined, or allowed to hit any cache within the scope specified.
This also points to libcu++. However, there is no mention of atomicity requirements here.
I wanted to know whether the two statements about writes and reads also hold for atomic operations (atom.global in PTX, or atomicAdd in CUDA).
For example, can it be inferred from the above statements that device-scoped atomic operations cannot hit in the L1 cache and instead bypass it?
This is not really an answer to all your questions, and it is not directly responsive to what you have quoted out of the PTX guide.
But I have generally heard and been led to believe that CUDA global atomics (device-scoped) get “resolved” in the L2. This isn’t super-well documented by NVIDIA AFAIK; however, with some careful searching it’s possible to find various references to the idea, such as here and here. So, regarding your question about whether device-scoped atomics bypass the L1 cache:
I would generally agree with that statement. AFAIK the L1 is not involved in the resolution of global atomics, except possibly for some kind of invalidation of L1 contents. I cannot point to documentation that supports the invalidation idea, so it might or might not be the case, and I personally would not write code that loads data into the L1 and also does global atomics on it. However, a simple test suggests to me that L1 contents may be invalidated on a corresponding global atomic.
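For anyone who wants to poke at this themselves, here is a minimal sketch of one way such a test could look. It is not necessarily the test referred to above, and the names `load_ca`, `probe` and `run_probe` are made up for illustration. It assumes the `.ca` cache operator actually places the line in L1 on your GPU, which is architecture dependent, and it only covers the single-thread case; a cross-SM variant would be needed to say anything about other SMs' L1 caches. Treat whatever it prints as an observation, not a documented guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: global load with the .ca operator (cache at all levels,
// L1 included). volatile + "memory" keeps the compiler from merging the two loads.
__device__ int load_ca(const int *p) {
    int v;
    asm volatile("ld.global.ca.s32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
    return v;
}

__global__ void probe(int *data, int *result) {
    result[0] = load_ca(data);  // first load: may bring the line into L1
    atomicAdd(data, 1);         // device-scope atomic on the same address
    result[1] = load_ca(data);  // reload: shows whether a stale L1 line is served
}

// Runs the probe once; returns 0 on success, 1 on any CUDA error.
int run_probe(int *before, int *after) {
    int *data = nullptr, *result = nullptr;
    int host[2] = {-1, -1};
    if (cudaMalloc(&data, sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc(&result, 2 * sizeof(int)) != cudaSuccess) {
        cudaFree(data);
        return 1;
    }
    cudaError_t err = cudaMemset(data, 0, sizeof(int));
    if (err == cudaSuccess) {
        probe<<<1, 1>>>(data, result);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(host, result, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(result);
    if (err != cudaSuccess) return 1;
    *before = host[0];
    *after = host[1];
    return 0;
}

int main() {
    int before = -1, after = -1;
    if (run_probe(&before, &after) != 0) {
        printf("CUDA error\n");
        return 1;
    }
    printf("before atomic: %d, after atomic: %d\n", before, after);
    return 0;
}
```

If the second value reflects the atomic's update, the reload was not satisfied from a stale L1 line, which would be consistent with (but does not prove) the atomic being resolved in L2 and the L1 line being invalidated or bypassed.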