I’m trying to figure out whether load and store operations on primitive types are atomic when loading/storing from shared memory in CUDA.
On the one hand, any such load/store seems to be compiled to the PTX instruction
ld.weak.shared, which does not enforce atomicity. On the other hand, the programming guide says (in the shared-memory section) that conflicting accesses are serialized:
“However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized”
which hints at load/store atomicity “by default” in shared memory. Thus, would the instruction
ld.relaxed.cta.shared have the same effect?
Or is the memory-order qualifier just information the compiler needs anyway, so it doesn’t optimize the loads and stores away?
More generally, supposing variables are properly aligned, would
__shared__ int and
__shared__ cuda::atomic<int, cuda::thread_scope_block> provide the same guarantees (when considering only load and store operations)?
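To make the comparison concrete, here is the kind of access pattern I have in mind (an illustrative sketch; the kernel names are mine, and I’m assuming libcu++’s <cuda/atomic> header is available):

```cuda
#include <cuda/atomic>

// Variant 1: plain shared int, plain loads/stores.
__global__ void kernel_plain(int *out) {
    __shared__ int flag;
    if (threadIdx.x == 0)
        flag = 42;                 // plain store
    __syncthreads();
    out[threadIdx.x] = flag;       // plain load
}

// Variant 2: same pattern through cuda::atomic with relaxed ordering.
__global__ void kernel_atomic(int *out) {
    __shared__ cuda::atomic<int, cuda::thread_scope_block> flag;
    if (threadIdx.x == 0)
        flag.store(42, cuda::std::memory_order_relaxed);   // relaxed store
    __syncthreads();
    out[threadIdx.x] = flag.load(cuda::std::memory_order_relaxed);  // relaxed load
}
```

The question is whether, restricted to loads and stores like these, the two variants give the same guarantees.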
Bonus (relevant) question: with a primitive data type properly aligned, stored in global memory, accessed by threads from a single block, are
__device__ int and
__device__ cuda::atomic<int, cuda::thread_scope_block> equivalent in terms of atomicity of load/store operations?
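Again as a sketch of what I mean (this assumes libcu++’s cuda::atomic_ref; the kernel name is made up): a plain __device__ int can also be accessed through an atomic_ref, which is effectively the comparison I’m asking about:

```cuda
#include <cuda/atomic>

__device__ int g_value;  // plain, properly aligned __device__ int

__global__ void writer_reader(int *out) {
    // Wrap the plain int to get explicitly relaxed atomic loads/stores,
    // scoped to the block (all accessing threads belong to one block here).
    cuda::atomic_ref<int, cuda::thread_scope_block> ref(g_value);
    if (threadIdx.x == 0)
        ref.store(1, cuda::std::memory_order_relaxed);   // relaxed store
    __syncthreads();
    out[threadIdx.x] = ref.load(cuda::std::memory_order_relaxed);  // relaxed load
}
```

Is this any stronger, for loads and stores alone, than just reading and writing g_value directly?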
Thanks for any insight.