I’m trying to figure out whether load and store operations on primitive types are atomic when reading from and writing to shared memory in CUDA.
On the one hand, any load/store appears to compile to the PTX instruction ld.weak.shared.cta,
which does not enforce atomicity. On the other hand, the manual says that accesses to the same bank are serialized (9.2.3.1):
However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized
which hints at load/store atomicity “by default” in shared memory. So would the instructions ld.weak.shared.cta
and ld.relaxed.shared.cta
have the same effect?
Or is the relaxed qualifier simply information the compiler needs anyway, to avoid optimizing loads and stores away?
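For concreteness, here is the kind of minimal kernel I have in mind (just a sketch: the kernel and variable names are made up, and I use cuda::atomic_ref, available in recent libcu++ versions, only to force an explicitly relaxed access next to a plain one). My expectation is that the plain accesses show up as the weak form and the relaxed load as the relaxed form when inspecting the PTX, e.g. with nvcc -ptx:

```cpp
#include <cuda/atomic>

// Sketch: the same __shared__ int read once with a plain load and once
// through cuda::atomic_ref with memory_order_relaxed, so the PTX emitted
// for the two forms can be compared.
__global__ void compare_shared_loads(int* out)
{
    __shared__ int s;

    if (threadIdx.x == 0) {
        s = 42;             // plain store to shared memory
    }
    __syncthreads();

    int plain = s;          // plain load

    cuda::atomic_ref<int, cuda::thread_scope_block> ref(s);
    int relaxed = ref.load(cuda::std::memory_order_relaxed);   // explicitly relaxed load

    out[threadIdx.x] = plain + relaxed;   // keep both loads observable
}
```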
More generally, assuming the variables are properly aligned, would __shared__ int
and __shared__ cuda::atomic<int, cuda::thread_scope_block>
provide the same guarantees (when considering only load and store operations)?
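To make this comparison concrete, these are the two variants I have in mind, restricted to plain load()/store() on the atomic side, with one thread storing while the other threads read the same location (again just a sketch with made-up names). If each individual access is atomic, every reader should observe either 0 or 123456789, never a torn value:

```cpp
#include <cuda/atomic>

// Variant A: plain __shared__ int, accessed only with ordinary loads/stores.
__global__ void variant_plain(int* out)
{
    __shared__ int value;
    if (threadIdx.x == 0) value = 0;
    __syncthreads();

    if (threadIdx.x == 0) {
        value = 123456789;                    // plain store
    } else {
        out[threadIdx.x] = value;             // plain load, concurrent with the store
    }
}

// Variant B: block-scoped cuda::atomic, using only load()/store().
__global__ void variant_atomic(int* out)
{
    __shared__ cuda::atomic<int, cuda::thread_scope_block> value;
    if (threadIdx.x == 0) value.store(0, cuda::std::memory_order_relaxed);
    __syncthreads();

    if (threadIdx.x == 0) {
        value.store(123456789, cuda::std::memory_order_relaxed);          // relaxed store
    } else {
        out[threadIdx.x] = value.load(cuda::std::memory_order_relaxed);   // relaxed load
    }
}
```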
Bonus (related) question: with a properly aligned primitive data type stored in global memory and accessed by threads of a single block, are __device__ int
and __device__ cuda::atomic<int, cuda::thread_scope_block>
equivalent in terms of atomicity of load/store operations?
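And here is a sketch of the global-memory counterpart I have in mind (made-up names again, assuming out has room for 2 * blockDim.x ints and that defining a cuda::atomic as a namespace-scope __device__ variable compiles with the toolkit at hand), with all accesses coming from a single block:

```cpp
#include <cuda/atomic>

__device__ int plain_value;                                           // plain global int
__device__ cuda::atomic<int, cuda::thread_scope_block> atomic_value;  // block-scoped atomic in global memory

// Only a single block ever touches these two variables.
__global__ void single_block_kernel(int* out)
{
    if (threadIdx.x == 0) {
        plain_value = 123456789;                                          // plain store
        atomic_value.store(123456789, cuda::std::memory_order_relaxed);   // relaxed store
    } else {
        out[2 * threadIdx.x]     = plain_value;                                        // plain load
        out[2 * threadIdx.x + 1] = atomic_value.load(cuda::std::memory_order_relaxed); // relaxed load
    }
}
```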
Thanks for any insight.