Which write operations are atomic in CUDA?

Multiple threads will be computing a large array in shared memory. So that threads do not interfere, I need to know which writes are atomic in CUDA runtime 9.0.

In other words, if I write C code

z=x

will the write be atomic if x and z are 8-bit (unsigned char), 16-bit (unsigned short), 32-bit (unsigned int), or 64-bit (unsigned long long)?

By “atomic”, I mean that threads are guaranteed not to overwrite memory outside the intended size. A write might not be atomic if, for instance, storing a byte was implemented by the hardware as reading an 8-byte value into a register, replacing the byte being stored, then writing back the 8-byte value.

Does the answer differ for global memory?

According to your definition, any value written to a naturally aligned location, in any memory space in CUDA, is atomic.

I’m reasonably sure that C-language semantics would not be maintained if that were not the case.

I have trouble imagining any current processor not fitting that definition.

I guess maybe x86 processors become “non-atomic” (although by that I don’t mean that they write outside of the intended location(s), which would be bizarre behavior for any processor IMO) if you write to a non-aligned location. Such writes are illegal in CUDA and will generate a machine fault (corrupted context), reportable by cuda-memcheck or proper runtime API error checking.

OK, that’s what I suspected, but I just wanted confirmation. In older Intel processors, unaligned loads/stores were not atomic.

I’m not sure what you mean by “C language semantics”. I was not aware C took account of multiple threads, other than using “volatile”.

A single thread, writing to data packed in an array, and outside the element you were writing to, could violate C language semantics, according to my way of thinking. I don’t need multiple threads to see that that is broken.

Unaligned loads/stores on older Intel processors were perhaps non-atomic, but I’m quite sure they did not write anything to locations that should not be written to. And this statement also has nothing to do with multiple threads.

I don’t think your definition of non-atomic makes sense, and I don’t think it is what was meant in the context of any possible non-atomicity of older Intel processors.

We don’t disagree; I should have been more precise.

Multiple threads (but not a single thread) can cause unexpected results if writes are not atomic and one thread overwrites another’s memory. For example, if memory reads and writes are done as aligned words, and a byte store is implemented by reading a word, changing one byte, and writing the result back, two threads operating on different bytes of the same word can produce an incorrect result. The effect appears as thread 1 writing outside the 1-byte boundary it was supposed to, when its write sequence is interleaved with thread 2’s.

For the Intel example, see the Intel Processor manual vol. 3, section 8.1.1. This is referring to multiple processors, not multiple threads, but since I don’t know how the NVIDIA hardware works, I was asking to be sure.

You’re not completely wrong to worry. Bitwise memory updates can indeed run into thread-ordering problems during multiple writes. You might be using the valid but uncommon C/C++ bit-field structures. Writing to a structure’s bit is likely translated into SASS as reading a byte, updating the byte with the bit change, and writing the byte back, and that multi-instruction sequence could indeed overlap with another thread’s, so writes even to different bits could interfere.

Any situation in CUDA where 2 or more threads write to the same address is considered undefined (with a few minor exceptions - e.g. all threads writing the same value). Writing bits is just one example of UB in that category.