Which write operations are atomic in CUDA?

Multiple threads will be computing a large array in shared memory. So that threads do not interfere, I need to know which writes are atomic in CUDA runtime 9.0.

In other words, if I write C code

z=x

will the write be atomic if x and z are 8-bit (unsigned char), 16-bit (unsigned short), 32-bit (unsigned int), or 64-bit (unsigned long long)?

By “atomic”, I mean that threads are guaranteed not to overwrite memory outside the intended size. A write might not be atomic if, for instance, storing a byte was implemented by the hardware as reading an 8-byte value into a register, replacing the byte being stored, then writing back the 8-byte value.

Does the answer differ for global memory?

According to your definition, any value written to a naturally aligned location, in any memory space in CUDA, is atomic.

I’m reasonably sure that C-language semantics would not be maintained if that were not the case.

I have trouble imagining any current processor not fitting that definition.

I guess maybe x86 processors become “non-atomic” (although by that I don’t mean that they write outside of the intended location(s), which would be bizarre behavior for any processor IMO) if you write to a non-aligned location. Such writes are illegal in CUDA and will generate a machine fault (corrupted context), reportable by cuda-memcheck or proper runtime API error checking.

OK, that’s what I suspected, but I just wanted confirmation. In older Intel processors, unaligned loads/stores were not atomic.

I’m not sure what you mean by “C language semantics”. I was not aware C took account of multiple threads, other than using “volatile”.

A single thread, writing to data packed in an array, and outside the element you were writing to, could violate C language semantics, according to my way of thinking. I don’t need multiple threads to see that that is broken.

Unaligned loads/stores on older Intel processors were perhaps non-atomic, but I’m quite sure they did not write anything to locations that should not be written to. And this statement also has nothing to do with multiple threads.

I don’t think your definition of non-atomic makes sense, and I don’t think it is what was meant in the context of any possible non-atomicity of older Intel processors.

We don’t disagree; I should have been more precise.

Multiple threads (but not a single thread) can cause unexpected results if writes are not atomic and one thread overwrites another’s memory. For example, if memory reads and writes are done as aligned words, and a byte store is implemented by reading a word, changing one byte, and writing the result back, two threads operating on different bytes of the same word can produce an incorrect result. The effect appears as thread 1 writing outside the 1-byte boundary it was supposed to, when its write sequence is interleaved with thread 2’s.

For the Intel example, see the Intel Processor manual vol. 3, section 8.1.1. This is referring to multiple processors, not multiple threads, but since I don’t know how the NVIDIA hardware works, I was asking to be sure.

You’re not completely wrong to worry. Bitwise memory updates can indeed run into thread-ordering problems during multiple writes. You might be using the valid but uncommon C/C++ bit-field structures. Writing to a structure’s bit is likely translated into SASS as reading a byte, updating the byte with the bit change, and writing the byte back, and that multi-instruction sequence could indeed overlap with another thread’s, so writes even to different bits could interfere.

Any situation in CUDA where 2 or more threads write to the same address is considered undefined (with a few minor exceptions - e.g. all threads writing the same value). Writing bits is just one example of UB in that category.