Atomicity of int (4-byte word) global memory write

If my kernel is writing integers to global memory, is there a possibility for an external observer (i.e. the host during kernel execution) to observe that only a part of the integer has been written while the rest hasn’t?

E.g. given that memory pointed to by (int *glob_mem) has been initialized with 0, and the following kernel is executed, could the host observe any value different from either 0 or 0x42424242?

//Kernel body
glob_mem[blockDim.x * blockIdx.x + threadIdx.x] = 0x42424242;

I studied the programming guide, but couldn’t confirm that hypothesis and I believe that’d be helpful to know.

Thank you!

As long as the access is properly aligned, the int data is written in its entirety in one fell swoop. Data on the GPU must be “naturally aligned”, that is, the address must be a multiple of the access size, which is 4 bytes in this case. The behavior of unaligned writes is undefined.
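To make the alignment condition concrete, here is a minimal sketch that checks whether an address is a multiple of the access size. The helper name is_aligned is hypothetical and not part of the CUDA API. The sketch relies on the documented guarantee that pointers returned by cudaMalloc are aligned to at least 256 bytes.

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): true if the address p is a
// multiple of the access size in bytes.
bool is_aligned(const void *p, std::size_t access_size)
{
    return reinterpret_cast<std::uintptr_t>(p) % access_size == 0;
}

int main()
{
    int *glob_mem = nullptr;
    if (cudaMalloc(&glob_mem, 1024 * sizeof(int)) != cudaSuccess) {
        return 1;
    }

    // The allocation base is aligned to at least 256 bytes, so every
    // element glob_mem[i] sits on a 4-byte boundary.
    bool good = is_aligned(glob_mem, sizeof(int)) &&
                is_aligned(glob_mem + 7, sizeof(int));

    // An address one byte past the base is NOT naturally aligned for int.
    // It is only inspected here, never used for a device access.
    char *raw = reinterpret_cast<char *>(glob_mem);
    bool skewed = is_aligned(raw + 1, sizeof(int));

    printf("element aligned: %d, skewed address aligned: %d\n", good, skewed);
    cudaFree(glob_mem);
    return (good && !skewed) ? 0 : 1;
}
```

In ordinary code that indexes an int array obtained from cudaMalloc, natural alignment comes for free. The risk mostly arises when casting byte pointers with arbitrary offsets to wider types.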

However, your assumption that the value in that location is either 0 or 0x42424242 will not hold unless you explicitly initialized the location to 0, prior to writing it with 0x42424242. In particular, the global memory location may still hold data placed there by a previous kernel, which sometimes allows bugs to go undetected.
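As an illustration of the initialization point, the following sketch clears the buffer with cudaMemset before launching the kernel, so that 0 and 0x42424242 are the only values the location can hold. The kernel name fill and the buffer size are assumptions made for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed kernel name; the body matches the store from the question.
__global__ void fill(int *glob_mem, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        glob_mem[i] = 0x42424242;
    }
}

int main()
{
    const int n = 1 << 20;
    int *glob_mem = nullptr;
    if (cudaMalloc(&glob_mem, n * sizeof(int)) != cudaSuccess) {
        return 1;
    }

    // cudaMalloc does not clear memory, so stale data from an earlier
    // kernel may still be present. Zero it explicitly.
    cudaMemset(glob_mem, 0, n * sizeof(int));

    fill<<<(n + 255) / 256, 256>>>(glob_mem, n);

    // This blocking copy is ordered after the kernel on the default stream.
    std::vector<int> host(n);
    cudaError_t err = cudaMemcpy(host.data(), glob_mem, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int v : host) {
        if (v != 0x42424242) {
            ++mismatches;
        }
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(glob_mem);
    return (err == cudaSuccess && mismatches == 0) ? 0 : 1;
}
```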

That’s a great insight, thank you njuffa! That’s what I hoped to be the case, but couldn’t find any statement to support that theory.

Regarding the either ‘0’ or ‘0x42424242’, I’ve mentioned that “memory […] has been initialized with 0”, so I’m aware of that - at least :-)

Note that the issue can be trickier for data objects that exceed the 4-byte size or are of compound type. In those cases one would want to check the store instruction at the SASS (machine code) level to make sure all data is written out in a single ST.64 or ST.128 instruction. In addition, when sharing data between host and device, one would want to establish that the host is also accessing the data in a single access.
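A minimal sketch of the compound-type case, assuming the built-in int2 vector type (8 bytes, 8-byte aligned) is an acceptable container for the data. The kernel name write_pairs and the stored values are made up for illustration. Whether the compiler really emits one 64-bit store is not guaranteed by the source code alone, so the generated SASS still has to be inspected, for example with cuobjdump -sass on the compiled binary.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// int2 is declared with 8-byte alignment, a precondition for a 64-bit store.
static_assert(sizeof(int2) == 8 && alignof(int2) == 8,
              "int2 is expected to be 8 bytes with 8-byte alignment");

// Assumed kernel name. Each thread assigns one whole int2 object.
__global__ void write_pairs(int2 *out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        // Whole-object assignment. Confirm in the SASS that this becomes
        // a single 64-bit store rather than two 32-bit stores.
        out[i] = make_int2(0x42424242, 0x24242424);
    }
}

int main()
{
    const int n = 4096;
    int2 *out = nullptr;
    if (cudaMalloc(&out, n * sizeof(int2)) != cudaSuccess) {
        return 1;
    }
    cudaMemset(out, 0, n * sizeof(int2));

    write_pairs<<<(n + 255) / 256, 256>>>(out, n);

    std::vector<int2> host(n);
    cudaError_t err = cudaMemcpy(host.data(), out, n * sizeof(int2),
                                 cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (const int2 &v : host) {
        if (v.x != 0x42424242 || v.y != 0x24242424) {
            ++mismatches;
        }
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(out);
    return (err == cudaSuccess && mismatches == 0) ? 0 : 1;
}
```

This only covers the device side. A host that polls such a location while the kernel is running would also need to read it with a single 8-byte load.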

Note that none of this is usually sufficient if there are multiple producers, in which case you would want to investigate compare-exchange operations or atomic ALU operations to prevent race conditions.
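For the multiple-producer case, a rough sketch using the standard atomicCAS and atomicAdd device functions could look like the following. The kernel name produce, the "0 means empty" convention, and the launch configuration are assumptions chosen for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed kernel name. Every thread acts as a producer competing for one slot.
__global__ void produce(int *slot, unsigned int *count)
{
    // Nonzero producer id, since 0 is used here to mean "slot empty".
    int id = blockDim.x * blockIdx.x + threadIdx.x + 1;

    // Compare-exchange: only the first producer that finds the slot
    // empty installs its id; all later attempts leave it unchanged.
    atomicCAS(slot, 0, id);

    // Atomic ALU operation: no increment is lost even under contention.
    atomicAdd(count, 1u);
}

int main()
{
    const int blocks = 4, threads = 64;
    int *slot = nullptr;
    unsigned int *count = nullptr;
    if (cudaMalloc(&slot, sizeof(int)) != cudaSuccess ||
        cudaMalloc(&count, sizeof(unsigned int)) != cudaSuccess) {
        return 1;
    }
    cudaMemset(slot, 0, sizeof(int));
    cudaMemset(count, 0, sizeof(unsigned int));

    produce<<<blocks, threads>>>(slot, count);

    int h_slot = 0;
    unsigned int h_count = 0;
    cudaMemcpy(&h_slot, slot, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_count, count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    printf("winning producer id: %d, increments: %u\n", h_slot, h_count);

    cudaFree(slot);
    cudaFree(count);

    // Exactly one valid id must have won, and every increment must be counted.
    bool ok = h_slot >= 1 && h_slot <= blocks * threads &&
              h_count == static_cast<unsigned int>(blocks * threads);
    return ok ? 0 : 1;
}
```

Which producer wins the slot depends on scheduling and is not deterministic. The atomics only guarantee that exactly one of them does.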