Concurrent writes by different blocks in a kernel

I’m looking for an NVidia employee or community expert to answer the following. If I launch a kernel and try to perform the following operation:

acc[threadIdx.x] += r[threadIdx.x]

…where acc is an array in global device memory. Clearly, no two threads in the same block will try to write to the same position because of threadIdx.x as an index. However, two threads from different blocks may try to write to acc[threadIdx.x] variable at once if two blocks are executing concurrently. At the moment, my best solution is to try to make the block so many threads that blocks effectively never run concurrently, but this is inelegant.

My questions are:

Is the += operator atomic?

If two blocks have different threads that simultaneously write to the same location in device memory, will the memory writes stall or will they be corrupted?

Are the results even well-defined?

+= is not atomic
results are undefined whichever thread updates last that will be the result. i.e it will overwrite any previous content
can use AtomicAdd

+= is actually shorthand for

  1. load from global memory to register (takes many clock cycles)
  2. add registers
  3. store from register to global memory (takes many clock cycles)

So considering the following scenario with two blocks R and S and an initial value in acc[threadIdx.x] of 9
thread in R asks for the data
thread in S asks for the data
both read value of 9
both do the register addition
( R adds say 3 so its going to write back 12, S adds say 5 so its going to write back 14)
both issue a store
but it depends which store is applied last as to what the final value will be, and that depends on many factors.
So in this case final result would be either 12 or 14

Hello! I’ve another question, suppose that I don’t care about the order in which writes are performed.

Is the single write atomic or it could result in an undefined value? (eg. neither 12 nor 24)

Thank you :)

A single memory write of 1, 4, 8, or 16 bytes is atomic (assuming it compiles to a single st.global.?? ptx instruction) (i.e. char, float, float2, float4). Writes of composite structures (or anything else that results in the compiler issuing multiple st.global instructions) will not be atomic.

Very clear! Thank you! External Image