Concurrent writes by different blocks in a kernel

surethought · August 29, 2011, 11:28pm

I’m looking for an NVidia employee or community expert to answer the following. If I launch a kernel and try to perform the following operation:

acc[threadIdx.x] += r[threadIdx.x]

…where acc is an array in global device memory. Clearly, no two threads in the same block will try to write to the same position because of threadIdx.x as an index. However, two threads from different blocks may try to write to acc[threadIdx.x] variable at once if two blocks are executing concurrently. At the moment, my best solution is to try to make the block so many threads that blocks effectively never run concurrently, but this is inelegant.

My questions are:

Is the += operator atomic?

If two blocks have different threads that simultaneously write to the same location in device memory, will the memory writes stall or will they be corrupted?

Are the results even well-defined?

kbam · August 29, 2011, 11:53pm

+= is not atomic
results are undefined whichever thread updates last that will be the result. i.e it will overwrite any previous content
can use AtomicAdd

+= is actually shorthand for

load from global memory to register (takes many clock cycles)
add registers
store from register to global memory (takes many clock cycles)

So considering the following scenario with two blocks R and S and an initial value in acc[threadIdx.x] of 9
thread in R asks for the data
thread in S asks for the data
both read value of 9
both do the register addition
( R adds say 3 so its going to write back 12, S adds say 5 so its going to write back 14)
both issue a store
but it depends which store is applied last as to what the final value will be, and that depends on many factors.
So in this case final result would be either 12 or 14

radome · December 10, 2011, 3:26pm

Hello! I’ve another question, suppose that I don’t care about the order in which writes are performed.

Is the single write atomic or it could result in an undefined value? (eg. neither 12 nor 24)

Thank you :)

DrAnderson42 · December 13, 2011, 8:04pm

A single memory write of 1, 4, 8, or 16 bytes is atomic (assuming it compiles to a single st.global.?? ptx instruction) (i.e. char, float, float2, float4). Writes of composite structures (or anything else that results in the compiler issuing multiple st.global instructions) will not be atomic.

radome · December 14, 2011, 11:51am

Very clear! Thank you! External Image

Topic		Replies	Views
Writes in same memory location Cant add numbers from different threads? CUDA Programming and Performance	46	25880	July 5, 2007
Threads and Race Condition CUDA Programming and Performance	11	3059	April 30, 2012
Any locking mechanism? CUDA Programming and Performance	9	3778	July 25, 2007
Missing writes to global mem CUDA Programming and Performance	3	1115	April 22, 2009
Parralel += on argument array Can this work CUDA Programming and Performance	5	3221	December 15, 2007
simple cdot implementation beginner question CUDA Programming and Performance	4	5836	August 21, 2007
Clarification on Memory Access issue CUDA Programming and Performance	1	3753	September 9, 2009
How to use multithread to accumulate one variable CUDA Programming and Performance	3	2269	October 24, 2008
thread writing into global memory (thread sync) CUDA Programming and Performance	2	1603	August 23, 2009
Good programming practice Writing shared & global memory CUDA Programming and Performance	13	8030	July 20, 2007

Concurrent writes by different blocks in a kernel

Related topics