Threads and Race Condition

Hello everyone.

I was under the assumption that if many threads need to write to the same location in global memory, the scheduler would serialize and protect the integrity of all writes? Such as:


int tx = …;

int ty = …;

int i = threadIdx.x;

// thread i in block (tx,ty) does some calculations to obtain some kind of update value

int val = …;

gmem[tx][ty] += val;


But perhaps my assumption is wrong. I thought the += would work, but it doesn’t?



That is exactly what atomic functions are for. Check out Appendix B.10 of the Programming Guide.

Okay, that makes sense. Is there an atomicAdd method that supports double precision? If not, I’m screwed. Ugh, just read fp one only exists for device 2.0 or higher. I’m at 1.3, so I’m sure that’s a no go.

I might be extrapolating from that code fragment too much, but are you needing to accumulate one double per block? If that is the case, you can do a reduction in shared memory, then have thread 0 write out the sum at the end of the kernel.

I will explain. Each thread block uses x amount of threads to do some calculations and provide an update calculation to the same GMEM location. But a different thread block uses all of its x thread for a different GMEM location than the first thread block I mentioned. So, there’s a problem when each x thread in a given thread block try to += a value in GMEM. Apparently some of the values get stepped on and don’t write properly.

Yes, these are double values I’m updating with. If these were integers, I think I would be okay.

Seibert has a point here. However, if you still need an atomicAdd for double values, try this:

__device__ double atomicAdd(double *address, double value)


	long long oldval, newval, readback;

	oldval = __double_as_longlong(*address);

	newval = __double_as_longlong(__longlong_as_double(oldval) + value);

	while ((readback=atomicCAS((long long *)address, oldval, newval)) != oldval) {

		oldval = readback;

		newval = __double_as_longlong(__longlong_as_double(oldval) + value);


	return __longlong_as_double(oldval);


It works in global memory, but not in shared. Fermi’s float atomicAdd() would of course be a lot faster.

Thanks, tera. I will take a look at this.

Yes, that’s my point. The threads which need to communicate (to create a sum of their values) are all in the same block. Threads that need to communicate in the same block => shared memory is your friend. :) A parallel reduction is a fast way to sum many values without requiring atomics, just a thread barrier. Since CUDA provides a very convenient and fast thread barrier for threads in the same block, parallel reduction in shared memory should be pretty quick.

This talk by Mark Harris explains the idea with pictures, although he is focused on a reduction in global memory, which means launching many kernels in a row:…c/reduction.pdf

There is lot of chance that “atomicCAS” spin – causing infinite loops… — It assumes certain behaviour on warp scheduler and warp divergence…

I have seen such deadlocks before…

I can only see that it could livelock, I don’t see how it could cause a deadlock.

Do you have a better solution?

Even simpler, if the sum at the end of your kernel is insignificant compared to the rest of the kernel runtime, you can even make thread 0 do the shared memory sum all by itself. No point in making your code prematurely complicated for a 5% speed boost.

:shock: The same problem also occurs to my code. could you explain a bit more on the assumption that atomicCAS makes on the warp schedular’s behavior and the divergence?