Race condition?


I posted this question in the Linux section but I didn't get a reply, so I'll try it in this forum.

Can someone explain to me why, in the following case, an atomicAdd is needed?

My kernel does a certain check on an array, the index of which is determined by the thread ID. Depending on the result of the check,
it sets a global (device) variable “update” to 1, otherwise it leaves it untouched.

Like this:

if (condition depending on array data) update=1;

So it is possible that multiple threads find the condition fulfilled, and hence they may simultaneously set "update" to 1. But that shouldn't matter, as long as it has been set at least once. However, this seems to be a race condition: sometimes, after reading "update" back in the host code, it is 0 when it should actually be 1. Using atomicAdd(&update, 1) in the kernel solves the problem.
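For reference, the pattern described above looks roughly like this (a sketch, not the original code; the kernel name, array, and condition are placeholders):

```cuda
// Device-side flag; assumed to be zeroed from the host before launch.
__device__ int update;

__global__ void check_kernel(const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Plain (non-atomic) store: several threads may write 1 concurrently.
    if (data[i] > 0.0f)   // placeholder condition
        update = 1;
}
```

Concurrent stores of the same value 1 to the same int are not, by themselves, a problem: whichever store lands last still leaves a 1 in memory.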

Can anybody enlighten me on the type of mistake I made?

kind regards,

If there is no __syncthreads() between one thread writing the global variable and other threads reading it, then it is not defined whether a reading thread will see the updated value or the old one.


Yes, I understand. But I am actually reading out the device variable via cudaMemcpy AFTER the kernel has finished. So in this case it shouldn't matter whether one or many threads set update=1, as long as it has been set to 1 at least once (before the kernel launch it was 0). However, this does seem to constitute a race condition…
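The host-side flow being described here is roughly the following (a sketch with assumed names; `check_kernel` and `update` stand in for the poster's actual kernel and flag):

```cuda
int host_update = 0;

// Reset the device flag before the launch.
cudaMemcpyToSymbol(update, &host_update, sizeof(int));

check_kernel<<<blocks, threads>>>(d_data, n);

// Read the flag back after the kernel. Copies involving the host,
// such as cudaMemcpyFromSymbol, wait for previously queued kernels
// on the default stream to finish before transferring.
cudaMemcpyFromSymbol(&host_update, update, sizeof(int));

if (host_update) {
    /* ... react to the flag ... */
}
```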

kind regards,


I thought that kernel calls were asynchronous. Wouldn't you need a cudaThreadSynchronize() before the cudaMemcpy, or will the next CUDA function call wait for the kernel's completion?

cudaMemcpy() has an implicit thread synchronize. It will not run until the previously queued kernels have finished executing and all writes have been flushed to the CUDA device memory.

Your global device variable could reside in the same minimum read unit (a cache line, in CPU terms) as other variables.

Try padding the memory for your global variable to 128 bytes or more to see if the problem goes away.

Could you provide a simple piece of code, ready to compile and run, which exemplifies the problem?
Also, which version of CUDA are you using? If I am not mistaken, some old versions of the compiler were able to transform code like

if (condition) update = 1;

into something like

update = condition ? 1 : update;

which is perfectly fine for a single-threaded app but totally wrong in a parallel program: every thread then writes to update, and a thread whose condition is false can overwrite another thread's 1 with the stale 0 it read earlier. Nevertheless, I hope that is not the case here.
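If the compiler did rewrite the conditional store into an unconditional read-modify-write like that, an atomic store makes the flag immune to the problem. A sketch (my own suggestion, not from the thread, using atomicExch rather than the atomicAdd mentioned above; names and condition are placeholders):

```cuda
__device__ int update;

__global__ void check_kernel(const float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // atomicExch performs the store as one indivisible operation, so no
    // separate read-then-write sequence can resurrect the old 0. Unlike
    // atomicAdd, the flag stays exactly 1 however many threads set it.
    if (data[i] > 0.0f)   // placeholder condition
        atomicExch(&update, 1);
}
```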