Writing to global memory from multiple threads and race conditions

Just want to make sure: I have a kernel running over many threads (typically 1024). Each thread calculates a number and adds it to a stream in global memory, something like

__global__ void mykern(float *data){




This is basically a scatter operation. Is there any issue with the above implementation? Is CUDA going to take care of race conditions when multiple threads write to global memory?

If you have multiple threads writing to the same location without any sort of prevention, yes, that’s a race condition. There’s no transactional memory or anything like that.
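One common form of "prevention" is an atomic operation. Here is a minimal sketch of a scatter-add using atomicAdd, not the original kernel (its body is not shown above). The names scatter_add_atomic, run_scatter_add_atomic, target and value are made up for illustration, and the float overload of atomicAdd on global memory assumes a device of compute capability 2.0 or later.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: thread tid adds `value` to data[target[tid]].
// Several threads may share the same target index.
__global__ void scatter_add_atomic(float *data, const int *target,
                                   float value, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // atomicAdd does the read-modify-write as one indivisible step,
        // so concurrent updates to the same address are not lost.
        atomicAdd(&data[target[tid]], value);
    }
}

// Hypothetical host helper: all n_threads threads hit the same slot
// (the worst case for contention). Returns the final value of that slot.
float run_scatter_add_atomic(int n_threads)
{
    const int n_slots = 16;
    const int slot = 3;
    std::vector<int> h_target(n_threads, slot);

    float *d_data = nullptr;
    int *d_target = nullptr;
    cudaMalloc(&d_data, n_slots * sizeof(float));
    cudaMalloc(&d_target, n_threads * sizeof(int));
    cudaMemset(d_data, 0, n_slots * sizeof(float));  // all-zero bytes == 0.0f
    cudaMemcpy(d_target, h_target.data(), n_threads * sizeof(int),
               cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid = (n_threads + block - 1) / block;
    scatter_add_atomic<<<grid, block>>>(d_data, d_target, 1.0f, n_threads);

    float result = 0.0f;
    // This blocking copy waits for the kernel to finish.
    cudaMemcpy(&result, d_data + slot, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_target);
    return result;
}
```

Calling run_scatter_add_atomic(1024) should give 1024, since each of the 1024 threads contributes exactly 1.0f. The trade-off is that atomics on a heavily contended address serialize, so this costs some speed when many threads collide.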

Unfortunately there’s no race condition detection offered by NVIDIA’s tools. There is a paper about this and related project info at http://www.cs.virginia.edu/~mwb7w/cuda/.

Thanks for the reply. Basically you are saying that if two threads modify the same global memory location, the results will be unpredictable. Is this correct?

Notice that in my kernel I used data[idx]+=…, which is an accumulation. Does this get around the issue? Also, if multiple threads modify a single address in global memory, is there a lock in place during the transaction to force the modifications to happen sequentially?

The reason I ask is that I used this in my program and found the results are fine. There are 1024 threads writing (accumulating) randomly to a global memory of size 606060. I want to understand whether it worked by accident or by design.
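One way to check "by accident or by design" is to force the collisions instead of hoping for them. With 1024 threads spread randomly over 606060 locations, two threads presumably land on the same address only rarely, so a race could easily go unnoticed. The sketch below makes every thread update a single address, once with a plain += and once with atomicAdd. The names add_plain, add_atomic and run_single_address are hypothetical, and n_threads is assumed to be a multiple of 256.

```cuda
#include <cuda_runtime.h>

// Unprotected accumulation: += is a separate load, add and store,
// so another thread can overwrite the value in between.
__global__ void add_plain(float *sum)
{
    *sum += 1.0f;
}

// Same accumulation done as one indivisible read-modify-write.
__global__ void add_atomic(float *sum)
{
    atomicAdd(sum, 1.0f);
}

// Hypothetical helper: launches n_threads threads that all update one
// address and returns the final value. Assumes n_threads % 256 == 0.
float run_single_address(bool use_atomic, int n_threads)
{
    float *d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemset(d_sum, 0, sizeof(float));  // all-zero bytes == 0.0f

    const int block = 256;
    const int grid = n_threads / block;
    if (use_atomic) {
        add_atomic<<<grid, block>>>(d_sum);
    } else {
        add_plain<<<grid, block>>>(d_sum);
    }

    float result = 0.0f;
    // This blocking copy waits for the kernel to finish.
    cudaMemcpy(&result, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return result;
}
```

run_single_address(true, 1024) should return 1024. The plain version has no guaranteed result; it typically comes out well below 1024 because updates get lost, which suggests that correct results from the unprotected kernel are down to collisions being rare rather than to any locking.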