How can I control concurrent write access to global memory?


I have some code like this:

Host code:

unsigned int* RES; cudaMalloc((void**)&RES, 11 * sizeof(int));

Device code:

int index;


if(result == interesting)




  if(index<10)				/* limit to 10 results */


	RES[index + 1]=somedata;



I run billions of threads (~1M threads per grid) and usually there is less than one “interesting” result per one billion threads but it rare casses there are two (or more) “interesting” results close together and the code above might write only one of them into the RES array (overwriting the results).

I think this could be avoided with atomic intructions? But this won’t work on G80, right?

Any suggestions?


Atomic operations would indeed solve your problem. They would work on the vast majority of Cuda cards - the ONLY exceptions are the 8800 GTX, the 8800 GTS, and the first generation Tesla (and maybe one of the Quadro series). Other cards such as the 8800 GT or the 8600 GTS have compute capability 1.1, and so can do atomic operations.

For the G80 corner case, you could do a reduction of an array of per-warp counters, which would consume about 128KB for 1M threads (1M threads * 4 bytes/thread / 32 threads/warp). You could then use the data in the counter array to “allocate” space for the data output of each interesting result, and extract the data from a second run from the same input. To get the per-warp counts, you’d use a warp size reduction, which wouldn’t need either atomics or syncthreads since the threads in a warp run in lockstep.

Another thing to consider: would missing an interesting result be catastrophic? I say this because it looks like the chance of a random collision actually occurring are vanishingly small, probably conservatively on the order of 1 in 10 million interesting results if the results are truly randomly spread…

This is generally a bad thing to try and rely upon!