Is there any way to emulate atomicAdd on a GeForce 8800 GTX (compute capability 1.0)?

What I really need is to add consistently to one memory location from different grid items (not threads, because only one thread per block (thread 0) writes to this memory location).

Currently I'm running kernels in a loop with a grid dimension of 1 and calling cudaThreadSynchronize() after each launch, something like this:

for (int i = 0; i < 1024; i++) {
    Kernel<<<1, 512>>>(somememory);
    cudaThreadSynchronize();
}

What I want is:

Kernel<<<1024, 512>>>(somememory);

with some way to add values to 'somememory' and stay consistent ;) Any ideas?

Output your values to an array of size 1024 and afterwards perform a reduction. This not only works on all hardware, it is also often faster.

I cannot; subsequent blocks need the old value. The logic is something like this:

__global__ void fooKernel(unsigned int *memCNT, float *memOUT)
{
    // calculate some kernel-based value called 'count' -- this is
    // actually the number of floats that we need to grab from buffer memOUT

    __shared__ unsigned int idx;

    if (threadIdx.x == 0)
        idx = atomicAdd(memCNT, count);
    __syncthreads();

    // now all threads of the block output their stuff at the
    // 'allocated' memory fragment
    memOUT[idx + something] = someCalculatedValues;
}


Each block has 512 threads and the number of blocks is >= 1024.

Is there any chance to run something like this on 1.0 hardware?

I cannot make 'count' constant because that leads to huge memory waste (assuming count at its maximum for each block would require 1 GB of memory for 'memOUT'). Most of the time 'count' is 1 or 2; in some situations it is 100 or 1000, but those cases are rare.

Running block after block and syncing in between is slower than doing the whole job on the CPU, but running all the blocks at once on an 8800 GT (which is compute capability 1.1) is about 20 times faster than the optimized CPU solution, so I'm hunting for a solution that works on the 8800 GTX too ;)

Well, if it is 20 times faster, you can run the kernel once to calculate the count values for each block, then use a prefix sum to convert them into the indices you need, use those in the second, real kernel call, and you are still 10 times faster (even assuming you have to do the full work in both kernels, which is unlikely).

It might also make sense to e.g. use a small but constant 'count' for the first kernel, store the outputs that fit right away, and re-run only the blocks that had parts that did not fit in a second pass.

Indeed, I also have a kernel where each thread outputs between 0 and (total_amount_of_threads/2) values. I also first calculate how many outputs will be needed -> a num_out array.
Then I scan that array to get a base_index array.

Then I have the threads actually perform the output num_out times, starting at base_index.

It takes 3 kernel calls in total, but it uses only 1 extra array, and performance is still really good.