Is there any way to emulate atomicAdd on a GF8800 GTX (compute capability 1.0)?
What I really need is to add consistently to one memory location from different grid blocks (not threads, because only one thread per block (thread 0) writes
to this memory location).
Currently I'm running kernels in a loop with a grid dimension of 1 and calling
cudaThreadSynchronize() after each launch, something like this:
for (int i = 0; i < 1024; i++)
{
    Kernel<<<1, 512>>>(somememory);
    cudaThreadSynchronize();
}
What I want is:
Kernel<<<1024, 512>>>(somememory);
and some way to add values to 'somememory' consistently ;)
Any ideas?
Well, if the single-launch version is 20 times faster, you can run one kernel to calculate the count values for each block, then use a prefix sum to convert them into the indexes you need, and use those in the second, real kernel call; you are still 10 times faster (even assuming you have to do the full work in both kernels, which is unlikely).
It might also make sense to, e.g., use a small but constant "count" for the first kernel, store right away the outputs that fit, and only re-run the blocks whose output did not fit the first time.
Indeed. I also have a kernel where each thread outputs between 0 and (total_amount_of_threads/2) values. I first calculate how many outputs each thread will need → a num_out array.
Then I scan that array to get a new base_index array.
Then I have the threads actually perform the output, num_out values each, starting at base_index.
It takes 3 kernel calls in total, but it uses only one extra array, and performance is still really good.