atomicAdd

DarkAr · September 8, 2008, 3:35pm

Hi

Is there any way to emulate atomicAdd on a GeForce 8800 GTX (compute capability 1.0)?

What I really need is a consistent way to add to one memory location from different blocks of the grid (not from different threads, because only one thread per block, thread 0, writes to this location).

Currently I’m running the kernel in a loop with a grid dimension of 1 and calling cudaThreadSynchronize after each launch, something like this:

for (int i = 0; i < 1024; i++)
{
    Kernel<<<1, 512>>>(somememory);
    cudaThreadSynchronize();
}

What I want is:
Kernel<<<1024, 512>>>(somememory);
plus some way to add values to ‘somememory’ consistently ;)
Any ideas?

E.D_Riedijk · September 8, 2008, 6:09pm

Output your values to an array of size 1024 and perform a reduction afterwards. This not only works on all hardware, it is also often faster.
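A minimal sketch of that idea, assuming a current CUDA toolkit: thread 0 of each block writes to its own slot, so no two blocks touch the same address and no atomics are needed. The names `writePartials` and `sumOverBlocks` and the per-block value are hypothetical placeholders, error checking is omitted, and the reduction is done on the host only for brevity; a device-side reduction would serve the same purpose.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: thread 0 of each block stores the block's value
// in a slot owned by that block alone.
__global__ void writePartials(unsigned int *partials)
{
    if (threadIdx.x == 0) {
        // Placeholder for the real per-block result.
        partials[blockIdx.x] = blockIdx.x % 3 + 1;
    }
}

// Launches all blocks at once, then reduces the per-block values on the host.
unsigned int sumOverBlocks(int numBlocks, int threadsPerBlock)
{
    unsigned int *dPartials = nullptr;
    cudaMalloc(&dPartials, numBlocks * sizeof(unsigned int));

    writePartials<<<numBlocks, threadsPerBlock>>>(dPartials);

    // cudaMemcpy waits for the kernel to finish before copying.
    std::vector<unsigned int> partials(numBlocks);
    cudaMemcpy(partials.data(), dPartials,
               numBlocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dPartials);

    unsigned int total = 0;
    for (unsigned int v : partials) total += v;
    return total;
}
```

Calling `sumOverBlocks(1024, 512)` replaces the 1024 single-block launches with one launch and one copy.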

DarkAr · September 9, 2008, 7:27am

I cannot; subsequent blocks need the old value. The logic is something like this:

__global__ void fooKernel(unsigned int *memCNT, float *memOUT)
{
    // calculate some kernel-based value called ‘count’;
    // this is the number of floats the block needs to grab from the buffer memOUT
    __shared__ unsigned int idx;

    if (threadIdx.x == 0)
    {
        // reserve ‘count’ floats; atomicAdd returns the old counter value
        idx = atomicAdd(memCNT, count);
    }
    __syncthreads();

    // now all threads of the block write their results into the ‘allocated’ memory fragment
    memOUT[idx + something] = someCalculatedValues;
}

Each block has 512 threads and the number of blocks is >= 1024.

Is there any chance of running something like this on 1.0 hardware?

I cannot make ‘count’ constant because that leads to a huge waste of memory (assuming ‘count’ at its maximum for every block would require 1 GB of memory for ‘memOUT’). Most of the time ‘count’ is 1 or 2; in some situations it is 100 or 1000, but those are rare.

Running block after block and syncing in between is slower than doing the whole job on the CPU, but running all the blocks at once on an 8800 GT (which is 1.1) is about 20 times faster than an optimized CPU solution, so I’m hunting for a solution that works on the 8800 GTX too ;)

Reimar · September 9, 2008, 8:32am

Well, if it is 20 times faster, you can run the kernel once to calculate the count values for each block, then use a prefix sum to convert them into the indices you need, and use those in a second, real kernel call. You would still be 10 times faster (even assuming you have to do the full work in both kernels, which is unlikely).

It might also make sense to, e.g., use a small but constant “count” for the first kernel, store the results that fit right away, and in the second pass re-run only the blocks whose output did not fit.
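A rough sketch of the fixed-slot variant, under stated assumptions: `SLOT`, `blockCount`, `firstPass` and `findOverflowBlocks` are hypothetical names, the count formula and payload are placeholders, and error checking is omitted. Every block records its count; blocks whose output fits in `SLOT` floats write at once, and the host collects the ids of the blocks that still need a second run.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define SLOT 4u   // assumed fixed capacity (in floats) reserved per block

// Hypothetical stand-in for the real per-block 'count'.
__device__ unsigned int blockCount(unsigned int block)
{
    return (block % 100 == 0) ? 100u : 1u + block % 2u;
}

// First pass: record the count; write immediately if the output fits.
__global__ void firstPass(unsigned int *counts, float *slots)
{
    unsigned int count = blockCount(blockIdx.x);
    if (threadIdx.x == 0) counts[blockIdx.x] = count;
    if (count <= SLOT) {
        for (unsigned int i = threadIdx.x; i < count; i += blockDim.x)
            slots[blockIdx.x * SLOT + i] = (float)blockIdx.x;  // placeholder payload
    }
}

// Returns the ids of the blocks whose output did not fit in their slot.
std::vector<int> findOverflowBlocks(int numBlocks, int threadsPerBlock)
{
    unsigned int *dCounts = nullptr;
    float *dSlots = nullptr;
    cudaMalloc(&dCounts, numBlocks * sizeof(unsigned int));
    cudaMalloc(&dSlots, numBlocks * SLOT * sizeof(float));

    firstPass<<<numBlocks, threadsPerBlock>>>(dCounts, dSlots);

    std::vector<unsigned int> counts(numBlocks);
    cudaMemcpy(counts.data(), dCounts,
               numBlocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dCounts);
    cudaFree(dSlots);

    std::vector<int> overflow;
    for (int b = 0; b < numBlocks; ++b)
        if (counts[b] > SLOT) overflow.push_back(b);
    return overflow;
}
```

Since large counts are rare, the overflow list should stay short, and only those blocks need a second launch with properly sized storage.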

E.D_Riedijk · September 9, 2008, 10:05am

Indeed, I also have a kernel where each thread outputs between 0 and (total_amount_of_threads/2) values. I also first calculate how many outputs will be needed and store them in a num_out array.
Then I scan that array to get a new array, base_index.

Then I have the threads actually perform the output num_out times, starting at base_index.

It takes 3 kernel calls in total but uses only 1 extra array, and performance is still really good.
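The count, scan, output sequence can be sketched roughly as follows, applied per block as in the original question. This is an illustration under assumptions, not the code from either post: `blockCount`, `countKernel`, `outputKernel` and `compactOutput` are hypothetical names, the payload is a placeholder, error checking is omitted, and the exclusive scan runs on the host for brevity (a device-side scan such as Thrust's `thrust::exclusive_scan` could take its place).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the real per-block 'count'.
__device__ unsigned int blockCount(unsigned int block)
{
    return (block % 100 == 0) ? 100u : 1u + block % 2u;
}

// Step 1: each block records how many floats it will write.
__global__ void countKernel(unsigned int *numOut)
{
    if (threadIdx.x == 0) numOut[blockIdx.x] = blockCount(blockIdx.x);
}

// Step 3: each block writes into the range starting at its scanned offset.
__global__ void outputKernel(const unsigned int *baseIndex, float *memOUT)
{
    unsigned int idx = baseIndex[blockIdx.x];
    unsigned int count = blockCount(blockIdx.x);
    for (unsigned int i = threadIdx.x; i < count; i += blockDim.x)
        memOUT[idx + i] = (float)blockIdx.x;  // placeholder payload
}

// Runs count -> exclusive scan -> output and returns the packed buffer.
std::vector<float> compactOutput(int numBlocks, int threadsPerBlock)
{
    unsigned int *dNumOut = nullptr, *dBase = nullptr;
    size_t bytes = numBlocks * sizeof(unsigned int);
    cudaMalloc(&dNumOut, bytes);
    cudaMalloc(&dBase, bytes);
    countKernel<<<numBlocks, threadsPerBlock>>>(dNumOut);

    // Step 2: exclusive prefix sum turns counts into start offsets.
    std::vector<unsigned int> counts(numBlocks), base(numBlocks);
    cudaMemcpy(counts.data(), dNumOut, bytes, cudaMemcpyDeviceToHost);
    unsigned int total = 0;
    for (int b = 0; b < numBlocks; ++b) { base[b] = total; total += counts[b]; }
    cudaMemcpy(dBase, base.data(), bytes, cudaMemcpyHostToDevice);

    // The output buffer is sized exactly, so nothing is wasted.
    float *dOut = nullptr;
    cudaMalloc(&dOut, total * sizeof(float));
    outputKernel<<<numBlocks, threadsPerBlock>>>(dBase, dOut);

    std::vector<float> out(total);
    cudaMemcpy(out.data(), dOut, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dNumOut); cudaFree(dBase); cudaFree(dOut);
    return out;
}
```

Because every offset is known before the output kernel starts, no block depends on another block's counter update, which is why this runs on hardware without atomics.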
