Can we use "AtomicAdd()" with GTX 8800? Any other option to do same thing...?


We are using GTX 8800.
Can we use “AtomicAdd()” in our code?

If not, then what should we do to implement same functionality??

Thanks in advance. :)

the 8800 GTX (and GTS) do not support any atomic functions. I’ve seen hacks posted in the forum to work around this, but it would be safest to avoid them.

Thanks for the information.

But can u tell me what should I do to implement the same functionality without using Atomic APIs.

You would have to explain what you are trying to do with the atomic add for anyone to suggest an alternative.

In my kernel application, first all threads do some calculation and get some results and then I want to add all threads’ result in single variable to calculate average result.

So what to do to implement this thing without using AtomicAdd() API?

Check the reduction example. It calculates the sum of a vector. If you adapt it to know the number of elements, you can already divide by the amount of elements when adding them up and end up with the average.

the only way to collapse the result is some thing like the reduction example, which requires multiple kernel calls (each time u need to exit to make sure all the blocks are synced)

From where can I find “reduction example”?…html#reduction

Or just look in your CUDA SDK directory.

You can also do it with just 1 kernel call something like:

__shared__ temp_array[NUM_THREADS]

temp_array[threadIdx.x] = 0.0f;

for(unsigned int offset = 0; offset < number_of_calculations; offset +=NUM_THREADS)

    temp_array[threadIdx.x] += some_calculation(offset + threadIdx.x)

// now comes the reduction code from the example

output = temp_array[0];

Your suggestion seems to be useful. But I am not getting it.

What should I put in place of “// now comes…”?

Can u please explain in some detail?

Thanks :)


   // do reduction in shared mem

    for(unsigned int s=blockDim.x/2; s>0; s>>=1) {

        if (tid < s) {

            temp_array[tid] += temp_array[tid + s];




tid is threadIdx.x (you should assign unsigned int tid = threadIdx.x; in the beginning of your kernel.

But this is not the fastest version as you have seen in the reduction example, so you can replace this part with a faster version from the example

sorry for not being clear, the reduction only works to the level of the block, meaning that if you have 1000 elements and your block size is 256 then buy the end of the first reduction you will still have 4 elements instead of one. if you have a much bigger number of elements you might need more kernel calls just of the reduction. What i did to minimize the overhead of launching multiple kernels (which can be significant) is run the first reduction with at the end of the execution of the running kernel. and the last reduction at the begging of the next kernel (notice for this you are doing the same calculation in each block, so that some extra calculations. but i found that its faster then launching another kernel ).
and if my data set is big i run some more reduction kernels in between.


__shared__ float temp_array[NUM_THREADS];

unsigned int tid = threadIdx.x;

temp_array[tid] = 0.0f;

for(unsigned int offset = 0; offset < number_of_calculations; offset +=NUM_THREADS)

   temp_array[tid] += some_calculation(offset + tid);


// do reduction in shared mem

for(unsigned int s=blockDim.x/2; s>0; s>>=1) {

   if (tid < s) {

       temp_array[tid] += temp_array[tid + s];




output_array[blockIdx.x] = temp_array[0];

I use this in my kernel, where each block calculates one element in an output array. Here I make sure that all elements of my reduction (number_of_calculations) are being processed in 1 block, so I do not need multiple kernel invocations.

I have been working on similar problems for the last few months.

I have programmed a few “hacks” that work for atomic computations at block level.

However the streaming architecture of the GPU is not designed for such constructs leading to possible dead locks!

I have already posted ways of doing this and is achieved by spin - loops + global writes!

Read from Memory

Work in parallel

Reduce in parallel (for threads within a single MP)

Reduce serially using the modified programming constructs.

Reduction + Block level synchronization + Memory optimization = very high performance gains

In short,Use the constructs as tools of getting around the problem not as a concrete reference!

I hope this helps