Can we use "AtomicAdd()" with GTX 8800? Any other option to do same thing...?

preetib · December 7, 2007, 11:07am

Hi,

We are using a GeForce 8800 GTX.
Can we use “AtomicAdd()” in our code?

If not, how can we implement the same functionality?

Thanks in advance. :)

seibert · December 7, 2007, 1:50pm

The 8800 GTX (and GTS) do not support any atomic functions. I’ve seen hacks posted in the forum to work around this, but it would be safest to avoid them.
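For anyone who wants to check this at run time rather than by card name, here is a minimal sketch. It assumes the CUDA runtime API (`cudaGetDeviceProperties`) and the programming guide's rule that 32-bit integer atomics on global memory need compute capability 1.1 or higher, while `atomicAdd()` on `float` in global memory needs 2.0; the 8800 GTX reports 1.0. The helper names are made up for illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: 32-bit integer atomics on global memory
// (e.g. atomicAdd on int) need compute capability >= 1.1.
bool hasGlobalInt32Atomics(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 1);
}

// Hypothetical helper: atomicAdd() on float in global memory needs >= 2.0.
bool hasGlobalFloatAtomicAdd(int major, int /*minor*/)
{
    return major >= 2;
}

// Prints what device `dev` supports; returns false if the query fails.
bool reportAtomicSupport(int dev)
{
    assert(dev >= 0);
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return false;
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("  int atomics (global):     %s\n",
           hasGlobalInt32Atomics(prop.major, prop.minor) ? "yes" : "no");
    printf("  float atomicAdd (global): %s\n",
           hasGlobalFloatAtomicAdd(prop.major, prop.minor) ? "yes" : "no");
    return true;
}
```

Calling `reportAtomicSupport(0)` from `main` prints the result for the first device.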

preetib · December 10, 2007, 12:10pm

Thanks for the information.

But can you tell me what I should do to implement the same functionality without using the atomic APIs?

seibert · December 10, 2007, 12:12pm

You would have to explain what you are trying to do with the atomic add for anyone to suggest an alternative.

preetib · December 12, 2007, 6:11am

In my kernel, all threads first do some calculation and each gets a result. I then want to add all the threads’ results into a single variable to calculate the average.

How can I implement this without using the AtomicAdd() API?

DenisR · December 12, 2007, 6:55am

Check the reduction example. It calculates the sum of a vector. If you adapt it to know the number of elements, you can divide by that number while adding them up and end up with the average.

Eri_Rubin · December 12, 2007, 7:31am

The only way to collapse the results is something like the reduction example, which requires multiple kernel calls (each time you need to exit the kernel to make sure all the blocks are synced).
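A minimal sketch of that multi-launch pattern, assuming a power-of-two block size of 256 and plain float input. The names `blockSum` and `sumOnGpu` are illustrative, not taken from the SDK sample, and error checking is left out for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block; must be a power of two

// Each block sums up to BLOCK consecutive elements and writes one partial sum.
__global__ void blockSum(const float *in, float *out, unsigned int n)
{
    __shared__ float s[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;      // pad the tail with zeros
    __syncthreads();
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

// Host driver: relaunch until one value is left. Returns the sum of h_in[0..n).
float sumOnGpu(const float *h_in, int n)
{
    assert(n > 0);
    int maxBlocks = (n + BLOCK - 1) / BLOCK;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, maxBlocks * sizeof(float));
    cudaMemcpy(d_a, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    float *src = d_a, *dst = d_b;
    while (n > 1) {
        int blocks = (n + BLOCK - 1) / BLOCK;
        // Launches in the same stream run in order, so each kernel boundary
        // acts as the grid-wide sync that blocks cannot do among themselves.
        blockSum<<<blocks, BLOCK>>>(src, dst, n);
        float *t = src; src = dst; dst = t;   // ping-pong the buffers
        n = blocks;
    }

    float result;
    cudaMemcpy(&result, src, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

Dividing the returned sum by the original element count on the host gives the average.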

preetib · December 12, 2007, 10:09am

Where can I find the “reduction example”?

MisterAnderson42 · December 12, 2007, 12:51pm

http://developer.download.nvidia.com/compute/cuda/1_1/Website/Data-Parallel_Algorithms.html#reduction

Or just look in your CUDA SDK directory.

DenisR · December 12, 2007, 2:33pm

You can also do it with just one kernel call, something like:

__shared__ float temp_array[NUM_THREADS];

temp_array[threadIdx.x] = 0.0f;

for (unsigned int offset = 0; offset < number_of_calculations; offset += NUM_THREADS)
    temp_array[threadIdx.x] += some_calculation(offset + threadIdx.x);

// now comes the reduction code from the example

output = temp_array[0];

preetib · December 13, 2007, 10:24am

Your suggestion seems useful, but I don’t quite follow it.

What should I put in place of “// now comes…”?

Can you please explain in more detail?

Thanks :)

DenisR · December 13, 2007, 10:54am

__syncthreads();

// do reduction in shared mem
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        temp_array[tid] += temp_array[tid + s];
    }
    __syncthreads();
}

tid is threadIdx.x (you should assign unsigned int tid = threadIdx.x; at the beginning of your kernel).

This is not the fastest version, as you will have seen in the reduction example, so you can replace this part with a faster version from the example.

Eri_Rubin · December 13, 2007, 11:55am

Sorry for not being clear. The reduction only works at the level of the block, meaning that if you have 1000 elements and your block size is 256, then by the end of the first reduction you will still have 4 elements instead of one. If you have a much bigger number of elements, you might need more kernel calls just for the reduction. What I did to minimize the overhead of launching multiple kernels (which can be significant) is run the first reduction at the end of the kernel that is already running, and the last reduction at the beginning of the next kernel. (Note that for this you are doing the same calculation in each block, so there are some extra calculations, but I found that it's faster than launching another kernel.)
If my data set is big, I run some more reduction kernels in between.

;)
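To put numbers on the "1000 elements, block size 256" case, here is a small host-side sketch of the arithmetic. It assumes each block reduces exactly `blockSize` elements per pass; both function names are hypothetical.

```cuda
#include <cassert>

// Number of partial sums left after one block-level reduction pass.
int partialsAfterPass(int n, int blockSize)
{
    assert(n > 0 && blockSize > 1);
    return (n + blockSize - 1) / blockSize;   // ceil(n / blockSize)
}

// Number of passes (kernel launches, if not fused into neighbouring
// kernels) needed until a single value remains.
int passesNeeded(int n, int blockSize)
{
    int passes = 0;
    while (n > 1) {
        n = partialsAfterPass(n, blockSize);
        ++passes;
    }
    return passes;
}
```

With 1000 elements and 256 threads per block, the first pass leaves 4 partial sums and a second pass finishes the job; about a million elements would take three passes.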

DenisR · December 13, 2007, 2:10pm

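__shared__ float temp_array[NUM_THREADS];
unsigned int tid = threadIdx.x;

// each thread accumulates its own strided share of the calculations
// (assumes number_of_calculations is a multiple of NUM_THREADS, or that
// some_calculation copes with indices past the end)
temp_array[tid] = 0.0f;
for (unsigned int offset = 0; offset < number_of_calculations; offset += NUM_THREADS)
    temp_array[tid] += some_calculation(offset + tid);
__syncthreads();

// do reduction in shared mem (blockDim.x must be a power of two)
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        temp_array[tid] += temp_array[tid + s];
    }
    __syncthreads();
}

// thread 0 writes this block's result
if (tid == 0)
    output_array[blockIdx.x] = temp_array[0];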

I use this in my kernel, where each block calculates one element in an output array. Here I make sure that all elements of my reduction (number_of_calculations) are being processed in 1 block, so I do not need multiple kernel invocations.
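For the original question (one average over all results), the same idea can be sketched as a complete single-block program. This is only an illustration under a few assumptions: `some_calculation` is replaced by a plain read from an input array, NUM_THREADS is a power of two, and the names `averageOneBlock` and `averageOnGpu` are made up. Error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define NUM_THREADS 256   // must be a power of two for the halving loop

// Launch with exactly one block of NUM_THREADS threads.
__global__ void averageOneBlock(const float *in, unsigned int n, float *avg)
{
    __shared__ float temp_array[NUM_THREADS];
    unsigned int tid = threadIdx.x;

    // Each thread accumulates a strided slice of the input; the i < n
    // test handles sizes that are not a multiple of NUM_THREADS.
    float sum = 0.0f;
    for (unsigned int i = tid; i < n; i += NUM_THREADS)
        sum += in[i];
    temp_array[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = NUM_THREADS / 2; s > 0; s >>= 1) {
        if (tid < s) temp_array[tid] += temp_array[tid + s];
        __syncthreads();
    }

    if (tid == 0) *avg = temp_array[0] / n;   // sum -> average
}

// Host wrapper: copies the data over, runs one block, returns the average.
float averageOnGpu(const float *h_in, unsigned int n)
{
    assert(n > 0);
    float *d_in, *d_avg, result;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_avg, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    averageOneBlock<<<1, NUM_THREADS>>>(d_in, n, d_avg);
    cudaMemcpy(&result, d_avg, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_avg);
    return result;
}
```

A single block keeps everything in one launch, but it uses only one multiprocessor, so it suits small inputs or the case where many blocks each produce their own output element.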

Neeraj_Kulkarni · January 2, 2008, 1:47am

I have been working on similar problems for the last few months.

I have programmed a few “hacks” that work for atomic computations at block level.

However, the streaming architecture of the GPU is not designed for such constructs, which can lead to deadlocks!

I have already posted ways of doing this; they rely on spin loops plus global writes:

http://forums.nvidia.com/index.php?showtopic=44144

Read from Memory

Work in parallel

Reduce in parallel (for threads within a single MP)

Reduce serially using the modified programming constructs.

Reduction + Block level synchronization + Memory optimization = very high performance gains
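The four steps can be sketched roughly as follows. Note the swap: instead of the spin-loop constructs from the linked thread, this sketch does the final serial reduction on the host, which avoids the deadlock risk at the cost of a small copy. The "work" step (squaring each element) and all names are placeholders.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 128   // threads per block; must be a power of two

// Steps 1-3: read, work, and reduce in parallel within each block.
__global__ void squareAndBlockSum(const float *in, float *partial, unsigned int n)
{
    __shared__ float s[THREADS];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;
    float x = (i < n) ? in[i] : 0.0f;   // step 1: read from memory
    s[tid] = x * x;                      // step 2: placeholder "work"
    __syncthreads();
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];   // step 3
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = s[0];
}

// Step 4, done serially on the host over one partial sum per block.
float sumOfSquares(const float *h_in, unsigned int n)
{
    assert(n > 0);
    unsigned int blocks = (n + THREADS - 1) / THREADS;
    float *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    squareAndBlockSum<<<blocks, THREADS>>>(d_in, d_partial, n);

    std::vector<float> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (unsigned int b = 0; b < blocks; ++b) total += h_partial[b];
    return total;
}
```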

In short, use the constructs as tools for getting around the problem, not as a concrete reference!

I hope this helps

Cheers,

Neeraj
