AtomicAdd algorithm

Hi all around here.
My card is GTS260 and compute capability is 1.0, doesn’t support Atomic function.
But in my program, I need to use atomic add function, so if anybody know the algorithm of atomic Add function or some suggestions, please help me.
thank you very much.

There is no card called a GTS260. There’s a GTX260 (compute 1.3), and a GTS250 (compute 1.1), though. Both support global atomics. The GTX260 also supports shared memory atomics.

You can, with annoyance, do atomic operations in shared memory even in archaic compute 1.0 hardware. It’s not efficient but it does work.

The basic technique is to designate a shared variable for voting. Every thread that wants to access an atomic writes its thread ID into this one variable. You syncthreads,

then look at the variable’s contents to see which thread “won” the write battle. Then that thread is treated as owning an exclusive lock, you can do any operations you like with it.

Then the loop is repeated until every thread that wants to do its atomic action has had its chance.

This method doesn’t work for global memory atomics, though. Depending on your application, you can often use multiple kernel launches, with compaction or reduction kernels inbetween, depending on what you’re using the atomics for.

Oh, You are right. My card is GTS 250.

Thank for your suggestion.

This is truly evil.

Can you be sure that every thread will eventually access this shared lock? I mean it is entirely possible that a single thread will win every time. Unless you additionally store thread-local flags of “I’ve been served already” to disable served threads from the race?

Oh, you bet it’s evil. But it has been useful in the past. You never do this for performance, you do this for situations like one thread finding a rare exceptional case and it needs to hold everybody back while it rearranges the whole block’s work lists or whatever. If the number of threads that need the lock is small, the performance actually isn’t even too bad.

I posted a thread on this a year ago. I’m sure it could be dug up.

You’re exactly right that you need to have each thread keep track of whether it’s had its turn or not.

Here’s some old example code. It uses a floating point add as an example, though the point is that it could be any operation you’d like.

A floating point add isn’t best done this way (just use a reduction!) but it’s just an example for the technique, not the result.

// Shared atomic operations in CUDA example 

// Steve Worley  July 2008. This code is released to the public domain.

#include <stdlib.h>

#include <stdio.h>

#include <cutil.h>

// Shared atomic operations in CUDA: a hack that works with all GPUs,

// even Device 1.0, and also allows any operation including large

// structures, doubles, multiple swaps, stacks, queues, etc.

// This is a fragile hack. It uses syncthreads()! So it's not something

// to mix with divergent warps.

// Basic strategy: use the fact that a shared write to one address  by

// multiple threads will have a single unique winner. Each thread writes

// its ID, and checks (after a sync) if it was the winner. If so, we've

// done a successful vote for a single thread and that one thread can do

// any operation it likes atomically.. remaining threads know they must

// wait. 

// Repeat the vote until every thread that wants to do an atomic

// operation has its turn. Inefficient but quite useful!


// volatile keywords likely not necessary, but CUDA 1.0 didn't like it

__device__ float SharedAtomicAdd(volatile float *address, 

								 float add, volatile int *sharedTemp)


  int tid=threadIdx.x; // for this example we assume a 1D block size

  float readVal;

  bool needToWrite=true; // is this thread waiting for its turn?

*sharedTemp=0xFFFFFFFF; // default, NO thread needs atomic

for (;;) {


	// if we have a job, flag  our interest!

	if (needToWrite) *sharedTemp=tid; 


	if (0xFFFFFFFF==*sharedTemp) {

	  //no thread tried to write a value, so everyone is done.

	  return readVal;



	if (needToWrite && *sharedTemp==tid) { // we won the vote!


	  // MY ATOMIC OPERATION STARTS HERE. It can be any code you like!

	  // We are guaranteed that only OUR thread is currently executing.





	  *sharedTemp=0xFFFFFFFF; // prepare for next loop




  return 0.0f; // can't reach here


__global__ void testKernel(float* g_odata) 


  extern __shared__ volatile int shareMem[];

  volatile int *sharedTemp=shareMem+1;

  volatile float *accumulator=(volatile float *)(shareMem+0);

  if (0==threadIdx.x) *accumulator=0.0f; // intialize to 0;


// have each thread add 1.0 three times. The net accumulated

  // sum should be 3.0*threadcount.

  for (int i=0; i<3; i++) 

	  SharedAtomicAdd(accumulator, 1.0f, sharedTemp);


  if (0==threadIdx.x) g_odata[0]=*accumulator; // result for host


int main( int argc, char** argv) 


  CUT_DEVICE_INIT(argc, argv);

// allocate device memory for result

  int mem_size=sizeof(float);

  float* d_odata;

  CUDA_SAFE_CALL( cudaMalloc( (void**) &d_odata, mem_size));

// allocate mem for the result on host side

  float* h_odata = (float*) malloc(mem_size);

for (int w=1; w<35; w++) {

	// setup execution parameters

	dim3  grid( 1, 1, 1);

	// try a variety of thread counts, whole & partial warps

	dim3  threads( 16*(w>>1)+5*(w&1), 1, 1); 


	// execute the kernel

	testKernel<<< grid, threads, 2*sizeof(float) >>>(d_odata);


	// check if kernel execution generated an error

	CUT_CHECK_ERROR("Kernel execution failed");


	// copy result from device to host

	CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, mem_size,

								cudaMemcpyDeviceToHost) );


	// Result is 3.0*thread count, each thread did 3 atomic adds of 1.0.

	printf("Output value: %d =%f\n", 3*threads.x, h_odata[0]);


  // cleanup memory

  free( h_odata);



Thank you Mr SPWorley.
Your code works fine.
For easy to read your code, I have edited and repost for someone, who is may need it.
But, I worry about the effective of this code.
I tested with 1block, and this block has 512 threads in my Geforce 8800GT.
the result was correctly.
I used cudavisual profiler to evaluate your code.
the branch, divergent branch, instructions and warp serialize counters were too high.
branch: 73246
divergent branch: 7743
instructions : 137921
warp serialize: 115503

Thank you very much. :) (1.52 KB)

For my own purpose (as I have a Geforce 8800 GT Compute Capability 1.0), I developped another algorithm:

  • each thread on a warp is writing on a shared memory byte, to indicate if it needs to be served for atomic operation (or not)
  • then each warp (using one thread) writes to global memory the number of atomic operation needed to process (and parameters) to indicate that it needs to be served
  • finally one thread on one warp serves atomic operations on a round-robin basis.


  • we aggregate threads demands within shared memory
  • we only write global memory if one thread (or more)of local memory needs to be served
  • with round-robin we ensure maximal service time

Another optimization is that I render atomic operations ASYNCHRONOUS, with a query, polled this way, and the result pulled back to shared memory.
And in the meantime each thread are not blocked, but continues to compute data.
And that is totally valuable, even with Compute Device 1.1+, because lack of registers may limit the number of warps you may launch per MultiProcessor, and finally you will end-up with many blocked threads.

This is unavoidable with atomic operations. If you want to do better you should write your algorithm in a way that it does not require atomic operations. E.g. duplicate counters instead and do a reduction.