Useful Arbitrary Atomic Operation Hack

I posted to a thread about alternatives to shared atomic operations on G80 hardware, and I got several private messages asking about the technique. This is my workaround for the problem… it's easy enough to understand and to extend. One cool feature, useful on all hardware including GT200 GPUs, is that it allows arbitrary atomic operations: large structure updates or swaps, math ops on doubles, even function calls.

The goal of atomics is to let just one thread at a time update, increment, or swap some value. We may not have hardware atomic support on all GPUs, but we do have one small guarantee from the CUDA programming model we can use: if multiple threads simultaneously write to a memory location, we can't be sure whether "our" thread will have its value written successfully, but we are explicitly told that exactly one thread will succeed.
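As a quick illustration of that guarantee, here's a minimal hypothetical kernel (voteDemo is an assumption for this example, not part of the real code below):

__global__ void voteDemo(int *out)
{
    __shared__ int winner;
    winner = threadIdx.x; // every thread writes; exactly one write survives
    __syncthreads();
    if (0 == threadIdx.x) *out = winner; // the ID of a single, unspecified thread
}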

We use that guarantee to let the threads coordinate and agree on a single thread to do its atomic operation first. One shared memory location serves as a "voting" coordinator: every thread that wants to do an atomic op writes its own thread ID into this coordination address. We sync threads, then every thread reads back the result. The single thread whose write succeeded is now known to all threads, and that one thread can do any operation it likes (such as incrementing a shared memory variable) with a guarantee that no other thread is doing it simultaneously. We repeat the vote until every thread that wants to do an atomic operation has had its chance.

This is inefficient, since we have lots of threads testing and writing. It's also ugly because we're using __syncthreads(), which is a blunt hammer and in particular must not be mixed with divergent code paths where some threads skip the barrier. But if our app respects these limitations, we can still use the idea.
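To show the hazard, here's a deliberately broken hypothetical kernel of the kind this hack must avoid:

// BROKEN: a barrier inside a divergent branch. Threads with ID >= 16 never
// reach the __syncthreads(), so the block hangs or behaves unpredictably.
__global__ void brokenDivergence(void)
{
    if (threadIdx.x < 16) {
        __syncthreads();
    }
}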

One very positive effect of this manual atomic method is that it allows arbitrary atomic computations, even ones spanning multiple instructions or function calls. I'm using it for priority queues. A thread wants to pop a value from a stack atomically, which needs two dependent operations: a decrement of a counter and a read of a variable. Pushing onto the stack is similar. Since I can do these atomically, I can create a stack of "jobs" and have threads pop off work as they need it. Obviously this is very useful for load-balancing computations where some threads finish before others; a sketch of such a pop appears below.
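Here's a hedged sketch of what the critical section for that pop might look like; it would drop in where the float accumulate sits in SharedAtomicAdd below. The stackTop counter, stackData array, and the -1.0f empty sentinel are assumptions for this example:

if (*stackTop > 0) {
    (*stackTop)--;                  // dependent op 1: shrink the stack
    readVal = stackData[*stackTop]; // dependent op 2: read the popped entry
} else {
    readVal = -1.0f;                // assumed sentinel: stack is empty
}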

The strategy won't extend to global atomics, since it relies on __syncthreads(), which only coordinates the warps of a single block.

There are many situations where atomics work but better alternatives exist. If you're just making a block-wide accumulator, it may be better to have each thread accumulate independently and then do a log2-step reduction at the end, as sketched below. But there are plenty of computations which genuinely need atomics.
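A minimal sketch of that alternative, assuming a power-of-two 1D block; the blockSum name and 256-thread size are assumptions for this example:

__global__ void blockSum(const float *in, float *out)
{
    __shared__ float partial[256];          // assumed fixed block size
    partial[threadIdx.x] = in[threadIdx.x]; // each thread accumulates privately
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (0 == threadIdx.x) out[0] = partial[0]; // block-wide sum
}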

Sample code is below. It's annoying to make a general library out of this, since the coordination variable's location has to be wired into it, and the operation itself is arbitrary (for the example below I show a floating-point accumulator). In this example the accumulator lives in the first shared memory word and the coordinating address in the second. The code should be easy enough to modify to perform any operation you like, not just adding a float.

// Shared atomic operations in CUDA example
// Steve Worley  July 2008. This code is released to the public domain.

#include <stdlib.h>
#include <stdio.h>
#include <cutil.h>

// Shared atomic operations in CUDA: a hack that works with all GPUs, even
// compute capability 1.0 devices, and allows any operation including large
// structures, doubles, multiple swaps, stacks, queues, etc.
//
// This is a fragile hack. It uses __syncthreads()! So it's not something to
// mix with divergent warps.
//
// Basic strategy: use the fact that a shared write to one address by multiple
// threads will have a single unique winner. Each thread writes its ID, then
// checks (after a sync) whether it was the winner. If so, we've done a
// successful vote for a single thread, and that one thread can do any
// operation it likes atomically; the remaining threads know they must wait.
// Repeat the vote until every thread that wants to do an atomic operation has
// had its turn.
//
// Inefficient but quite useful!

__device__ float SharedAtomicAdd(volatile float *address, float add,
                                 volatile int *sharedTemp)
{
    int tid = threadIdx.x;   // for this example we assume a 1D block
    float readVal = 0.0f;
    bool needToWrite = true; // is this thread waiting for its turn?

    *sharedTemp = 0xFFFFFFFF; // default value, shows if NO thread asks for an op

    for (;;) {
        __syncthreads();
        if (needToWrite) *sharedTemp = tid; // if we have a job, register our interest!
        __syncthreads();
        if (0xFFFFFFFF == *sharedTemp) {
            // No thread tried to write a value, so everyone is done.
            return readVal;
        }
        __syncthreads(); // keep slow readers from racing the winner's reset below
        if (needToWrite && *sharedTemp == tid) { // our thread won the vote!
            // ATOMIC OPERATION STARTS HERE. It can be any code you like!
            // We are guaranteed that only OUR thread is currently executing.
            readVal = *address;
            *address = readVal + add;
            // ATOMIC OPERATION ENDS HERE
            *sharedTemp = 0xFFFFFFFF; // prepare for the next vote
            needToWrite = false;
        }
    }
    return 0.0f; // unreachable
}

__global__ void testKernel(float *g_odata)
{
    extern __shared__ volatile int shareMem[];
    volatile float *accumulator = (volatile float *)(shareMem + 0);
    volatile int *sharedTemp = shareMem + 1;

    if (0 == threadIdx.x) *accumulator = 0.0f; // initialize to 0
    __syncthreads();

    // Have each thread add 1.0 three times. The net accumulated sum should be
    // 3.0*threadcount.
    for (int i = 0; i < 3; i++) SharedAtomicAdd(accumulator, 1.0f, sharedTemp);

    __syncthreads();
    if (0 == threadIdx.x) g_odata[0] = *accumulator; // report result back to host
}

int main(int argc, char **argv)
{
    CUT_DEVICE_INIT(argc, argv);

    // allocate device memory for the result
    int mem_size = sizeof(float);
    float *d_odata;
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_odata, mem_size));

    // allocate memory for the result on the host side
    float *h_odata = (float *)malloc(mem_size);

    for (int w = 1; w < 35; w++) {
        // setup execution parameters: try a variety of thread counts,
        // whole and partial warps
        dim3 grid(1, 1, 1);
        dim3 threads(16 * (w >> 1) + 5 * (w & 1), 1, 1);

        // execute the kernel (two shared words: accumulator + coordinator)
        testKernel<<<grid, threads, 2 * sizeof(float)>>>(d_odata);

        // check whether kernel execution generated an error
        CUT_CHECK_ERROR("Kernel execution failed");

        // copy result from device to host
        CUDA_SAFE_CALL(cudaMemcpy(h_odata, d_odata, mem_size,
                                  cudaMemcpyDeviceToHost));

        // Result should be 3 * thread count, since each thread did 3 atomic
        // adds of 1.0.
        printf("Expected value: %u  output: %f\n", 3 * threads.x, h_odata[0]);
    }

    // cleanup memory
    free(h_odata);
    CUDA_SAFE_CALL(cudaFree(d_odata));
    return 0;
}