atomicInc for shared memory with CC1.1


I have got a CC 1.1 device (no atomic functions for shared memory) and need an atomicInc function for shared memory. I use the following implementation adapted from the whitepaper on histogram generation from the SDK:

__device__ void dirtyAtomicInc(volatile unsigned int *address, unsigned int threadTag)


	unsigned int count;


		count = *address & 0x0000FFFF;

		count = threadTag | (count + 1);

		*address = count;

	} while(*address != count);


The function splits the location to write to into two parts: the 2 most significant bytes contain the thread tag (e.g. the thread id), the 2 least significant bytes contain the value. Both parts are updated in the while loop and repeatedly written until the thread can read what it has written.

I call the function like this:

dirtyAtomicInc(&s_histo[index], threadIdx.x << 16);

However it does not seem to work, as I get race conditions. Any ideas?

Thanks in advance,


This function works only if only threads from the same warp access the same memory location.

The simple way is for-loop.

Suppose you have 128 threads per thread block, then

following for-loop can keep exclusive executions.

for(int i = 0 ; i < 128 ; i++){

    if ( i == tid ){

        count = s_histo[index] & 0x0000FFFF;

        count = (threadIdx.x << 16) | (count + 1);

        s_histo[index] += count ;



However penalty is HUGE, I will not suggest this workaround.

Hi again, tera,

can you explain why? I cannot see why it shouldnt work for all threads?



If the threads do not run in lock-step, one thread may finish the complete read-modify-write-readback cycle in between the read and write of another thread. There is nothing in this code to prevent or detect that.
The loop in this code only triggers if the second write happens between the first write and the readback,