Good solution?


my device does not support atomic functions in shared memory. That is why I had to create a workaround:

#define WARPSIZE			32

#define LOG2_WARPSIZE		5



__device__ void sharedAtomicIncrement(unsigned int* s_address)


	unsigned int count;



		//Read shared memory and increment count

		count = *s_address & BITMASK_VALUE;


		//Write value

		*s_address = count | (threadIdx.x << (WARPSIZE - LOG2_WARPSIZE));

	} while ((*s_address & BITMASK_THREADID) != (threadIdx.x << (WARPSIZE - LOG2_WARPSIZE)));


The basic idea is that every thread reads the value at s_address. Afterwards the thread increases the value and writes a unique id for every thread in the warp into the 5 most significant bits. The thread attempts to write to the address until it can read its id from memory.

I know, that active waiting is not an elegant solution, but I could not think of any other. At least I can gurantee a worst case of 32 write attempts.

Do you have any better ideas?



You could replicate the array n times, let n threads do the counting and sum up the replicated entries in the end. Doing the counting with just one single thread might also be feasible if there is a sufficient amount of parallel work besides counting.