Good solution?

Hi,

my device does not support atomic functions in shared memory. That is why I had to create a workaround:

#define WARPSIZE			32

#define LOG2_WARPSIZE		5

#define BITMASK_THREADID	(0xFFFFFFFF << (WARPSIZE - LOG2_WARPSIZE))

#define BITMASK_VALUE		(0xFFFFFFFF >> LOG2_WARPSIZE)

__device__ void sharedAtomicIncrement(unsigned int* s_address)

{

	unsigned int count;

	do

	{

		//Read shared memory and increment count

		count = *s_address & BITMASK_VALUE;

		count++;

		//Write value

		*s_address = count | (threadIdx.x << (WARPSIZE - LOG2_WARPSIZE));

	} while ((*s_address & BITMASK_THREADID) != (threadIdx.x << (WARPSIZE - LOG2_WARPSIZE)));

}

The basic idea is that every thread reads the value at s_address. Afterwards the thread increases the value and writes a unique id for every thread in the warp into the 5 most significant bits. The thread attempts to write to the address until it can read its id from memory.

I know, that active waiting is not an elegant solution, but I could not think of any other. At least I can gurantee a worst case of 32 write attempts.

Do you have any better ideas?

Regards,

Kwyjibo

You could replicate the array n times, let n threads do the counting and sum up the replicated entries in the end. Doing the counting with just one single thread might also be feasible if there is a sufficient amount of parallel work besides counting.