Hi,
my device does not support atomic functions in shared memory. That is why I had to create a workaround:
#define WARPSIZE 32
#define LOG2_WARPSIZE 5
#define BITMASK_THREADID (0xFFFFFFFF << (WARPSIZE - LOG2_WARPSIZE))
#define BITMASK_VALUE (0xFFFFFFFF >> LOG2_WARPSIZE)
__device__ void sharedAtomicIncrement(unsigned int* s_address)
{
unsigned int count;
do
{
//Read shared memory and increment count
count = *s_address & BITMASK_VALUE;
count++;
//Write value
*s_address = count | (threadIdx.x << (WARPSIZE - LOG2_WARPSIZE));
} while ((*s_address & BITMASK_THREADID) != (threadIdx.x << (WARPSIZE - LOG2_WARPSIZE)));
}
The basic idea is that every thread reads the value at s_address. Afterwards the thread increases the value and writes a unique id for every thread in the warp into the 5 most significant bits. The thread attempts to write to the address until it can read its id from memory.
I know, that active waiting is not an elegant solution, but I could not think of any other. At least I can gurantee a worst case of 32 write attempts.
Do you have any better ideas?
Regards,
Kwyjibo