I have tried to implement a critical section method, following CUDA example for histogram based on their implementation of addData256:
__device__ void addData256(
volatile unsigned int *s_WarpHist,
unsigned int data,
unsigned int threadTag
){
unsigned int count;
do{
count = s_WarpHist[data] & 0x07FFFFFFU;
count = threadTag | (count + 1);
s_WarpHist[data] = count;
}while(s_WarpHist[data] != count);
}
This is my implementation:
__device__ void increament_shared_value(volatile unsigned* bins_sarr, uint idx,uint thread_rank)
{
uint threadTag = (thread_rank & 31U) << 10; //10 bits for information and 5 upper bits for tag
uint count_with_tag = -1;
uint data_without_tag = -1;
do
{
data_without_tag = bins_sarr[idx] & 0x03FFU;
count_with_tag = threadTag | (data_without_tag + 1);
bins_sarr[idx] = count_with_tag;
} while (count_with_tag != bins_sarr[idx]);
}
I run a test with idx equals to 0 which means that all threads should increment the same cell in shared memory and I expect to see the block size as a result in bins_sarr[0].
However, each running I get different result that isn’t equal to block size.
Of course I check the result that way: uint unified_result = bins_sarr[0] & 0x03FFU
This method is based on the idea that: "the hardware performs shared memory write
combining, resulting in acceptance of the tagged counter from one thread and rejection from
all other pending threads. "
I appreciate any kind of help toward the best solution.