Implementation of critical section for addition

I have tried to implement a critical-section method, following the CUDA histogram example and its implementation of addData256:

__device__ void addData256(
    volatile unsigned int *s_WarpHist,
    unsigned int data,
    unsigned int threadTag
){
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU;
        count = threadTag | (count + 1);
        s_WarpHist[data] = count;
    } while (s_WarpHist[data] != count);
}

This is my implementation:

__device__ void increament_shared_value(volatile unsigned *bins_sarr, uint idx, uint thread_rank)
{
    uint threadTag = (thread_rank & 31U) << 10; // low 10 bits for the count, 5 upper bits for the tag
    uint count_with_tag = -1;
    uint data_without_tag = -1;

    do {
        data_without_tag = bins_sarr[idx] & 0x03FFU;
        count_with_tag = threadTag | (data_without_tag + 1);
        bins_sarr[idx] = count_with_tag;
    } while (count_with_tag != bins_sarr[idx]);
}

I ran a test with idx equal to 0, which means all threads should increment the same cell in shared memory, so I expect bins_sarr[0] to end up equal to the block size.
However, on each run I get a different result that is not equal to the block size.
Of course, I read the result with the tag masked off: uint unified_result = bins_sarr[0] & 0x03FFU;

This method is based on the idea that: "the hardware performs shared memory write combining, resulting in acceptance of the tagged counter from one thread and rejection from all other pending threads."

I appreciate any kind of help toward the best solution.

The addData256 function is intended to operate on a warp-private histogram:

https://pure.tue.nl/ws/portalfiles/portal/8808771/20151125_Braak.pdf

It seems fairly evident that it cannot be used by multiple warps at the same time, in the same threadblock, operating on the same histogram (bins). Do you get the expected results if you limit your block size to 32?

It’s hard to make any suggestions about what would be better here, without knowing your intent. This algorithm mostly appears to be a clever way to craft the equivalent of atomics without using atomics. However modern GPUs have atomics and they are pretty fast. If you simply want a more general critical section I would suggest something like this:
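The code referenced there was not included above. Purely as an illustration, and not necessarily what was originally posted, a general critical section built on atomicCAS might be sketched like this (all names are assumed):

```cuda
__device__ int g_lock = 0;  // 0 = free, 1 = held (global-scope lock variable)

__device__ void critical_increment(unsigned int *bins, unsigned int idx)
{
    bool done = false;
    // Loop-with-if pattern: required on pre-Volta GPUs, where a bare
    // spin-wait inside a warp can deadlock due to lockstep execution.
    while (!done) {
        if (atomicCAS(&g_lock, 0, 1) == 0) {  // try to acquire the lock
            bins[idx] += 1;                   // critical section
            __threadfence();                  // publish the write before releasing
            atomicExch(&g_lock, 0);           // release the lock
            done = true;
        }
    }
}
```

Note that this serializes every contending thread, so for a histogram it will be far slower than a plain atomicAdd; it is only worth considering when the protected work is more complex than a single addition.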

I am trying to implement a 1024-bin histogram.
Setting the block size to 32 or 64 (Turing architecture) resolves the issue, since at most two warps are then executing in parallel.
I will have to use atomicAdd. I have seen that since Maxwell, atomic operations on shared memory are well supported.
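For reference, a block-level 1024-bin histogram using shared-memory atomicAdd might be sketched as follows (kernel and parameter names are my own, not from this thread):

```cuda
#define NUM_BINS 1024

__global__ void histogram1024(const unsigned int *data, unsigned int *hist, int n)
{
    __shared__ unsigned int s_bins[NUM_BINS];

    // Zero the shared bins cooperatively.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        s_bins[i] = 0;
    __syncthreads();

    // Grid-stride loop: each thread accumulates into shared memory with atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&s_bins[data[i] & (NUM_BINS - 1)], 1u);
    __syncthreads();

    // Merge the block-local histogram into the global one.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&hist[i], s_bins[i]);
}
```

On Maxwell and later, the shared-memory atomicAdd calls here are handled natively in hardware, which is why this pattern is usually much faster than any hand-rolled tagging or locking scheme.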

Thank you for the assistance.