Tagging and waiting for success works safely only on a per-warp basis, when each warp governs its own shared memory range. Tagging and updating a shared memory location is split into multiple GPU instructions, so generally speaking it is not atomic. But since threads in a warp always execute the same instruction, the trick works within a warp.
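The tag-and-retry idea can be sketched as below. This is a minimal illustration, not the SDK's code: the names addTagged, taggedHistogram and taggedBin0Count are hypothetical, and the explicit __syncwarp() / __any_sync() calls are an added assumption so that the lockstep behaviour the trick relies on is spelled out (newer GPUs do not guarantee implicit lockstep). It assumes CUDA 9 or later and a block size that is a multiple of 32.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE       32
#define LOG2_WARP_SIZE  5
#define BIN_COUNT       64
#define THREADS         64                        // two warps, as in the example discussed here
#define WARPS           (THREADS / WARP_SIZE)
#define TAG_SHIFT       (32 - LOG2_WARP_SIZE)     // top 5 bits hold the lane tag
#define COUNT_MASK      ((1u << TAG_SHIFT) - 1u)  // low 27 bits hold the count

// Hypothetical helper: every lane of the warp must call this together.
__device__ void addTagged(volatile unsigned int *warpHist,
                          unsigned int bin, unsigned int tag)
{
    bool done = false;
    do {
        unsigned int tagged = 0;
        // 1. Read the current count and build "my tag | count + 1".
        if (!done) tagged = tag | ((warpHist[bin] & COUNT_MASK) + 1u);
        __syncwarp();
        // 2. Write it; if several lanes hit the same bin, one write survives.
        if (!done) warpHist[bin] = tagged;
        __syncwarp();
        // 3. The lane whose tag survived has succeeded; the others retry.
        if (!done && warpHist[bin] == tagged) done = true;
    } while (__any_sync(0xffffffffu, !done));
}

__global__ void taggedHistogram(const unsigned int *data, unsigned int *out)
{
    // One private sub-histogram per warp: each warp governs its own range.
    __shared__ unsigned int hist[WARPS][BIN_COUNT];
    const unsigned int lane = threadIdx.x % WARP_SIZE;
    const unsigned int warp = threadIdx.x / WARP_SIZE;

    for (unsigned int i = threadIdx.x; i < WARPS * BIN_COUNT; i += blockDim.x)
        hist[i / BIN_COUNT][i % BIN_COUNT] = 0;
    __syncthreads();

    addTagged(hist[warp], data[threadIdx.x] % BIN_COUNT, lane << TAG_SHIFT);
    __syncthreads();

    // Merge the per-warp sub-histograms, stripping the tag bits.
    for (unsigned int bin = threadIdx.x; bin < BIN_COUNT; bin += blockDim.x) {
        unsigned int sum = 0;
        for (unsigned int w = 0; w < WARPS; ++w)
            sum += hist[w][bin] & COUNT_MASK;
        out[bin] = sum;
    }
}

// Hypothetical host wrapper: 64 zero-valued inputs, returns the count in bin 0.
unsigned int taggedBin0Count()
{
    unsigned int *dData, *dOut, hOut[BIN_COUNT];
    cudaMalloc(&dData, THREADS * sizeof(unsigned int));
    cudaMalloc(&dOut, BIN_COUNT * sizeof(unsigned int));
    cudaMemset(dData, 0, THREADS * sizeof(unsigned int));
    taggedHistogram<<<1, THREADS>>>(dData, dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dOut);
    return hOut[0];
}
```

With all 64 inputs equal to zero, every lane of a warp contends for bin 0 of that warp's sub-histogram. Exactly one lane wins per round, so each warp should end with a count of 32 and taggedBin0Count() should return 64.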
I tried a very simple example, assuming that only one warp of a block is active at a time, so that only 32 threads work in parallel at any moment. I have an input data stream of 64 elements, all holding the value zero. My kernel runs a single block of 64 threads, and each thread operates on a single element (i.e. thread 0 uses element 0 and thread 63 uses element 63 of the input array).
In the kernel I declare a single histogram array in shared memory as
__shared__ unsigned int histos[BIN_COUNT][WARP_SIZE];
which means that each thread in a warp has its own histogram. The same array can be reused by the next (second) warp, since it is scheduled after the first warp.
In this case tagging should not be needed: each thread in a warp has its own histogram, so no two threads of a warp write to the same location, and I assume that two warps do not execute at the same time. If my assumptions are correct, then when I sum each row of histos[BIN_COUNT][WARP_SIZE] to form the final histogram, bin 0 should hold the value 64, but it does not. In fact it holds 32, which suggests that only one warp's updates succeeded. Is that possible if only one warp executes at a time?
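For reference, the experiment described above can be reconstructed roughly as follows. This is a sketch under assumptions, since the original kernel is not shown: the names sharedLaneHistogram and untaggedBin0Count are hypothetical, and BIN_COUNT is assumed to be 64. Note that both warps update the same histos[bin][lane] cell with a plain, untagged increment, so the result depends on whether the two warps' read-modify-write sequences really never interleave, which is exactly the assumption being tested.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define BIN_COUNT 64   // assumed value, not stated in the post
#define THREADS   64   // one block, two warps

__global__ void sharedLaneHistogram(const unsigned int *data, unsigned int *out)
{
    // One histogram column per lane, shared by BOTH warps of the block.
    __shared__ unsigned int histos[BIN_COUNT][WARP_SIZE];
    const unsigned int lane = threadIdx.x % WARP_SIZE;

    // Clear shared memory before counting.
    for (unsigned int i = threadIdx.x; i < BIN_COUNT * WARP_SIZE; i += blockDim.x)
        histos[i / WARP_SIZE][i % WARP_SIZE] = 0;
    __syncthreads();

    // Untagged, non-atomic increment: thread t and thread t + 32 touch the
    // same cell, so this is only correct if the warps never interleave.
    histos[data[threadIdx.x] % BIN_COUNT][lane]++;
    __syncthreads();

    // Sum each row (all lanes of one bin) to form the final histogram.
    for (unsigned int bin = threadIdx.x; bin < BIN_COUNT; bin += blockDim.x) {
        unsigned int sum = 0;
        for (unsigned int l = 0; l < WARP_SIZE; ++l)
            sum += histos[bin][l];
        out[bin] = sum;
    }
}

// Hypothetical host wrapper: 64 zero-valued inputs, returns the count in bin 0.
unsigned int untaggedBin0Count()
{
    unsigned int *dData, *dOut, hOut[BIN_COUNT];
    cudaMalloc(&dData, THREADS * sizeof(unsigned int));
    cudaMalloc(&dOut, BIN_COUNT * sizeof(unsigned int));
    cudaMemset(dData, 0, THREADS * sizeof(unsigned int));
    sharedLaneHistogram<<<1, THREADS>>>(dData, dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dOut);
    return hOut[0];
}
```

Each cell receives two increments starting from zero, so it ends at 1 or 2 and the value returned for bin 0 lies between 32 and 64. A result of 64 would be consistent with the "one warp at a time" assumption, while anything lower means some updates from one warp were overwritten by the other.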