Multiple writes to the same location using thread


I have been following the earlier posts regarding multiple writes to the same memory location. They all suggest using a threadId tag to check whether the write was successful.

I tried the same thing on simple examples and, like the others, didn't manage to get it to work.

Has anybody succeeded with this technique?

It would be a great help as I am new to CUDA and to GPU programming.



Tagging and waiting for success works safely only on a per-warp basis, when each warp governs its own shared memory range. Tagging and updating shared memory is split across multiple GPU instructions, so generally speaking it’s not atomic. But since threads in a warp always execute the same instruction, the trick works within a warp.
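For reference, a minimal sketch of the per-warp tagging loop described above. The names (addData, histKernel, TAG_SHIFT, COUNT_MASK) and the 5-bit tag layout are illustrative assumptions, not taken from the earlier posts. The __syncwarp() and __any_sync() calls are a further assumption for newer GPUs (Volta and later, CUDA 9+), where the lanes of a warp are not guaranteed to run in lockstep; on older hardware the loop was usually written without them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BIN_COUNT  32
#define WARP_SIZE  32
#define WARP_COUNT 2                 // 64 threads per block = 2 warps
#define TAG_SHIFT  27                // assumption: upper 5 bits hold the lane tag
#define COUNT_MASK 0x07FFFFFFU       // lower 27 bits hold the actual count

// Increment s_WarpHist[bin] using the tag-and-verify trick.
// All 32 lanes of a warp must call this together, and s_WarpHist
// must be private to the calling warp.
__device__ void addData(volatile unsigned int *s_WarpHist,
                        unsigned int bin, unsigned int threadTag)
{
    bool done = false;
    do {
        unsigned int count = 0;
        if (!done) {
            // Strip the previous winner's tag, add one, stamp our own tag.
            count = threadTag | ((s_WarpHist[bin] & COUNT_MASK) + 1);
        }
        __syncwarp();                         // every pending lane has read
        if (!done) s_WarpHist[bin] = count;   // colliding lanes: one write wins
        __syncwarp();                         // every write has landed
        // The lane whose tagged value survived is finished; the others retry.
        if (!done && s_WarpHist[bin] == count) done = true;
    } while (__any_sync(0xFFFFFFFFU, !done));
}

__global__ void histKernel(const unsigned int *data, unsigned int *result)
{
    // One histogram row per warp, so warps never share a counter.
    __shared__ unsigned int s_Hist[WARP_COUNT][BIN_COUNT];
    const unsigned int warp = threadIdx.x / WARP_SIZE;
    const unsigned int lane = threadIdx.x % WARP_SIZE;

    s_Hist[warp][lane] = 0;          // BIN_COUNT == WARP_SIZE: one bin per lane
    __syncthreads();

    addData(s_Hist[warp], data[threadIdx.x] % BIN_COUNT, lane << TAG_SHIFT);
    __syncthreads();

    // Merge the per-warp rows, dropping the tag bits.
    if (threadIdx.x < BIN_COUNT) {
        unsigned int sum = 0;
        for (int w = 0; w < WARP_COUNT; ++w)
            sum += s_Hist[w][threadIdx.x] & COUNT_MASK;
        result[threadIdx.x] = sum;
    }
}

int main()
{
    const int n = WARP_COUNT * WARP_SIZE;
    unsigned int *d_data = 0, *d_result = 0;
    unsigned int h_result[BIN_COUNT];

    cudaMalloc((void **)&d_data, n * sizeof(unsigned int));
    cudaMalloc((void **)&d_result, BIN_COUNT * sizeof(unsigned int));
    cudaMemset(d_data, 0, n * sizeof(unsigned int));   // all inputs fall in bin 0

    histKernel<<<1, n>>>(d_data, d_result);
    cudaMemcpy(h_result, d_result, sizeof(h_result), cudaMemcpyDeviceToHost);

    printf("bin 0 = %u\n", h_result[0]);
    cudaFree(d_data);
    cudaFree(d_result);
    return h_result[0] == (unsigned int)n ? 0 : 1;     // nonzero exit on a lost update
}
```

The key design point is that each warp gets its own row of counters. Within a row, the tag only has to distinguish the 32 lanes of one warp, which is why 5 tag bits are enough in this sketch.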

Hi Victor,

Thank you for the advice.

I was trying a very simple example, assuming that only one warp in a block is active at a time, which means that only 32 threads work in parallel at any moment. I have an input data stream of 64 elements, all holding the value zero. My kernel has a single block with 64 threads. In the kernel each thread operates on a single element (i.e. thread 0 uses element 0 and thread 63 uses element 63 of the input array).

In the kernel I declare a single histogram array in shared memory as

__shared__ unsigned int histos[BIN_COUNT][WARP_SIZE] (i.e. histos[32][32]), which means that each thread in a warp has its own histogram. The same histogram array can be used for the next (second) warp, as it is scheduled after the first warp.

In this case tagging should not be needed: there should be no writes to the same location, since each thread in a warp has its own histogram and two warps do not execute at the same time. If my assumptions are correct, then when I sum all the rows of histos[BIN_COUNT][WARP_SIZE] to form the global histogram, bin 0 should hold the value 64, but it does not. In fact it holds the value 32, which means that only one warp succeeded. Is that possible if only one warp executes at a time?
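One way to test the "only one warp at a time" assumption is to remove the need for it. The sketch below follows the setup described above but gives every thread in the block its own column, so histos is sized [BIN_COUNT][THREAD_COUNT] rather than [BIN_COUNT][WARP_SIZE]. This is an assumption about the cause, not a confirmed diagnosis: if instructions from the two warps can interleave, both warps can read the same counter before either writes it back, and one warp's increments are lost. The names perThreadHistogram and THREAD_COUNT are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BIN_COUNT    32
#define THREAD_COUNT 64              // one block, two warps

__global__ void perThreadHistogram(const unsigned int *input,
                                   unsigned int *globalHist)
{
    // One column per thread in the block (8 KB), so no counter is ever
    // shared between threads, whichever way the warps are scheduled.
    __shared__ unsigned int histos[BIN_COUNT][THREAD_COUNT];
    const unsigned int tid = threadIdx.x;

    // Each thread clears its own column.
    for (int bin = 0; bin < BIN_COUNT; ++bin)
        histos[bin][tid] = 0;
    __syncthreads();

    // Read-modify-write on a location only this thread touches.
    histos[input[tid] % BIN_COUNT][tid] += 1;
    __syncthreads();

    // Thread b sums row b across all columns to form the global histogram.
    if (tid < BIN_COUNT) {
        unsigned int sum = 0;
        for (int t = 0; t < THREAD_COUNT; ++t)
            sum += histos[tid][t];
        globalHist[tid] = sum;
    }
}

int main()
{
    unsigned int *d_input = 0, *d_hist = 0;
    unsigned int h_hist[BIN_COUNT];

    cudaMalloc((void **)&d_input, THREAD_COUNT * sizeof(unsigned int));
    cudaMalloc((void **)&d_hist, BIN_COUNT * sizeof(unsigned int));
    cudaMemset(d_input, 0, THREAD_COUNT * sizeof(unsigned int));  // 64 zeros

    perThreadHistogram<<<1, THREAD_COUNT>>>(d_input, d_hist);
    cudaMemcpy(h_hist, d_hist, sizeof(h_hist), cudaMemcpyDeviceToHost);

    printf("bin 0 = %u\n", h_hist[0]);
    cudaFree(d_input);
    cudaFree(d_hist);
    return h_hist[0] == THREAD_COUNT ? 0 : 1;  // nonzero exit if counts were lost
}
```

If this variant counts all 64 elements while the [BIN_COUNT][WARP_SIZE] version counts only 32, that would suggest the two warps were updating the shared columns concurrently rather than strictly one after the other.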