atomicInc with independent thread scheduling

I have a CUDA program where different threads (of the same block) call atomicInc on the same counter variable, and sometimes two of them get the same return value.

This is a full program (main.cu) that shows the problem:

The CMakeLists.txt:

It must be compiled with -DCMAKE_BUILD_TYPE=Release.


Basically there is a list of spot blocks, containing spots. Each spot has XY coordinates (col, row). The kernel needs to find, for each spot, the spots that lie within a certain window (col/row difference) around it, and put them into a list in shared memory.
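For reference, here is a minimal sketch of the data layout being described; the names and sizes (Spot, spotBlocks, SPOTS_PER_BLOCK, NUM_SPOT_BLOCKS) are placeholders, not the actual definitions from main.cu:

```
// Placeholder data layout, not the original program's definitions.
struct Spot {
    int col;   // X coordinate
    int row;   // Y coordinate
};

constexpr int SPOTS_PER_BLOCK = 32;   // spots per spot block (assumed)
constexpr int NUM_SPOT_BLOCKS = 64;   // total number of spot blocks (assumed)

// Global list of spot blocks: spotBlocks[blockIndex][spotIndex].
__device__ Spot spotBlocks[NUM_SPOT_BLOCKS][SPOTS_PER_BLOCK];
```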

The kernel is launched with a fixed number of warps. A CUDA block corresponds to a group of spot blocks (here 3); these are called the local spot blocks.

First it takes the spots from the block's 3 spot blocks and copies them into shared memory (localSpots[]). For this it uses one warp per spot block, so that the spots can be read coalesced. Each thread in the warp handles one spot of the local spot block. The spot block indices are hardcoded here (blocks[]).

Then it goes through the surrounding spot blocks: these are all the spot blocks that may contain spots close enough to a spot in the local spot blocks. The surrounding spot blocks' indices are also hardcoded here (sblocks[]).

In this example it uses only the first warp for this and traverses sblocks[] iteratively. Each thread in the warp handles one spot of the surrounding spot block. It also iterates through the list of all the local spots. If the thread's spot is close enough to a local spot, it inserts it into that local spot's list, using atomicInc to get an index.
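Since the listing is not reproduced here, the following is a rough sketch of the kernel structure just described, reusing the placeholder definitions from the sketch above. The window size, list capacity, surrounding-block count, and hardcoded indices are made-up values; only the overall structure (copy phase, then search phase with atomicInc_block on a shared-memory counter) follows the description.

```
#include <cstdio>   // device-side printf

constexpr int NUM_LOCAL_BLOCKS = 3;   // local spot blocks per CUDA block
constexpr int NUM_SURROUNDING  = 8;   // surrounding spot blocks (assumed)
constexpr int MAX_NEIGHBORS    = 16;  // per-spot list capacity (assumed)
constexpr int WINDOW           = 16;  // col/row window size (assumed)

// Launch with one CUDA block of NUM_LOCAL_BLOCKS * 32 threads, e.g. <<<1, 96>>>.
__global__ void findNeighbors()
{
    // Local spots plus, per local spot, a neighbor list and a counter.
    __shared__ Spot         localSpots[NUM_LOCAL_BLOCKS * SPOTS_PER_BLOCK];
    __shared__ Spot         neighbors [NUM_LOCAL_BLOCKS * SPOTS_PER_BLOCK][MAX_NEIGHBORS];
    __shared__ unsigned int counters  [NUM_LOCAL_BLOCKS * SPOTS_PER_BLOCK];

    const int warpId = threadIdx.x / 32;
    const int laneId = threadIdx.x % 32;

    // Copy phase: one warp per local spot block, one lane per spot,
    // so that the global memory reads are coalesced.
    const int blocks[NUM_LOCAL_BLOCKS] = { 0, 1, 2 };   // hardcoded
    if (warpId < NUM_LOCAL_BLOCKS) {
        localSpots[warpId * SPOTS_PER_BLOCK + laneId] =
            spotBlocks[blocks[warpId]][laneId];
        counters[warpId * SPOTS_PER_BLOCK + laneId] = 0;
    }
    __syncthreads();

    // Search phase: only the first warp traverses the surrounding spot
    // blocks; each lane handles one spot of the current surrounding block.
    if (warpId == 0) {
        const int sblocks[NUM_SURROUNDING] = { 0, 1, 2, 3, 4, 5, 6, 7 };   // hardcoded
        for (int s = 0; s < NUM_SURROUNDING; ++s) {
            const Spot spot = spotBlocks[sblocks[s]][laneId];
            for (int l = 0; l < NUM_LOCAL_BLOCKS * SPOTS_PER_BLOCK; ++l) {
                if (abs(spot.col - localSpots[l].col) <= WINDOW &&
                    abs(spot.row - localSpots[l].row) <= WINDOW) {
                    // Reserve the next slot in this local spot's list.
                    unsigned int idx =
                        atomicInc_block(&counters[l], MAX_NEIGHBORS - 1);
                    neighbors[l][idx] = spot;
                    // Diagnostic print for one particular local spot, in the
                    // spirit of the output shown below: index(lane:counter address).
                    if (localSpots[l].row == 37 && localSpots[l].col == 977)
                        printf("%02u(%d:%p)\n", idx, laneId, (void *)&counters[l]);
                }
            }
        }
    }
}
```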

When executed, the printf shows that for a given local spot (here the one with row=37, col=977), indices are sometimes repeated or skipped.

The program is reduced from a real program that is more complex/optimized, but it already shows the problem. It also runs only one CUDA block here. If I add __syncwarp() between iterations of the outer for loop, so that the threads in the warp execute in lock-step, the problem disappears. Using a loop with atomicCAS instead of atomicInc also seems to solve it (see the sketch below).
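For illustration, the atomicCAS-based workaround can be sketched as a hypothetical drop-in replacement for the atomicInc_block call above (not the exact code used):

```
// Emulate atomicInc's wrap-around semantics with a compare-and-swap loop.
// Returns the old counter value, which is used as the list index.
__device__ unsigned int incCounter(unsigned int *counter, unsigned int max)
{
    unsigned int old = *counter;
    while (true) {
        unsigned int next    = (old >= max) ? 0 : old + 1;
        unsigned int assumed = old;
        old = atomicCAS(counter, assumed, next);
        if (old == assumed)   // no other thread changed the counter in between
            return old;       // this thread owns index "old"
    }
}
```

With this helper, the reservation in the search phase would become idx = incCounter(&counters[l], MAX_NEIGHBORS - 1);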


It produces this output:

00(0:00000221E40003E0)
01(2:00000221E40003E0)
02(7:00000221E40003E0)
03(1:00000221E40003E0)
03(2:00000221E40003E0)
04(3:00000221E40003E0)
04(1:00000221E40003E0)
05(4:00000221E40003E0)
06(6:00000221E40003E0)
07(2:00000221E40003E0)
08(3:00000221E40003E0)
09(6:00000221E40003E0)
10(3:00000221E40003E0)
11(5:00000221E40003E0)
12(0:00000221E40003E0)
13(1:00000221E40003E0)
14(3:00000221E40003E0)
15(1:00000221E40003E0)
16(0:00000221E40003E0)
17(3:00000221E40003E0)
18(0:00000221E40003E0)
19(2:00000221E40003E0)
20(4:00000221E40003E0)
21(4:00000221E40003E0)
22(1:00000221E40003E0)

For example, the lines starting with 03 show that two threads (1 and 2) get the same result (3) after calling atomicInc_block on the same counter (at 0x00000221E40003E0).


Is it necessary to add __threadfence() calls and/or to make the counter variable volatile, or is there another bug in this program? According to the CUDA Programming Guide, the atomic functions do not imply memory ordering constraints.

As I suggested here, my recommendation is to file a bug. The instructions are linked in a sticky post at the top of this sub-forum.

Filed a bug report: https://developer.nvidia.com/nvidia_bug/3312829