I have a CUDA program where different threads (of the same block) call
atomicInc on the same counter variable, and sometimes get the same result value.
This is a full program (main.cu) that shows the problem. It must be compiled with -DCMAKE_BUILD_TYPE=Release.
Basically there is a list of spot blocks, containing spots. Each spot has XY coordinates (col, row). The kernel needs to find, for each spot, the spots that are within a certain window (col/row difference) around it, and put them into a list in shared memory.
The kernel is called with a fixed number of warps. A CUDA block corresponds to a group of spot blocks (here 3); these are called the local spot blocks.
First it takes the spots from the block’s 3 spot blocks and copies them into shared memory (localSpots). For this it uses one warp per spot block, so that the spots can be read coalesced. Each thread in the warp handles one spot of its local spot block. The spot block indices are hardcoded here.
Then it goes through the surrounding spot blocks: these are all the spot blocks that may contain spots close enough to a spot in the local spot blocks. The surrounding spot blocks’ indices are also hardcoded here.
In this example it uses only the first warp for this, and traverses sblocks iteratively. Each thread in the warp handles one spot of the surrounding spot block. It also iterates through the list of all the local spots. If the thread’s spot is close enough to a local spot, it inserts it into that local spot’s list, using atomicInc to get an index.
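For readers without the full main.cu at hand, this is a minimal sketch of the insertion pattern described above. It is not the original program: the names (findNeighbors, localSpots, counts, lists) and the fixed sizes are illustrative, and atomicInc_block requires compute capability 6.0 or higher.

```cuda
struct Spot { int row, col; };

#define NUM_LOCAL_SPOTS 64   // spots copied from the local spot blocks
#define MAX_NEIGHBORS   32   // capacity of each local spot's neighbor list

__global__ void findNeighbors(const Spot* localIn, int numLocal,
                              const Spot* surround, int numSurround,
                              int window)
{
    __shared__ Spot localSpots[NUM_LOCAL_SPOTS];
    __shared__ unsigned int counts[NUM_LOCAL_SPOTS];
    __shared__ int lists[NUM_LOCAL_SPOTS][MAX_NEIGHBORS];

    // Copy the local spots into shared memory and zero the per-spot counters
    // (assumes numLocal <= NUM_LOCAL_SPOTS).
    for (int i = threadIdx.x; i < numLocal; i += blockDim.x) {
        localSpots[i] = localIn[i];
        counts[i] = 0;
    }
    __syncthreads();

    // Only the first warp scans the surrounding spots; each lane handles one
    // surrounding spot per iteration of the outer loop.
    if (threadIdx.x < 32) {
        for (int base = 0; base < numSurround; base += 32) {
            int s = base + threadIdx.x;
            if (s < numSurround) {
                Spot sp = surround[s];
                // Compare against every local spot.
                for (int l = 0; l < numLocal; ++l) {
                    if (abs(sp.row - localSpots[l].row) <= window &&
                        abs(sp.col - localSpots[l].col) <= window) {
                        // Reserve the next slot in this local spot's list.
                        unsigned int idx = atomicInc_block(&counts[l], 0xffffffffu);
                        if (idx < MAX_NEIGHBORS)
                            lists[l][idx] = s;
                    }
                }
            }
        }
    }
}
```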
When executed, the printf shows that for a given local spot (here the one with row=37, col=977), indices are sometimes repeated or skipped.
The program is extracted from a real program that is more complex/optimized, but it already shows the problem. Here it also runs only one CUDA block. If I add __syncwarp() between iterations of the outer for loop, so that the threads in the warp execute in lock-step, the problem disappears. Using a loop with atomicCAS instead of atomicInc also seems to solve it.
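The atomicCAS variant I mean is the usual compare-and-swap retry loop, roughly like this (a sketch; the helper name is illustrative):

```cuda
// Replacement for atomicInc based on atomicCAS: retry until the
// compare-and-swap succeeds, then return the value that was current
// at the moment of the successful swap (the reserved index).
__device__ unsigned int incrementWithCAS(unsigned int* counter)
{
    unsigned int old = *counter;
    unsigned int assumed;
    do {
        assumed = old;
        old = atomicCAS(counter, assumed, assumed + 1);
    } while (old != assumed);
    return assumed;
}
```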
With the original atomicInc version, it produces this output:
00(0:00000221E40003E0)
01(2:00000221E40003E0)
02(7:00000221E40003E0)
03(1:00000221E40003E0)
03(2:00000221E40003E0)
04(3:00000221E40003E0)
04(1:00000221E40003E0)
05(4:00000221E40003E0)
06(6:00000221E40003E0)
07(2:00000221E40003E0)
08(3:00000221E40003E0)
09(6:00000221E40003E0)
10(3:00000221E40003E0)
11(5:00000221E40003E0)
12(0:00000221E40003E0)
13(1:00000221E40003E0)
14(3:00000221E40003E0)
15(1:00000221E40003E0)
16(0:00000221E40003E0)
17(3:00000221E40003E0)
18(0:00000221E40003E0)
19(2:00000221E40003E0)
20(4:00000221E40003E0)
21(4:00000221E40003E0)
22(1:00000221E40003E0)
For example, the lines with 03 show that two threads (1 and 2) got the same result (3) after calling atomicInc_block on the same counter.
Do I need to add __threadfence() calls, and/or make the counter variable volatile, or is there another bug in this program? According to the CUDA programming guide, the atomic functions do not imply memory ordering constraints.