I have a CUDA program where different threads (of the same block) call atomicInc
on the same counter variable, and sometimes get the same result value.
This is a full program (main.cu) that shows the problem:
The CMakeLists.txt:
It must be compiled with -DCMAKE_BUILD_TYPE=Release.
Basically there is a list of spot blocks, containing spots. Each spot has XY coordinates (col, row). The kernel needs to find, for each spot, the spots that lie within a certain window (col/row difference) around it, and put them into a per-spot list in shared memory.
The kernel is called with a fixed number of warps. A CUDA block corresponds to a group of spot blocks (here 3); these are called the local spot blocks.
First it takes the spots from the block's 3 spot blocks and copies them into shared memory (localSpots[]). For this it uses one warp per spot block, so that the spots can be read coalesced. Each thread in the warp handles one spot of the local spot block. The spot block indices are hardcoded here (blocks[]).
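The copy step above might look roughly like this. This is a hedged sketch only: the names Spot, spotBlocks, SPOTS_PER_BLOCK, and the layout of localSpots[] are my assumptions, not taken from the original program.

```cuda
// Sketch: one warp per local spot block, each lane loads one spot so the
// global reads are coalesced. All identifiers here are illustrative.
__shared__ Spot localSpots[3 * SPOTS_PER_BLOCK];

const int warpId = threadIdx.x / 32;
const int lane   = threadIdx.x % 32;
if (warpId < 3) {                          // one warp per local spot block
    const int sb = blocks[warpId];         // hardcoded spot block index
    if (lane < SPOTS_PER_BLOCK) {
        localSpots[warpId * SPOTS_PER_BLOCK + lane] =
            spotBlocks[sb].spots[lane];    // coalesced load
    }
}
__syncthreads();  // make all local spots visible before the search phase
```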
Then it goes through the surrounding spot blocks: all the spot blocks that may contain spots close enough to a spot in the local spot blocks. The surrounding spot block indices are also hardcoded here (sblocks[]).
In this example only the first warp is used for this, traversing sblocks[] iteratively. Each thread in the warp handles one spot of the surrounding spot block, and iterates through the list of all the local spots. If the thread's spot is close enough to a local spot, it inserts it into that local spot's list, using atomicInc to get an index.
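The search/insert step described above can be sketched as follows. Again, counters[], lists[], WINDOW, NUM_SBLOCKS, and numLocalSpots are hypothetical names standing in for the real program's variables:

```cuda
// Sketch of the search phase, first warp only; each lane holds one
// candidate spot from the current surrounding spot block.
for (int b = 0; b < NUM_SBLOCKS; ++b) {           // traverse sblocks[]
    Spot s = spotBlocks[sblocks[b]].spots[lane];  // one spot per lane
    for (int i = 0; i < numLocalSpots; ++i) {
        if (abs(s.col - localSpots[i].col) <= WINDOW &&
            abs(s.row - localSpots[i].row) <= WINDOW) {
            // reserve a slot in local spot i's list
            unsigned idx = atomicInc(&counters[i], 0xFFFFFFFFu);
            lists[i][idx] = s;
        }
    }
}
```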
When executed, the printf output shows that for a given local spot (here the one with row=37, col=977), indices are sometimes repeated or skipped.
The program is extracted from a real program that is more complex and optimized, but it already exhibits the problem. It also runs only one CUDA block here. If I add __syncwarp() between iterations of the outer for loop, so that the threads in the warp execute in lock-step, the problem disappears. Using a loop with atomicCAS instead of atomicInc also seems to solve it.
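For reference, the atomicCAS-based replacement I mean is the usual read-compare-swap retry loop; this is a sketch of my workaround, not code from the original program:

```cuda
// Reserve the next index from *counter using atomicCAS instead of
// atomicInc (no wrap-around handling). The thread whose CAS succeeds
// owns index `old`.
__device__ unsigned reserveIndex(unsigned *counter) {
    unsigned old = *counter;
    unsigned assumed;
    do {
        assumed = old;
        old = atomicCAS(counter, assumed, assumed + 1);
    } while (old != assumed);  // retry if another thread won the race
    return old;                // the reserved index
}
```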
It produces this output:
00(0:00000221E40003E0)
01(2:00000221E40003E0)
02(7:00000221E40003E0)
03(1:00000221E40003E0)
03(2:00000221E40003E0)
04(3:00000221E40003E0)
04(1:00000221E40003E0)
05(4:00000221E40003E0)
06(6:00000221E40003E0)
07(2:00000221E40003E0)
08(3:00000221E40003E0)
09(6:00000221E40003E0)
10(3:00000221E40003E0)
11(5:00000221E40003E0)
12(0:00000221E40003E0)
13(1:00000221E40003E0)
14(3:00000221E40003E0)
15(1:00000221E40003E0)
16(0:00000221E40003E0)
17(3:00000221E40003E0)
18(0:00000221E40003E0)
19(2:00000221E40003E0)
20(4:00000221E40003E0)
21(4:00000221E40003E0)
22(1:00000221E40003E0)
For example, the lines starting with 03 show that two threads (1 and 2) get the same result (3) after calling atomicInc_block on the same counter (at 0x00000221E40003E0).
Is it necessary to add __threadfence() calls, and/or to make the counter variable volatile, or is there another bug in this program? According to the CUDA programming guide, atomic functions do not imply memory ordering constraints.