I’m trying to understand how various flavours of atomics (add) are implemented at the sass level so I’ve written a few code samples. I’m mostly interested in compute architectures 8.6 and 8.9.
From my tests, it seems that float and int atomic adds are handled with a single dedicated RED instruction. For instance:
global void global_atomicAdd_float(float* restrict a) {
const float value = 1.0f;
atomicAdd(a, value);
}
compiles to the following:
global_atomicAdd_float(float*):
IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
IMAD.MOV.U32 R5, RZ, RZ, 0x3f800000
MOV R2, c[0x0][0x160]
ULDC.64 UR4, c[0x0][0x118]
MOV R3, c[0x0][0x164]
RED.E.ADD.F32.FTZ.RN.STRONG.GPU [R2.64], R5
EXIT
I observe very similar sass code for ints, doubles and uint64s, with various overloads of the RED instruction. However, things get much weirder with the half2 type:
global void global_atomicAdd_half2(__half2* restrict a) {
const __half2 value = __float2half2_rn(1.0f);
atomicAdd(a, value);
}
Which gets compiled to the following:
global_atomicAdd_half2(__half2*):
IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
MOV R5, 0x3c003c00
IMAD.MOV.U32 R2, RZ, RZ, c[0x0][0x160]
MOV R3, c[0x0][0x164]
ULDC.64 UR4, c[0x0][0x118]
ATOM.E.ADD.F16x2.RN.STRONG.GPU P0, RZ, [R2.64], R5
@P0 EXIT
QSPC.E.S P0, RZ, [R2]
@!P0 BRA (.L_x_4)
ULDC UR4, c[0x0][0x160]
ULOP3.LUT UR4, UR4, 0xffffff, URZ, 0xc0, !UPT
.L_x_5:
LDS R2, [UR4]
IMAD.U32 R0, RZ, RZ, UR4
HADD2 R3, R2, 1, 1
ATOMS.CAST.SPIN R3, [R0], R2, R3
ISETP.EQ.U32.AND P0, PT, R3, 0x1, PT
@!P0 BRA (.L_x_5)
EXIT
.L_x_4:
LD.E R0, [R2.64]
IADD3 R5, R0, 0x3c003c00, RZ
ST.E [R2.64], R5
EXIT
As we can see, the RED instruction is replaced with an ATOM instruction which returns a predicate P0.
Based on the predicate, the threads will exit immediately or perform the addition with a compare and swap idiom in shared memory. I’m guessing that the QSPC.E.S instruction tests if the pointer is in shared memory or not. But it’s weird to do so because the pointer is a global memory pointer since it’s a kernel argument. Running the kernel with nsight compute shows that none of the instructions after the first EXIT are executed so the added code is entirely useless.
I also tried using “__builtin_assume(__isGlobal(a));” before the atomic add but it has no effect on the generated code.
What’s going on? Is it a bug or is it expected behavior? Is there any way to tell the compiler to remove the extraneous code?