Hi!
I know this question can be found in several places (most notably Why does the atomicAdd work faster than ‘+=’? - #3 by Alexander07K) but I’d really like to understand this better.
My kernel (occupancy of 32 warps/SM, compiled for compute_61, running on an RTX 2070) looks like this:
template<typename T, bool exch>
__global__ void myKernel(T* target, ...)
{
    const size_t index = /* calculate index */;
    const float repValue = /* calculate value */;
    if (exch) {
        atomicExch(target + index, repValue);
    } else {
        target[index] = repValue;
    }
}
Now, the whole kernel runs about 33% faster when using atomicExch.
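For completeness, a standalone harness along these lines reproduces the comparison (the index and value computations here are placeholder stand-ins, not my real ones, and the problem size is arbitrary):

```cuda
#include <cstdio>

template<typename T, bool exch>
__global__ void myKernel(T* target, size_t n)
{
    // Placeholder index/value computation, standing in for the real one
    const size_t index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;
    const float repValue = static_cast<float>(index) * 0.5f;
    if (exch) {
        atomicExch(target + index, repValue);
    } else {
        target[index] = repValue;
    }
}

int main()
{
    const size_t n = 1 << 24;
    float* d_target = nullptr;
    cudaMalloc(&d_target, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const dim3 block(256);
    const dim3 grid(static_cast<unsigned>((n + block.x - 1) / block.x));

    // Time the plain-store variant
    cudaEventRecord(start);
    myKernel<float, false><<<grid, block>>>(d_target, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msStore = 0.f;
    cudaEventElapsedTime(&msStore, start, stop);

    // Time the atomicExch variant
    cudaEventRecord(start);
    myKernel<float, true><<<grid, block>>>(d_target, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msExch = 0.f;
    cudaEventElapsedTime(&msExch, start, stop);

    printf("plain store: %.3f ms, atomicExch: %.3f ms\n", msStore, msExch);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_target);
    return 0;
}
```

(With this toy index computation every thread writes a distinct location, like in my kernel, so the two variants are semantically equivalent.)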
The SASS (this is the debugger's disassembly, not PTX) looks like one might expect, if I understand it correctly:
target[index] = repValue;
0x0000028f36796990 MOV R6, R4
0x0000028f367969a0 MOV R7, R5
0x0000028f367969b0 SHF.L.U64.HI R7, R6, 0x2, R7
0x0000028f367969c0 SHF.L.U32 R6, R6, 0x2, RZ
0x0000028f367969d0 IADD3 R6, P0, R18, R6, RZ
0x0000028f367969e0 IADD3.X R7, R17, R7, RZ, P0, !PT
0x0000028f367969f0 MOV R6, R6
0x0000028f36796a00 MOV R7, R7
0x0000028f36796a10 MOV R6, R6
0x0000028f36796a20 MOV R7, R7
0x0000028f36796a30 ST.E.SYS [R6], R25
Or, for the other version of the kernel:
atomicExch(target + index, repValue);
0x000001ee187a7710 MOV R4, R44
0x000001ee187a7720 MOV R5, R45
0x000001ee187a7730 SHF.L.U64.HI R5, R4, 0x2, R5
0x000001ee187a7740 SHF.L.U32 R4, R4, 0x2, RZ
0x000001ee187a7750 IADD3 R4, P0, R18, R4, RZ
0x000001ee187a7760 IADD3.X R5, R17, R5, RZ, P0, !PT
0x000001ee187a7770 MOV R4, R4
0x000001ee187a7780 MOV R5, R5
0x000001ee187a7790 MOV R6, R25
0x000001ee187a77a0 MOV R20, 0x0
0x000001ee187a77b0 MOV R21, 0x0
0x000001ee187a77c0 CALL.ABS.NOINC 0x0
Is the explanation provided by Tegra in the other thread really the whole story? Since repValue is in a local register, I'm really having trouble understanding how it can be so much faster…
(Just to make it clear: dead code elimination is properly performed by nvcc, so each template instantiation contains only one of the two store paths.)
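To double-check that, the SASS of each instantiation can also be inspected outside the debugger with cuobjdump (the binary name here is a placeholder):

```shell
# Dump the SASS for every kernel in the compiled binary;
# grep confirms each instantiation has exactly one of the two stores.
cuobjdump -sass ./myApp | grep -E "myKernel|ATOM|ST\.E"
```
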