atomicExch faster than gmem assign

Hi !

I know this question can be found in several places (most notably Why does the atomicAdd work faster than ‘+=’?) but I’d really like to understand this better.

My kernel (occupancy of 32 warp / SM, compiled with compute_61, RTX 2070), looks like this:

template<typename T, bool exch>
    void __global__ myKernel(T* target, ...)
     // Calculate index somehow
    const size_t = // calculate index
    const float repValue = // calculate value
    if (exch) {
       atomicExch(target + index, repValue);
    } else {
      target[index] = repValue;

Now, there’s a 33% performance increase of the whole kernel when using atomicExch.

The PTX looks like one might expect, if I understand well

target[index] = repValue;
0x0000028f36796990               MOV R6, R4  
0x0000028f367969a0               MOV R7, R5  
0x0000028f367969b0               SHF.L.U64.HI R7, R6, 0x2, R7  
0x0000028f367969c0               SHF.L.U32 R6, R6, 0x2, RZ  
0x0000028f367969d0               IADD3 R6, P0, R18, R6, RZ  
0x0000028f367969e0               IADD3.X R7, R17, R7, RZ, P0, !PT  
0x0000028f367969f0               MOV R6, R6  
0x0000028f36796a00               MOV R7, R7  
0x0000028f36796a10               MOV R6, R6  
0x0000028f36796a20               MOV R7, R7  
0x0000028f36796a30               ST.E.SYS [R6], R25

Or, for the other version of the function

atomicExch(target + index, repValue);
0x000001ee187a7710               MOV R4, R44  
0x000001ee187a7720               MOV R5, R45  
0x000001ee187a7730               SHF.L.U64.HI R5, R4, 0x2, R5  
0x000001ee187a7740               SHF.L.U32 R4, R4, 0x2, RZ  
0x000001ee187a7750               IADD3 R4, P0, R18, R4, RZ  
0x000001ee187a7760               IADD3.X R5, R17, R5, RZ, P0, !PT  
0x000001ee187a7770               MOV R4, R4  
0x000001ee187a7780               MOV R5, R5  
0x000001ee187a7790               MOV R6, R25  
0x000001ee187a77a0               MOV R20, 0x0  
0x000001ee187a77b0               MOV R21, 0x0  
0x000001ee187a77c0               CALL.ABS.NOINC 0x0  

Is the explanation provided by Tegra in the other thread really the whole story ? Since repValue is in a local register, I’m really having trouble to understand how it can be so much faster…
(Just to make it clear, dead code elimination is properly performed by nvcc)