atomicExch faster than gmem assign

norman.pellet · March 30, 2020, 4:47pm

Hi !

I know this question can be found in several places (most notably Why does the atomicAdd work faster than â€˜+=â€™? - #3 by Alexander07K) but I’d really like to understand this better.

My kernel (occupancy of 32 warp / SM, compiled with compute_61, RTX 2070), looks like this:

template<typename T, bool exch>
    void __global__ myKernel(T* target, ...)
     // Calculate index somehow
    const size_t = // calculate index
    const float repValue = // calculate value
    if (exch) {
       atomicExch(target + index, repValue);
    } else {
      target[index] = repValue;
    }
}

Now, there’s a 33% performance increase of the whole kernel when using atomicExch.

The PTX looks like one might expect, if I understand well

target[index] = repValue;
0x0000028f36796990               MOV R6, R4  
0x0000028f367969a0               MOV R7, R5  
0x0000028f367969b0               SHF.L.U64.HI R7, R6, 0x2, R7  
0x0000028f367969c0               SHF.L.U32 R6, R6, 0x2, RZ  
0x0000028f367969d0               IADD3 R6, P0, R18, R6, RZ  
0x0000028f367969e0               IADD3.X R7, R17, R7, RZ, P0, !PT  
0x0000028f367969f0               MOV R6, R6  
0x0000028f36796a00               MOV R7, R7  
0x0000028f36796a10               MOV R6, R6  
0x0000028f36796a20               MOV R7, R7  
0x0000028f36796a30               ST.E.SYS [R6], R25

Or, for the other version of the function

atomicExch(target + index, repValue);
0x000001ee187a7710               MOV R4, R44  
0x000001ee187a7720               MOV R5, R45  
0x000001ee187a7730               SHF.L.U64.HI R5, R4, 0x2, R5  
0x000001ee187a7740               SHF.L.U32 R4, R4, 0x2, RZ  
0x000001ee187a7750               IADD3 R4, P0, R18, R4, RZ  
0x000001ee187a7760               IADD3.X R5, R17, R5, RZ, P0, !PT  
0x000001ee187a7770               MOV R4, R4  
0x000001ee187a7780               MOV R5, R5  
0x000001ee187a7790               MOV R6, R25  
0x000001ee187a77a0               MOV R20, 0x0  
0x000001ee187a77b0               MOV R21, 0x0  
0x000001ee187a77c0               CALL.ABS.NOINC 0x0

Is the explanation provided by Tegra in the other thread really the whole story ? Since repValue is in a local register, I’m really having trouble to understand how it can be so much faster…
(Just to make it clear, dead code elimination is properly performed by nvcc)

Topic		Replies	Views
atomic read or write CUDA Programming and Performance	3	4176	July 15, 2009
atomic add operation CUDA Programming and Performance	2	4388	July 22, 2014
Poor atomicCAS performance on shared memory CUDA Programming and Performance	2	639	December 5, 2019
How to use PTX prefetch.global with ASM? compiles but do not see prefetch instruction with cuobjdump CUDA Programming and Performance	7	5193	May 7, 2012
Why does the atomicAdd work faster than â€˜+=â€™? CUDA Programming and Performance	3	12189	November 2, 2011
Question about AtomicExch() CUDA Programming and Performance	2	4031	March 31, 2013
atomicAdd & atomicExch in a particle simulation CUDA Programming and Performance	3	10201	April 21, 2011
Atomic Operations CUDA Programming and Performance	4	4598	November 11, 2015
Strange Performance Problem using atomicAdd CUDA Programming and Performance	4	1605	September 12, 2010
Is atomicExch() safe for incremental a global float array? CUDA Programming and Performance	13	18289	July 6, 2009

atomicExch faster than gmem assign

Related topics