CUDA - atomicAdd Efficiency issues

Has CUDA made any optimizations for atomic functions? Profiling with Nsight Compute in a scenario with no thread contention shows that atomicAdd outperforms += in both compute throughput and memory throughput. In addition, the percentage of uncoalesced global accesses is lower for atomicAdd (20%) than for += (30%).
Is this normal? Here's my code.

// Grid-stride loop over flattened (ray, group) pairs; tid is this thread's flattened work index.
while (ray_id < *_numRayAzim_GPU && group_id < *_numGroup_GPU) {
    // Walk all segments belonging to this azimuthal ray.
    for (int seg_id = _sumNumSegForRayAzim_GPU[ray_id]; seg_id < _sumNumSegForRayAzim_GPU[ray_id+1]; ++seg_id) {
        seg_id_global_azim = _segIDGlobalAzim_GPU[seg_id];

        for (int polar_id = 0; polar_id < *_numPolar_GPU; ++polar_id) {
            // Accumulate the angular flux contribution for this (group, polar, segment) entry.
            index = group_id * P_S + polar_id * (*_numSegForAzim_GPU) + seg_id_global_azim;
            _fsrAveAngFlux_TEMP_GPU[index] += _delta;
            // atomicAdd(&_fsrAveAngFlux_TEMP_GPU[index], _delta);
        }
    }

    // Advance to the next (ray, group) pair handled by this thread.
    tid += blockDim.x * gridDim.x;
    ray_id = (int)tid / (*_numGroup_GPU);
    group_id = (int)tid % (*_numGroup_GPU);
}

AFAIK, improvements have been made to atomics in this area over the years.

If the atomicAdd is processed in some memory or caching layer instead of in the SM, then no data has to be returned to the SM, and the program can simply continue.
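One way to see this from the kernel side is whether the returned old value is consumed. As far as I can tell, when the return value of atomicAdd is ignored the compiler is free to emit a fire-and-forget reduction (PTX red.global.add), whereas consuming the old value forces a full atom.global.add that has to ship data back to the SM. Here is a minimal sketch (kernel and variable names are mine, not from the code above); the difference can be checked by inspecting the generated PTX/SASS with cuobjdump or Nsight Compute:

__global__ void add_no_return(float *acc, float delta) {
    // Old value ignored: the add can be handed off to the memory side
    // as a fire-and-forget reduction; the thread just continues.
    atomicAdd(acc, delta);
}

__global__ void add_with_return(float *acc, float delta, float *old_out) {
    // Old value consumed: this is a full read-modify-write whose result
    // must travel back to the SM before it can be stored.
    float old = atomicAdd(acc, delta);
    *old_out = old;
}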

From the CUDA Programming Guide:

Regardless of the setting of the compiler flag -ftz,
▶ atomic single-precision floating-point adds on global memory always operate in flush-to-zero mode, i.e., behave equivalent to FADD.F32.FTZ.RN,
▶ atomic single-precision floating-point adds on shared memory always operate with denormal support, i.e., behave equivalent to FADD.F32.RN

This functional difference suggests to me that at least some variants of atomicAdd are being performed by adders in the memory hierarchy (e.g. inside the memory controllers), rather than using the adders present in the SM.
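If you want to observe that functional difference directly, here is a small probe kernel (entirely my own sketch, with *g_acc initialized to 0.0f on the host and a single-block launch such as <<<1, 32>>>). If the quoted behavior holds, the global accumulator should stay at 0.0f because the subnormal input is flushed to zero, while the shared accumulator should end up nonzero:

__global__ void denormalProbe(float *g_acc, float *s_result) {
    __shared__ float s_acc;
    if (threadIdx.x == 0) s_acc = 0.0f;
    __syncthreads();

    // Smallest positive single-precision subnormal (bit pattern 0x00000001).
    const float denorm = __int_as_float(1);

    // Global atomic add: per the guide, always flush-to-zero (FADD.F32.FTZ.RN),
    // so the subnormal contribution should vanish.
    atomicAdd(g_acc, denorm);

    // Shared atomic add: per the guide, keeps denormal support (FADD.F32.RN),
    // so the subnormal contribution should survive.
    atomicAdd(&s_acc, denorm);

    __syncthreads();
    if (threadIdx.x == 0) *s_result = s_acc;
}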
