Has CUDA made any optimizations for atomic functions? Profiling with Nsight Compute in a scenario with no thread contention, I found that atomicAdd outperforms a plain += in both compute throughput and memory throughput. The percentage of uncoalesced global accesses is also lower with atomicAdd (20%) than with += (30%).
Is this normal? Here is my code.
// Grid-stride loop: each thread processes one (ray, group) pair per iteration.
while (ray_id < *_numRayAzim_GPU && group_id < *_numGroup_GPU) {
    // Walk the segments belonging to this azimuthal ray.
    for (int seg_id = _sumNumSegForRayAzim_GPU[ray_id]; seg_id < _sumNumSegForRayAzim_GPU[ray_id + 1]; ++seg_id) {
        seg_id_global_azim = _segIDGlobalAzim_GPU[seg_id];
        for (int polar_id = 0; polar_id < *_numPolar_GPU; ++polar_id) {
            index = group_id * P_S + polar_id * (*_numSegForAzim_GPU) + seg_id_global_azim;
            _fsrAveAngFlux_TEMP_GPU[index] += _delta;               // plain read-modify-write
            // atomicAdd(&_fsrAveAngFlux_TEMP_GPU[index], _delta);  // atomic variant being compared
        }
    }
    // Advance this thread to its next (ray, group) pair.
    tid += blockDim.x * gridDim.x;
    ray_id = (int)tid / (*_numGroup_GPU);
    group_id = (int)tid % (*_numGroup_GPU);
}
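
For reference, below is a minimal, self-contained sketch of the kind of no-contention comparison described above. It is not the kernel above; the kernel and variable names (plainAdd, atomicAddKernel, n) are hypothetical, and each thread updates distinct elements so the atomics never collide. Each launch can be profiled separately in Nsight Compute to compare the throughput and uncoalesced-access metrics.

#include <cstdio>
#include <cuda_runtime.h>

// Ordinary read-modify-write: one thread per element, no contention.
__global__ void plainAdd(float* buf, float delta, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        buf[i] += delta;
}

// Same access pattern, but using atomicAdd on distinct elements.
__global__ void atomicAddKernel(float* buf, float delta, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        atomicAdd(&buf[i], delta);
}

int main() {
    const int n = 1 << 24;
    float* buf;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msPlain = 0.f, msAtomic = 0.f;

    // Time the plain += kernel.
    cudaEventRecord(start);
    plainAdd<<<256, 256>>>(buf, 1.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msPlain, start, stop);

    // Time the atomicAdd kernel.
    cudaEventRecord(start);
    atomicAddKernel<<<256, 256>>>(buf, 1.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msAtomic, start, stop);

    printf("plain += : %.3f ms, atomicAdd: %.3f ms\n", msPlain, msAtomic);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(buf);
    return 0;
}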