Using Atomic Functions Still Leads to a Race Condition

Hello Everyone,

I am writing a fast computational acoustics code in CUDA. Because multiple threads may access the same memory location, I decided to use the atomic add function. However, it seems that even though I use the atomic add function, a race condition still occurs. Could anyone please explain how this can happen in general? I can follow up with the code if a general explanation is not adequate. Thanks!

Best,
Ziqi