atomicAdd and concurrent kernels


I have 2 kernels that each compute something and add their results into the same array in global memory. They both use atomicAdd to do so. If those kernels are executed concurrently using 2 streams, the result is noticeably different from the non-concurrent result (a difference on the order of 10^-2 or 10^-3, which is quite high compared to float precision).

The programming guide clearly states: “The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.”

Is it guaranteed to work between kernels too? I would think so, but if not, that could explain the different results I get.

Thanks for your help.

One likely issue is the undefined order in which a sum computed by atomic addition of floating-point numbers is evaluated. Floating-point addition is not associative, so a different order in the summation can lead to different results. How much of a difference in the final result can occur due to differences in evaluation order is a function of sign and magnitude of the individual summands.
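To make the order dependence concrete, here is a minimal sketch in Python. It assumes that rounding a double-precision sum back to single precision after every addition is a fair stand-in for float addition on the GPU; the helper names `f32` and `sum_f32` are made up for this illustration. The same four summands are accumulated in three different orders.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Accumulate left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

# The same four summands in three different orders.
order_a = [1e8, 1.0, -1e8, 1.0]
order_b = [1e8, -1e8, 1.0, 1.0]
order_c = [1.0, 1.0, 1e8, -1e8]

print(sum_f32(order_a))  # 1.0
print(sum_f32(order_b))  # 2.0
print(sum_f32(order_c))  # 0.0
```

Near 1e8 the spacing between adjacent float values is 8, so a 1.0 or 2.0 added to 1e8 is rounded away, while the same values added after the large terms have cancelled survive. With atomicAdd the order is decided by scheduling, so which of these outcomes occurs is not under the program's control.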

The code may have other issues, but without a buildable repro case it is impossible to tell.

Yes, that is exactly what I wanted to know: whether the difference is simply due to the different order in which the additions occur, or whether, because of some current hardware limitation, atomicAdd is not atomic between concurrent kernels.

By setting the environment variable CUDA_LAUNCH_BLOCKING to 1, one can disable all kernel concurrency. If I compare different runs in this configuration, I do see differences too, of the kind you are talking about (due to the order of the additions). But the difference is small, on the order of 10^-6 or 10^-7.

If I allow my 2 kernels to be executed concurrently, the difference becomes larger, which leads me to believe that not all additions are carried out, that is, that atomicAdd is not atomic between concurrent kernels. But I’m not 100% sure, so I was wondering whether this is a known limitation.

By the way, I’m on Linux, using CUDA 5.5 and a GTX Titan (compiling for compute capability 3.5).

(Edit: Posted before reading the previous response. This may be irrelevant for this problem, but interesting in general.)

Due to the nature of floating point, fractional errors usually grow predictably for multiplication and division, but not for addition and subtraction. It is quite easy to generate large errors with floating point addition when adding numbers of very different magnitudes.

I coded up a quick example in Python to demonstrate here:
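The example itself is not included in this text. As a stand-in, and explicitly not the original poster's code, a minimal sketch of the effect could look like the following. It assumes that rounding to single precision after each addition models float arithmetic, and the helper name `f32` is made up for this illustration.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

big = 16777216.0       # 2**24; above this, adjacent float32 values are 2 apart
small = [1.0] * 1000   # many contributions of much smaller magnitude

# Large value first: each 1.0 is rounded away when added to 2**24.
acc = big
for s in small:
    acc = f32(acc + s)
large_first = acc

# Small values first: they accumulate exactly, then the large value is added.
acc = 0.0
for s in small:
    acc = f32(acc + s)
small_first = f32(acc + big)

print(large_first)   # 16777216.0
print(small_first)   # 16778216.0
```

Both loops add the same 1001 numbers, yet the results differ by 1000, because a small summand added to a much larger running total can be lost entirely.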

As a sanity check, can you change your global array into an integer array and use the integer atomicAdd() to count how many threads update the element? If that value changes between your two scenarios, then there is something fundamentally wrong.

Thanks for your example in Python. The thing is, my 2 kernels both add contributions of the same order of magnitude.

But I did a test in which both kernels add 1.0f, and compared the results with and without concurrent execution. The result is identical, so I guess atomicAdd remains atomic between kernels. That would make sense: I guess it does not matter whether a warp comes from one kernel or another.

But what bothers me is that with CUDA_LAUNCH_BLOCKING=1, I obtain reproducible results in a given integration test, up to 10^-6 at least. On every run, this figure does not move. But if I enable concurrent kernels, not only do the results now vary from run to run, but there is also a difference on the order of 10^-3 compared to the result without concurrent kernels. So there’s probably something wrong. I’ll keep looking.

Thanks njuffa and seibert for your help.