atomicAdd and concurrent kernels

Morph208 · August 5, 2013, 8:34pm

Hi,

I have 2 kernels that each compute something and add their results into the same array in global memory. Both use atomicAdd to do so. When those kernels are executed concurrently using 2 streams, the result is quite different: the difference is on the order of 10^-2 or 10^-3, which is large compared to float precision.

The programming guide clearly states: “The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.”

Is it guaranteed to work between kernels too? I would think so, but if not, that could explain the different results I get.

Thanks for your help.

njuffa · August 5, 2013, 11:20pm

One likely issue is the undefined order in which a sum computed by atomic addition of floating-point numbers is evaluated. Floating-point addition is not associative, so a different order in the summation can lead to different results. How much of a difference in the final result can occur due to differences in evaluation order is a function of sign and magnitude of the individual summands.
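To make the point about evaluation order concrete, here is a minimal sketch in plain Python, not GPU code. It emulates float arithmetic by rounding to single precision after every addition; the helper names `f32` and `f32_sum` are made up for this illustration, and the three summands are chosen only to make the effect obvious.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_sum(values):
    """Accumulate left to right, rounding to single precision after each add."""
    total = f32(0.0)
    for v in values:
        total = f32(total + f32(v))
    return total

# Same three summands, two different orders.
in_order = f32_sum([1.0e8, 1.0, -1.0e8])    # (1e8 + 1) - 1e8
reordered = f32_sum([1.0e8, -1.0e8, 1.0])   # (1e8 - 1e8) + 1

print(in_order, reordered)  # prints "0.0 1.0"
```

In the first order the 1.0 is rounded away when it is added to 1e8, because neighboring single-precision values near 1e8 are 8 apart. In the second order the large terms cancel first, so the 1.0 survives. With atomics, the order is decided by thread scheduling, so either outcome is possible.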

The code may have other issues, but without a buildable repro case it is impossible to tell.

Morph208 · August 6, 2013, 1:48pm

Yes, that is exactly what I wanted to know: whether this is simply due to the different order in which the additions occur, or whether, because of current hardware limitations, atomicAdd is not atomic between concurrent kernels.

By setting the environment variable CUDA_LAUNCH_BLOCKING, one can disable all kernel concurrency. If I compare different runs in this configuration, I see differences too, the ones you are talking about (due to the order of the additions). But the difference is small, on the order of 10^-6 or 10^-7.

If I allow my 2 kernels to be executed concurrently, the difference becomes larger, which leads me to believe that not all additions are carried out, that is, that atomicAdd is not atomic between concurrent kernels. But I’m not 100% sure, so I was wondering if this is a known limitation.

By the way, I’m on Linux, using CUDA 5.5 and a GTX Titan (compiling for compute capability 3.5).

seibert · August 6, 2013, 2:40pm

(Edit: Posted before reading the previous response. This may be irrelevant for this problem, but interesting in general.)

Due to the nature of floating point, fractional errors usually grow predictably for multiplication and division, but not for addition and subtraction. It is quite easy to generate large errors with floating point addition when adding numbers of very different magnitudes.
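As a rough illustration of the "very different magnitudes" case (this is not the linked notebook, just a small stand-alone sketch), the snippet below emulates single precision in plain Python. The helper `to_f32` and both function names are hypothetical, and 2^24 is picked because it is the point where single precision can no longer represent every integer.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add_small_one_at_a_time(big, small, count):
    """Add `small` to `big` repeatedly, rounding after each addition."""
    total = to_f32(big)
    for _ in range(count):
        total = to_f32(total + small)
    return total

def add_small_presummed(big, small, count):
    """Sum the small terms first, then add the result to `big` once."""
    partial = to_f32(0.0)
    for _ in range(count):
        partial = to_f32(partial + small)
    return to_f32(to_f32(big) + partial)

big = 2.0 ** 24  # 16777216.0
print(add_small_one_at_a_time(big, 1.0, 10))  # prints 16777216.0
print(add_small_presummed(big, 1.0, 10))      # prints 16777226.0
```

Each individual 1.0 is lost when added to the large accumulator, while the same ten values survive if they are combined with each other first. The total error here is 10 out of about 1.7e7, far larger than a single rounding error would suggest.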

I coded up a quick example in Python to demonstrate here:

https://www.wakari.io/sharing/bundle/seibert/Float%20Precision

seibert · August 6, 2013, 2:45pm

As a sanity check, can you change your global array into an integer array and use the integer atomicAdd() to count how many threads update the element? If that value changes between your two scenarios, then there is something fundamentally wrong.
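The reasoning behind this check can be sketched with a toy model in plain Python. This is not GPU code and says nothing about how the hardware actually schedules threads: the function `run_two_workers` is hypothetical and simply hard-codes the worst-case interleaving of two workers that each add 1 to a shared cell.

```python
def run_two_workers(updates_per_worker, atomic):
    """Toy model: two workers each add 1 to a shared cell, step by step."""
    cell = 0
    for _ in range(updates_per_worker):
        if atomic:
            # Each read-modify-write finishes before the next one starts.
            cell = cell + 1  # worker A
            cell = cell + 1  # worker B
        else:
            # Both workers read the old value before either writes back,
            # so the second write overwrites the first.
            seen_by_a = cell
            seen_by_b = cell
            cell = seen_by_a + 1
            cell = seen_by_b + 1
    return cell

print(run_two_workers(1000, atomic=True))   # prints 2000
print(run_two_workers(1000, atomic=False))  # prints 1000
```

An integer count has no rounding, so it is the same for every ordering of the updates. Any shortfall from the expected total therefore means updates were lost, not merely reordered.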

Morph208 · August 6, 2013, 4:04pm

Thanks for your example in Python. The thing is, my 2 kernels both add contributions of the same order of magnitude.

But I did a test where both of my kernels add 1.0f, and I compared the results with and without concurrent execution. The result is identical, so I guess atomicAdd remains atomic between kernels. That would make sense: I guess it does not matter whether a warp comes from one kernel or another.

But what bothers me is that with CUDA_LAUNCH_BLOCKING=1, I obtain reproducible results in a given integration test, up to 10^-6 at least. On every run, this figure does not move. But if I enable concurrent kernels, not only do the results now vary from run to run, but there is also a difference on the order of 10^-3 compared to the result without concurrent kernels. So there’s probably something wrong. I’ll keep looking.

Thanks njuffa and seibert for your help.
