Why are Atomics discouraged?

For operations such as reductions and scans, using atomics (on the K20 at least) seems to perform as well as the good shared-memory implementations (avoiding bank conflicts, sequential addressing, unrolling the last warp, etc.).

I am not using a single value (location) that is updated by the atomic operation; rather, each thread block atomically updates its own value in global memory. I then use another reduction/scan to get the best of that bunch.

Sometimes it is a little better, sometimes a little worse, but overall much better than advertised. Atomics are also much easier to code with, which reduces code bloat.
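
For context, here is a minimal sketch of the pattern I mean (illustrative names, using max as the per-block operation; my real kernels differ): each block atomically folds its threads' values into its own slot in global memory, and a second pass then reduces the per-block slots.

__device__ int finalResult;

__global__ void blockAtomicMax(const int* in, int* perBlock, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n)
      atomicMax(&perBlock[blockIdx.x], in[i]);   // perBlock[] initialized to INT_MIN on the host
}

__global__ void finalMax(const int* perBlock, int numBlocks)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < numBlocks)
      atomicMax(&finalResult, perBlock[i]);      // finalResult initialized to INT_MIN on the host
}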

Why is the use of atomics so discouraged when performance (on very large data sets, n > 2^28) tends to be good?
Is this specific to the Kepler class?

In part, yes.

The discouragement of atomics is somewhat outdated now thanks to evolution of the architecture. When atomics were first introduced in compute capability 1.1 hardware, they were really slow. Parallel reductions were almost always a better choice in those days.

With the addition of the L2 cache in Fermi, atomic operations could be serviced directly on the chip without necessarily having to access device memory, which brought a factor of ~10 improvement in performance. Kepler improved the atomics implementation again, giving another factor of 7 or so in some cases. I now use atomics for all my histogram implementations because the code is so much more readable than the alternatives, and it runs pretty fast.
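
A simplified sketch of what I mean (not my actual code; a 256-bin byte histogram with a grid-stride loop):

__global__ void histogram256(const unsigned char* data, unsigned int* bins, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = gridDim.x * blockDim.x;
   for (; i < n; i += stride)
      atomicAdd(&bins[data[i]], 1u);   // bins[] zero-initialized; 256 * 4 bytes fits easily in L2
}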

Atomics will not perform quite as well if the words being atomically updated do not all fit into the L2 cache. I don’t have any experience with this case, as I generally work with histograms that are only tens of kB.

I’ll second Seibert’s reply.

Based on the presentation I saw at GTC 2013, atomics perform very well on Fermi and Kepler.

Here are links to the session “S3101 - Understanding and Using Atomic Memory Operations”:

View PDF
View Recording

Thanks for the info.

Many of the popular CUDA papers/documentation/books are from a few years back and take a harder line against atomics. People just starting out with GPU programming might not realize that atomics can be a good option on more recent GPUs.

Seibert is right about the code readability: atomic use can save 10-30 lines of code per kernel (especially if you are unrolling loops, etc.).

Glad to see such features and improvements in the newer NVIDIA GPUs; looking forward to the next generation.

I think that a lot of comments that discourage the use of atomics are really targeted
at naive uses of atomics rather than atomics in general.

I typically do not like explaining them to new programmers, because
it isn’t simple to explain why something like this:

__device__ int reductionValue;

__global__ void reduction(int* array)
{
   atomicAdd(&reductionValue, array[threadIdx.x]);
}

isn’t actually a parallel program.

Atomics can be an effective tool if you understand their limitations, but you have to
be careful about introducing them into a parallel program because they do specify operations
that will be performed serially.
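
A sketch of the kind of compromise I mean (assuming a power-of-two block size and dynamic shared memory of blockDim.x * sizeof(int) at launch): do the tree reduction in shared memory within each block, and only issue one atomic update per block rather than one per thread.

__global__ void reduction(const int* array, int* result, int n)
{
   extern __shared__ int sdata[];
   int tid = threadIdx.x;
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   sdata[tid] = (i < n) ? array[i] : 0;
   __syncthreads();

   // tree reduction in shared memory
   for (int s = blockDim.x / 2; s > 0; s >>= 1) {
      if (tid < s)
         sdata[tid] += sdata[tid + s];
      __syncthreads();
   }

   if (tid == 0)
      atomicAdd(result, sdata[0]);   // only one serialized update per block
}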

One reason I tend to not use atomic operations is that I prefer programs that give bit-by-bit identical results each time they are run, and atomic operations by their very nature cannot guarantee that.

That depends on what you’re doing with atomics. If you’re using them to compute a maximum, you should get bit-by-bit identical results on any datatype. If you’re doing addition, then it should still be ok for integers.

You only run into problems with non-associative operations - mainly floating point addition.
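
A small sketch of the distinction (an illustrative kernel, not from any of the codes discussed here):

__global__ void accumulate(const int* iv, const float* fv, int* isum, float* fsum, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) {
      atomicAdd(isum, iv[i]);   // integer add is associative: bit-identical result every run
      atomicAdd(fsum, fv[i]);   // float add rounds, so the result depends on the update order
   }
}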

In my experience, atomics are still much slower than a proper tree reduction. Why? Because of collisions and instruction overhead. I looked at the Kepler assembly code for an atomic add and found that it’s implemented as a do-while loop consisting of 4 instructions:

/*0188*/     /*0x00b31c85c4000000*/ 	LDSLK P0, R12, [R11];
/*0190*/     /*0x04c300034800c000*/ 	@P0 IADD R12, R12, 0x1;
/*0198*/     /*0x00b30085cc000000*/ 	@P0 STSUL [R11], R12;
/*01a0*/     /*0x800021e74003ffff*/ 	@!P0 BRA 0x188;

If another thread modifies the same memory location that this thread is trying to update between the linked load and the conditional store, then this thread has to execute the 4 instructions all over again. So if every thread in a warp tries to update the same location, one unlucky thread will need to execute the loop 32 times, and it may slow down the progress of the finished threads (an interesting scheduling problem: do you prioritize the finished threads, or the unlucky thread(s)?).
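
For comparison, the same load/modify/conditional-store retry pattern shows up at the software level whenever an atomic has to be emulated with atomicCAS, for example the well-known double-precision atomicAdd workaround from the programming guide (renamed here to avoid overload confusion):

__device__ double atomicAddDouble(double* address, double val)
{
   unsigned long long int* addr = (unsigned long long int*)address;
   unsigned long long int old = *addr, assumed;
   do {
      assumed = old;
      old = atomicCAS(addr, assumed,
                      __double_as_longlong(val + __longlong_as_double(assumed)));
   } while (assumed != old);   // another thread changed the value in the meantime; try again
   return __longlong_as_double(old);
}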

According to the Kepler data sheet, the throughput of atomic operations is 64 per cycle for the entire chip when there are no conflicts, which is much lower than the shared memory bandwidth. This figure seems consistent with the 4 assembly instructions used: load/store throughput is 32 per SM per cycle, multiplied by 8 SMs and divided by the 4 instructions per update gives 64.

Does anyone know if it’s practical to implement a tree reduction in hardware so that you can handle multiple atomic updates to the same location in 1 clock cycle?

This is a good point. The applicability of atomics depends on the overall rate of atomic updates and the probability of collisions. Newer hardware has broadened the range of applicability, but atomics should not be used without appreciating the limitations. That’s probably the best reason to teach beginning CUDA programmers about tree reductions.