Why are Atomics discouraged?

For operations such as reductions and scans, using atomics (on the K20 at least) seems to perform as well as the good shared-memory implementations (avoiding bank conflicts, sequential addressing, unrolling the last warp, etc.).

I am not using a single value (location) that is updated by the atomic operation; rather, each thread block atomically updates its own value in global memory. I then use another reduction/scan to get the best of that bunch.

Sometimes it is a little better, sometimes a little worse, but overall much better than advertised. Atomics are also much easier to code with, which reduces code bloat.
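
For context, here is a minimal sketch of the pattern I mean (illustrative names, using max as the per-block operation; my real kernels differ): each block atomically folds its threads' values into its own slot in global memory, and a second pass then reduces the per-block slots.

__device__ int finalResult;

__global__ void blockAtomicMax(const int* in, int* perBlock, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n)
      atomicMax(&perBlock[blockIdx.x], in[i]);   // perBlock[] initialized to INT_MIN on the host
}

__global__ void finalMax(const int* perBlock, int numBlocks)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < numBlocks)
      atomicMax(&finalResult, perBlock[i]);      // finalResult initialized to INT_MIN on the host
}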

Why is the use of atomics so discouraged when performance (on very large data sets, n > 2^28) tends to be good?
Is this specific to the Kepler class?

In part, yes.

The discouragement of atomics is somewhat outdated now thanks to evolution of the architecture. When atomics were first introduced in compute capability 1.1 hardware, they were really slow. Parallel reductions were almost always a better choice in those days.

With the addition of the L2 cache in Fermi, atomic operations could be serviced directly on the chip without necessarily having to access device memory, which brought a factor of ~10 improvement in performance. Kepler improved the atomics implementation again, giving another factor of 7 or so in some cases. I now use atomics for all my histogram implementations because the code is so much more readable than the alternatives, and it runs pretty fast.
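
A simplified sketch of what I mean (not my actual code; a 256-bin byte histogram with a grid-stride loop):

__global__ void histogram256(const unsigned char* data, unsigned int* bins, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = gridDim.x * blockDim.x;
   for (; i < n; i += stride)
      atomicAdd(&bins[data[i]], 1u);   // bins[] zero-initialized; 256 * 4 bytes fits easily in L2
}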

Atomics will not perform quite as well if the words being atomically updated do not all fit into the L2 cache. I don’t have any experience with this case, as I generally work with histograms that are only tens of kB.

I’ll second Seibert’s reply.

Based on the presentation I saw at GTC 2013, atomics perform very well on Fermi and Kepler.

Here are links to the session “S3101 - Understanding and Using Atomic Memory Operations”:

View PDF
View Recording

Thanks for the info.

Many of the popular CUDA papers/documentation/books are from a few years back and take a harder line against atomics. People just starting out with GPU programming might not realize that atomics can be a good option on more recent GPUs.

Seibert is right about the code readability: atomic use can save 10-30 lines of code per kernel (especially if you are unrolling loops, etc.).

Glad to see such features and improvements in the newer NVIDIA GPUs; looking forward to the next generation.

I think that a lot of comments that discourage the use of atomics are really targeted
at naive uses of atomics rather than atomics in general.

I typically do not like explaining them to new programmers, because
it isn’t simple to explain why something like this:

__device__ int reductionValue;

__global__ void reduction(int* array)
{
   atomicAdd(&reductionValue, array[threadIdx.x]);
}

isn’t actually a parallel program.

Atomics can be an effective tool if you understand their limitations, but you have to
be careful about introducing them into a parallel program because they do specify operations
that will be performed serially.
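
A sketch of the kind of compromise I mean (assuming a power-of-two block size and dynamic shared memory of blockDim.x * sizeof(int) at launch): do the tree reduction in shared memory within each block, and only issue one atomic update per block rather than one per thread.

__global__ void reduction(const int* array, int* result, int n)
{
   extern __shared__ int sdata[];
   int tid = threadIdx.x;
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   sdata[tid] = (i < n) ? array[i] : 0;
   __syncthreads();

   // tree reduction in shared memory
   for (int s = blockDim.x / 2; s > 0; s >>= 1) {
      if (tid < s)
         sdata[tid] += sdata[tid + s];
      __syncthreads();
   }

   if (tid == 0)
      atomicAdd(result, sdata[0]);   // only one serialized update per block
}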

One reason I tend to not use atomic operations is that I prefer programs that give bit-by-bit identical results each time they are run, and atomic operations by their very nature cannot guarantee that.

That depends on what you’re doing with atomics. If you’re using them to compute a maximum, you should get bit-by-bit identical results on any datatype. If you’re doing addition, then it should still be ok for integers.

You only run into problems with non-associative operations - mainly floating point addition.
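
A small sketch of the distinction (an illustrative kernel, not from any of the codes discussed here):

__global__ void accumulate(const int* iv, const float* fv, int* isum, float* fsum, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) {
      atomicAdd(isum, iv[i]);   // integer add is associative: bit-identical result every run
      atomicAdd(fsum, fv[i]);   // float add rounds, so the result depends on the update order
   }
}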

In my experience, atomics are still much slower than a proper tree reduction. Why? Because of collisions and instruction overhead. I looked at the Kepler assembly code for an atomic add and found that it’s implemented as a do-while loop consisting of 4 instructions:

/*0188*/     /*0x00b31c85c4000000*/ 	LDSLK P0, R12, [R11];
/*0190*/     /*0x04c300034800c000*/ 	@P0 IADD R12, R12, 0x1;
/*0198*/     /*0x00b30085cc000000*/ 	@P0 STSUL [R11], R12;
/*01a0*/     /*0x800021e74003ffff*/ 	@!P0 BRA 0x188;

If another thread modifies the same memory location that this thread is trying to update between the linked load and the conditional store, then this thread has to execute the 4 instructions all over again. So if every thread in a warp tries to update the same location, one unlucky thread will need to execute the loop 32 times, and it may slow down the progress of the finished threads (an interesting scheduling problem: do you prioritize the finished threads, or the unlucky thread(s)?).
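
For comparison, the same load/modify/conditional-store retry pattern shows up at the software level whenever an atomic has to be emulated with atomicCAS, for example the well-known double-precision atomicAdd workaround from the programming guide (renamed here to avoid overload confusion):

__device__ double atomicAddDouble(double* address, double val)
{
   unsigned long long int* addr = (unsigned long long int*)address;
   unsigned long long int old = *addr, assumed;
   do {
      assumed = old;
      old = atomicCAS(addr, assumed,
                      __double_as_longlong(val + __longlong_as_double(assumed)));
   } while (assumed != old);   // another thread changed the value in the meantime; try again
   return __longlong_as_double(old);
}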

According to the Kepler data sheet, the throughput of atomic operations is 64 per cycle for the entire chip when there are no conflicts, which is much lower than the shared memory bandwidth. This figure seems consistent with the 4 assembly instructions used: load/store throughput is 32 per SM per cycle, multiplied by 8 SMs and divided by the 4 instructions per update gives 64.

Does anyone know if it’s practical to implement a tree reduction in hardware so that you can handle multiple atomic updates to the same location in 1 clock cycle?

This is a good point. The applicability of atomics depends on the overall rate of atomic updates and the probability of collisions. Newer hardware has broadened the range of applicability, but atomics should not be used without appreciating the limitations. That’s probably the best reason to teach beginning CUDA programmers about tree reductions.