CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

Originally published at: CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics | NVIDIA Technical Blog

Note: This post has been updated (November 2017) for CUDA 9 and the latest GPUs. The NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort. In fact, the code generated by the compiler is faster than manually written warp-aggregation code.…
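For reference, here is a minimal sketch of the kind of manual warp-aggregated increment the post describes, written with CUDA 9 cooperative groups. Treat it as an illustration of the pattern rather than the post's exact listing:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Warp-aggregated increment: the active threads elect a leader, which
// performs a single atomicAdd on behalf of the whole group; every
// thread then recovers its own slot from the broadcast base offset.
__device__ int atomicAggInc(int *ctr) {
    cg::coalesced_group g = cg::coalesced_threads();
    int warp_res;
    if (g.thread_rank() == 0)
        warp_res = atomicAdd(ctr, g.size());      // one atomic per group
    return g.shfl(warp_res, 0) + g.thread_rank(); // per-thread offset
}
```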

If 64-bit integer atomics are used but the increment value is the same across all threads in a warp, you don't really need a warp reduction: you can use the same trick with __popc() and do this part in 32 bits. The performance with 64-bit atomics in the global case (both non-aggregated and aggregated) is very similar (within roughly 1%) to the 32-bit case.
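A hedged sketch of that trick (the helper name and exact formulation are mine, assuming a 1D thread block and a uniform increment across the warp): the lane count is computed with 32-bit __popc(), and only the leader issues the 64-bit atomic.

```cuda
// 64-bit counter, uniform increment: count active lanes in 32 bits
// with __popc(), then issue a single 64-bit atomicAdd per warp.
__device__ unsigned long long
atomicAggAdd64(unsigned long long *ctr, unsigned long long inc) {
    unsigned mask = __activemask();      // active lanes in this warp
    int lane   = threadIdx.x & 31;       // lane id (1D block assumed)
    int leader = __ffs(mask) - 1;        // lowest active lane
    unsigned long long warp_res;
    if (lane == leader)
        warp_res = atomicAdd(ctr, (unsigned long long)__popc(mask) * inc);
    warp_res = __shfl_sync(mask, warp_res, leader);  // broadcast base
    // Rank among active lanes = number of set bits below my lane.
    return warp_res + __popc(mask & ((1u << lane) - 1)) * inc;
}
```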

If double-precision atomics must be used globally but the increment value is the same, a warp reduction is again unnecessary. However, you will have to emulate atomicAdd() through compare-and-swap (e.g., like here: http://docs.nvidia.com/cuda... ), and its performance will be low due to the high cost of each conflict. In the example from the post, with 50% of values passing the filter on a K40, throughput is 0.001 GiB/s and 0.108 GiB/s for the global non-aggregated and aggregated cases, respectively. Aggregation thus improves performance by a factor of 100, but the absolute numbers are very low compared to atomics supported in hardware. The shared-memory version is faster in this case: it achieves 0.637 GiB/s and 1.168 GiB/s for the non-aggregated and aggregated cases, respectively.
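For completeness, this is the compare-and-swap emulation of double-precision atomicAdd() shown in the CUDA C Programming Guide (needed on GPUs without native FP64 atomics, i.e. before compute capability 6.0); every failed CAS iteration is a retry, which is why conflicts are so expensive:

```cuda
// Emulated double-precision atomicAdd via atomicCAS on the 64-bit
// integer representation, as shown in the CUDA C Programming Guide.
__device__ double atomicAddDouble(double *address, double val) {
    unsigned long long *address_as_ull = (unsigned long long *)address;
    unsigned long long old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
        // Retry until no other thread modified the value in between.
    } while (assumed != old);
    return __longlong_as_double(old);
}
```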

If increment values are different and you need to do a reduction across the warp, there will be additional overhead. I expect it, however, to be relatively low in the case of 64-bit floating-point atomics compared to the cost of the atomics themselves. I have no specific performance numbers here, and believe this is well beyond the scope of this blog post. An example of warp-level reduction for 32-bit integers is available here: http://docs.nvidia.com/cuda... , and it can be easily extended to the 64-bit case, though you'll need two shuffle instructions to exchange a 64-bit value between threads in the same warp.
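A sketch of that style of warp-level sum reduction for 32-bit integers, assuming a fully active warp (with the CUDA 9 *_sync shuffle overloads, 64-bit types are exchanged for you; with the older 32-bit-only __shfl intrinsics you would split each exchange into two shuffles, as noted above):

```cuda
// Shuffle-based warp sum: halve the stride each step until lane 0
// holds the total. Assumes all 32 lanes of the warp are active.
__device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // result is valid in lane 0
}
```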

What is the license of the provided source code?

Andrew, many thanks for this algorithm. I was coding something like it (using ballot and popc), and you saved me many hours of work.

I've read some old (2009) papers attempting to use this approach before ballot came to CUDA. Can you point me to some papers related to your algorithm?

We made it for OpenGL.

Well, in the graphs you posted, why does everything seem to slow down when the cooperative groups technique is implemented? By contrast, shared memory seems to be the best approach?

In the sample code in the Performance Comparison section, 'nres' is undefined. It should be passed as a function argument.

I am using a Jetson Xavier (Volta architecture) to test the performance of atomicAdd() to shared vs. global memory, as you described in the blog. I found that adding in shared memory is even 10% slower than global. Any possible reasons?

Hi wen14211124,

Not sure if this is a Jetson AGX Xavier-related issue, but we recommend you raise it on the respective platform via the link below:
Latest Jetson & Embedded Systems/Jetson AGX Xavier topics - NVIDIA Developer Forums
Thanks!