CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

Originally published at: CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics | NVIDIA Technical Blog

Note: This post has been updated (November 2017) for CUDA 9 and the latest GPUs. The NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort. In fact, the code generated by the compiler is faster than manually written warp-aggregation code.…
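For reference, here is a minimal sketch of the kind of manual warp-aggregated increment the post describes, written with CUDA 9 cooperative groups. Treat it as an illustration of the pattern rather than the post's exact listing:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Warp-aggregated increment: the active threads elect a leader, which
// performs a single atomicAdd on behalf of the whole group; every
// thread then recovers its own slot from the broadcast base offset.
__device__ int atomicAggInc(int *ctr) {
    cg::coalesced_group g = cg::coalesced_threads();
    int warp_res;
    if (g.thread_rank() == 0)
        warp_res = atomicAdd(ctr, g.size());      // one atomic per group
    return g.shfl(warp_res, 0) + g.thread_rank(); // per-thread offset
}
```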

If 64-bit integer atomics are used but the increment value is the same across all threads in a warp, you don't really need a warp reduction: you can use the same trick with __popc() and do this part in 32 bits. The performance with 64-bit atomics in the global case (both non-aggregated and aggregated) is very similar (within roughly 1%) to the 32-bit case.
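A hedged sketch of that trick (the helper name and exact formulation are mine, assuming a 1D thread block and a uniform increment across the warp): the lane count is computed with 32-bit __popc(), and only the leader issues the 64-bit atomic.

```cuda
// 64-bit counter, uniform increment: count active lanes in 32 bits
// with __popc(), then issue a single 64-bit atomicAdd per warp.
__device__ unsigned long long
atomicAggAdd64(unsigned long long *ctr, unsigned long long inc) {
    unsigned mask = __activemask();      // active lanes in this warp
    int lane   = threadIdx.x & 31;       // lane id (1D block assumed)
    int leader = __ffs(mask) - 1;        // lowest active lane
    unsigned long long warp_res;
    if (lane == leader)
        warp_res = atomicAdd(ctr, (unsigned long long)__popc(mask) * inc);
    warp_res = __shfl_sync(mask, warp_res, leader);  // broadcast base
    // Rank among active lanes = number of set bits below my lane.
    return warp_res + __popc(mask & ((1u << lane) - 1)) * inc;
}
```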

If double-precision atomics must be used globally but the increment value is the same, a warp reduction is again unnecessary. However, you will have to emulate atomicAdd() through compare-and-swap (e.g., like here: http://docs.nvidia.com/cuda... ), and its performance will be low due to the high cost of each conflict. In the example from the post, with 50% of values passing the filter on a K40, throughput is 0.001 GiB/s and 0.108 GiB/s for the global non-aggregated and aggregated cases, respectively. Aggregation thus improves performance by a factor of 100, but the absolute numbers are very low compared to atomics supported in hardware. The shared-memory version is faster in this case: it achieves 0.637 GiB/s and 1.168 GiB/s for the non-aggregated and aggregated cases, respectively.
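For completeness, this is the compare-and-swap emulation of double-precision atomicAdd() shown in the CUDA C Programming Guide (needed on GPUs without native FP64 atomics, i.e. before compute capability 6.0); every failed CAS iteration is a retry, which is why conflicts are so expensive:

```cuda
// Emulated double-precision atomicAdd via atomicCAS on the 64-bit
// integer representation, as shown in the CUDA C Programming Guide.
__device__ double atomicAddDouble(double *address, double val) {
    unsigned long long *address_as_ull = (unsigned long long *)address;
    unsigned long long old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
        // Retry until no other thread modified the value in between.
    } while (assumed != old);
    return __longlong_as_double(old);
}
```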

If increment values are different and you need to do a reduction across the warp, there will be additional overhead. I expect it, however, to be relatively low in the case of 64-bit floating-point atomics compared to the cost of the atomics themselves. I have no specific performance numbers here, and believe this is well beyond the scope of this blog post. An example of warp-level reduction for 32-bit integers is available here: http://docs.nvidia.com/cuda... , and it can be easily extended to the 64-bit case, though you'll need two shuffle instructions to exchange a 64-bit value between threads in the same warp.
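A sketch of that style of warp-level sum reduction for 32-bit integers, assuming a fully active warp (with the CUDA 9 *_sync shuffle overloads, 64-bit types are exchanged for you; with the older 32-bit-only __shfl intrinsics you would split each exchange into two shuffles, as noted above):

```cuda
// Shuffle-based warp sum: halve the stride each step until lane 0
// holds the total. Assumes all 32 lanes of the warp are active.
__device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;   // result is valid in lane 0
}
```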

What is the license of the provided source code?

Andrew, many thanks for this algorithm. I was coding something like it (using ballot and popc), and you saved me many hours of work.

I've read some old (2009) papers attempting to use this approach before ballot came to CUDA. Can you point me to some papers related to your algorithm?

We made it for OpenGL.

Well, in the graphs you posted, why does everything seem to slow down when the cooperative groups technique is implemented? By contrast, shared memory seems to be the best approach?

In the sample code in the Performance Comparison section, 'nres' is undefined. It should be passed as a function argument.

I am using a Jetson Xavier (Volta architecture) to test the performance of atomicAdd() to shared vs. global memory, as you described in the blog. I found that adding in shared memory is even 10% slower than global. Any possible reasons?

Hi wen14211124,

Not sure if this is a Jetson AGX Xavier-related issue, but we recommend you raise it on the respective platform via the link below:
Latest Jetson & Embedded Systems/Jetson AGX Xavier topics - NVIDIA Developer Forums
Thanks!