I am looking to use warp-aggregated operations for a simple global summation in CUDA. This is purely a learning exercise and I am not sure whether the performance would be worth it, but here goes.

Can anyone point me to an example of a reduction in CUDA that works at warp-level granularity? In other words, atomicAdd would be issued by only a single thread in each warp, rather than by every thread in every warp. Can this be done?
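To make it concrete, here is a minimal sketch of what I am imagining. The kernel and variable names are my own guesses; I am assuming a warp size of 32, a float accumulator, and that `__shfl_down_sync` is the right primitive for the in-warp reduction:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp reduces its 32 values in registers
// via shuffles, then only lane 0 of each warp issues the atomicAdd.
__global__ void warpReduceSum(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute 0 instead of returning early,
    // so every lane in the warp still participates in the shuffles.
    float val = (idx < n) ? in[idx] : 0.0f;

    // Tree reduction within the warp: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // After the loop, lane 0 holds the warp's partial sum.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, val);
}
```

This would be launched with a block size that is a multiple of 32, so atomicAdd fires once per warp (32x fewer atomics) instead of once per thread. Is this roughly the right approach, or is there a more idiomatic pattern?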