I am looking to use warp-aggregated operations for a simple global summation in CUDA. This is purely a learning exercise and I am not sure the performance would be worth it, but here goes.
Can anyone point me to an example of a reduction in CUDA where the reduction is performed at warp-level granularity, so that atomicAdd is called by a single thread in each warp rather than by every thread in every warp? Can this be done?
Yes - look, for instance, at the second example (“Shuffle Warp Reduce”) in this Parallel Forall blog post.
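The core of that pattern is a shuffle-based warp reduction. A minimal sketch (the function name `warpReduceSum` is just illustrative; the intrinsic is `__shfl_down()` on pre-CUDA 9 toolkits, `__shfl_down_sync()` with a full mask on CUDA 9 and later):

```cuda
// Each of the 32 lanes passes in its partial value; after log2(32) = 5
// shuffle steps, lane 0 holds the sum of the whole warp. Requires
// compute capability >= 3.0.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);  // pull value from lane (lane + offset)
    return val;
}
```

Note that only lane 0 ends up with the complete sum; the other lanes hold partial results, which is exactly what you want if only one thread per warp will issue the atomic.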
You can also exchange the data through shared memory instead if your GPU does not support the shuffle intrinsics (they require compute capability 3.0 or higher).
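A minimal sketch of the shared-memory variant, assuming the caller has already reduced the block's partial sums into the first 32 elements of `sdata` and calls this only for threads with `tid < 32` (names are illustrative):

```cuda
// Final warp-level stage of a shared-memory reduction. Within a single
// warp these steps need no __syncthreads(), because the warp executes
// in lockstep on these GPUs; sdata must be declared volatile so the
// intermediate stores are not optimized away.
__device__ int warpReduceSumShared(volatile int *sdata, int tid) {
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
    return sdata[tid];  // sdata[0] now holds the warp's sum
}
```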
A pretty good introduction to basic parallel reduction techniques (not using atomics), including shared-memory reduction at the warp level, is here:
In addition, the programming guide discusses warp-level reductions using the __shfl() intrinsics.
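Putting it together for your use case, one sketch of a kernel where each warp does its reduction with shuffles and only lane 0 issues an atomicAdd (kernel and parameter names `sumKernel`, `in`, `out`, `n` are my own, not from any of the linked material):

```cuda
// Global sum with one atomicAdd per warp instead of one per thread.
// Assumes *out was zeroed before launch and compute capability >= 3.0.
__global__ void sumKernel(const int *in, int *out, int n) {
    int val = 0;
    // Grid-stride loop: each thread accumulates a private partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        val += in[i];

    // Warp-level tree reduction; lane 0 ends up with the warp's total.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);

    // Only the first lane of each warp touches global memory atomically.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, val);
}
```

This cuts the number of atomics by a factor of 32 compared to every thread calling atomicAdd; you could reduce it further by first combining the warp sums within a block through shared memory and issuing one atomic per block.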
And there is the blog that tera pointed out.
Thanks for the info. I will give the shfl instructions a try and only call atomics on a single thread per warp.