I am looking to use warp-aggregated operations for a simple global summation in CUDA. This is purely a learning exercise and I am not sure whether the performance would be worth it, but here goes.

Can anyone point me to an example of a reduction in CUDA that works at warp-level granularity? In other words, atomicAdd would be issued by only a single thread in each warp, rather than by every thread in every warp. Can this be done?
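To make it concrete, here is a minimal sketch of what I am imagining. The kernel and variable names are my own guesses; I am assuming a warp size of 32, a float accumulator, and that `__shfl_down_sync` is the right primitive for the in-warp reduction:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp reduces its 32 values in registers
// via shuffles, then only lane 0 of each warp issues the atomicAdd.
__global__ void warpReduceSum(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute 0 instead of returning early,
    // so every lane in the warp still participates in the shuffles.
    float val = (idx < n) ? in[idx] : 0.0f;

    // Tree reduction within the warp: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // After the loop, lane 0 holds the warp's partial sum.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, val);
}
```

This would be launched with a block size that is a multiple of 32, so atomicAdd fires once per warp (32x fewer atomics) instead of once per thread. Is this roughly the right approach, or is there a more idiomatic pattern?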