Wondering if someone has already timed the 'classic' sum reduction from the NVIDIA SDK examples (through shared memory) against reducing within each warp using shuffle instructions, then passing each warp's partial sum through shared memory to a single warp and reducing again with shuffle down to one value. I'd have thought NVIDIA might have published a study on this, since it's a good example for showing off the new __shfl operations.
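For reference, here's a sketch of the 'classic' method I mean, in the style of the NVIDIA SDK reduction sample (kernel and variable names are mine, and this assumes blockDim.x is a power of two):

```cuda
__global__ void reduceShared(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: each step halves the active threads, and every
    // step costs shared-memory traffic plus a __syncthreads().
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```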
My quick calculation shows that the classic reduction of N values in a block requires roughly 3*N shared-memory reads/writes in total: at each step the active threads each do two reads and one write, and summing 3*(N/2 + N/4 + ...) gives about 3*(N-1). (I usually use 256 threads/block for compute capability < 2.0 and 512/block otherwise.) With the shared-memory-only method, warps get retired pretty quickly, but there's a lot of synchronization needed. By comparison, the shuffle version keeps every warp alive (with decreasing useful work per warp) until each warp is down to one value and has transferred it to the first warp; then all the warps retire except the first (or whichever one you choose to keep alive). I'm guessing the shuffle will win in the end, given all the shared-memory transfers and syncs the classic method needs.
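The shuffle-based scheme described above might look something like this sketch (assuming compute capability >= 3.0 for __shfl_down; on newer CUDA toolkits you'd use __shfl_down_sync with a mask instead — again, names here are illustrative):

```cuda
// Intra-warp reduction: no shared memory, no __syncthreads().
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;
}

__global__ void reduceShuffle(const float *in, float *out, int n)
{
    __shared__ float partial[32];   // one slot per warp (max 32 warps/block)
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;

    float val = (i < n) ? in[i] : 0.0f;
    val = warpReduceSum(val);       // every warp reduces to one value

    if (lane == 0)
        partial[warp] = val;        // one shared-memory write per warp
    __syncthreads();                // the only block-wide sync needed

    // First warp reduces the per-warp partials with shuffles again.
    if (warp == 0) {
        int nWarps = blockDim.x / warpSize;
        val = (lane < nWarps) ? partial[lane] : 0.0f;
        val = warpReduceSum(val);
        if (lane == 0)
            out[blockIdx.x] = val;
    }
}
```

For a 256-thread block this cuts shared-memory traffic to 8 writes and 8 reads plus a single sync, versus the hundreds of shared-memory operations and log2(256) syncs of the classic version, which is what makes me suspect shuffle wins.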
I’ll give it a whirl and compare the two unless I hear someone else has already done it.