Reduction through shared memory vs. shuffle

Wondering if someone has already timed the sum reduction using the 'classic' method presented in NVIDIA examples (through shared memory) vs. reducing within warps with shuffle instructions, transferring each warp's partial sum through shared memory to one warp, and reducing again with shuffle to a single value. Thought NVIDIA might have done a study to demonstrate this, as it's a good example to show off the new __shfl() operator.

My quick calculation shows that a reduction of N values in a block requires N*3 reads/writes from shared memory (I usually use 256 threads/block for compute <2.0 and 512/block otherwise). With the shared-memory-only method, warps get retired pretty quickly, but there's a lot of synchronization needed. By comparison, the shuffle method will keep every warp alive (with decreasing utilization per warp) until each is down to one value and has transferred it to the first warp. Then all the warps die except the first (or whichever one you choose to keep alive). I'm guessing the shuffle will win in the end, given all those shared memory transfers and syncs.
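For concreteness, here's a minimal sketch of the scheme described above (the function names are my own, and I'm using the pre-Volta `__shfl_down()`; on compute capability 7.0+ it would be `__shfl_down_sync()` with a mask):

```cuda
// Reduce within each warp via shuffle, stage one partial sum per warp
// through shared memory, then let the first warp finish the job.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__inline__ __device__ float warpReduceSum(float val) {
    // Each step halves the number of lanes holding live partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;   // lane 0 holds the warp's sum
}

__inline__ __device__ float blockReduceSum(float val) {
    __shared__ float partial[32];          // one slot per warp, not per thread
    int lane = threadIdx.x & 31;
    int wid  = threadIdx.x >> 5;

    val = warpReduceSum(val);              // reduce within each warp
    if (lane == 0) partial[wid] = val;     // warp leaders publish their sums
    __syncthreads();

    // First warp reduces the per-warp partial sums.
    val = (threadIdx.x < blockDim.x / 32) ? partial[lane] : 0.0f;
    if (wid == 0) val = warpReduceSum(val);
    return val;                            // valid in thread 0 only
}
```

Note that only one `__syncthreads()` is needed per block reduction, vs. one per halving step in the classic shared-memory tree.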

I’ll give it a whirl and compare the two unless I hear someone else has already done it.

Thanks all.

In my experience, __shfl() makes a modest difference in the running time of scans/reductions compared to the typical shared memory techniques: it speeds things up by 5-10% over the older methods (all else being equal).

Generally you end up using less shared memory (THREADS/32 elements instead of THREADS), and the code ends up looking cleaner with __shfl().

What makes more of a difference in reductions/scans is filling up the SMs with work and reading from global memory in coalesced fashion.

Just saw a post on the "Parallel Forall" blog by Justin Luitjens where he discusses this, but with no timings vs. the shared-only method. He also only looked at ints, not floats; I assume this shouldn't matter, given that they're the same size and cycle time for an add.

Saw a talk by Justin Luitjens at GTC 2014; excellent speaker.

I saw a video about a year ago where an NVIDIA employee said he saw about a 20% speedup. I don't know how general that is, since in my application I saw about 10%. It definitely did speed things up, and uses fewer resources, so it's kind of a no-brainer.

Just thought I'd follow up with some timing since I said I would. I'm running on a GeForce GTX 650, so a relatively small card (compute 3.0). Below are timings for the comparisons mentioned in the blog above (always using 1024 threads/block, single and vector (4-tuple) float loads, and atomicAdd for the reduction across blocks (or across warps)):
(1) Using pure shared memory; non-vector loads
(2) Using shared memory only with 4-tuple loads
(3) Using shuffle and shared memory; atomic update per block; non-vector load
(4) Using shuffle and shared memory; atomic update per block; vector load
(5) Using shuffle only; atomic update per warp; non-vector load
(6) Using shuffle only; atomic update per warp; vector load
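For reference, the "shuffle only; atomic update per warp" variants (5) and (6) presumably look something like the following sketch (my own naming, non-vector load shown; the vector version would load a `float4` and sum its components first):

```cuda
// Each warp reduces its 32 values via shuffle, then lane 0 issues one
// atomicAdd straight into the global result -- no shared memory at all.
__global__ void reduceShflAtomicPerWarp(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;

    // Warp-level tree reduction via shuffle.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);

    if ((threadIdx.x & 31) == 0)   // one atomic per warp
        atomicAdd(out, val);
}
```

With 1024 threads/block that is 32 atomics per block instead of 1, but it avoids both the shared memory traffic and the `__syncthreads()`.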

Depending on card, your mileage may vary…

65536 float elements
Test -> Time (usec)
1 -> 37.7
2 -> 12.3
3 -> 38.7
4 -> 13.1
5 -> 24.7
6 -> 9.0

2^21 float elements (~the number of pixels in a 1920 x 1080 HD image)
Test -> Time (usec)
1 -> 654
2 -> 368
3 -> 692
4 -> 360
5 -> 488
6 -> 274

Looks like the shuffle-only version is ~20-25% faster than the shared-only version (like shaklee3 mentioned above).

Interesting that the (shared+shuffle) version is slower than the pure shared version. I think this has to do with the fact that the shared version retires warps quite quickly, since every add halves the number of warps still participating in a block. However, the shared-memory-only version relies on the 'unsafe' assumption that threads within a warp execute in lockstep, so I don't call __syncthreads() once I'm below 32 values. To make matters worse, it actually wasn't working correctly until I added the volatile keyword to the static shared memory declaration in the shared-only version. Lesson learned.
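To illustrate why the volatile mattered: the warp-synchronous tail of the classic shared-memory reduction looks something like this sketch, where each line reads values just written by other lanes without any `__syncthreads()` in between. Without `volatile`, the compiler is free to cache `sdata[tid]` in a register, so the reads can see stale data:

```cuda
// Warp-synchronous tail of a classic shared-memory reduction.
// 'volatile' forces every read/write to actually touch shared memory.
// (On Volta and later this lockstep assumption itself is no longer
// safe without __syncwarp(), but it held on compute 3.0 parts.)
__device__ void warpReduceTail(volatile float *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```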

Going to follow this up with a test that atomically sums into doubles, which I’m guessing is quite commonly needed when summing more than a million floats.
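One wrinkle with that test: hardware atomicAdd on double only exists on compute capability 6.0 and later, so on a card like the GTX 650 it has to be emulated with atomicCAS. The CUDA C Programming Guide gives essentially this implementation:

```cuda
// Software atomicAdd for double on pre-6.0 hardware, built on the
// 64-bit atomicCAS. Loops until this thread's compare-and-swap lands.
__device__ double atomicAddDouble(double *address, double val) {
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread got there first
    return __longlong_as_double(old);
}
```

Under contention (e.g. one atomic per warp) this loop can retry many times, so the per-warp-atomic variants may look worse with doubles than they do with floats.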

For 2^21 float elements on a K20c, using __shfl() and shared memory with no atomics or vector loads, I get an average of 68 usec (averaged over 100 reps), at almost 60% of global memory bandwidth.

When I go up to 2^27 float elements (3058 usec on average), global memory bandwidth utilization gets to about 85%.

Those vectorized loads have never made a significant positive difference on the K20c in my experience; I wonder if they matter more on the consumer cards.

How many floats are you summing per block? I’m surprised there is that much difference between the different methods. For large enough chunks of data reduction should be entirely memory bound, as CudaaduC mentioned.

As stated above, always 1024 threads/block. I understand that this is a fairly simple kernel and should be memory-bound; I don't know why there are such large variations, but the NVIDIA folks show the same thing.

On the 'Parallel Forall' blog I mention above, there's a chart that plots bandwidth as a function of reduction size (in integer elements) for several methods (mostly the same ones I measured). In that chart it's pretty clear that saturation doesn't occur until about 32M elements. Now, that was for a much bigger card than mine (K20x), and I haven't done an exhaustive study over size for my card, so I don't know where on that curve I am. However, right in the middle of that chart there's quite a separation between the methods, and that's at roughly 2M samples, the larger of the two sizes I timed and posted above (I happen to deal mostly with HD images, so that size is particularly important to me).

I'm more than happy to post my code for review, but it's pretty much lifted from the blog, with some changes to template the block size and incoming data type (vector vs. non-vector).

There doesn’t need to be a 1-to-1 relationship between threads and summands. In fact it is quite wasteful to have as many threads as numbers to add. It is much more efficient to have just enough threads to fully load the device and then within each thread sum over multiple addends just as a serial version of the code would, as this does not need any communication between threads and avoids leaving most of the threads unused as the reduction progresses. The different parallel reduction techniques for reducing all thread results to a single result per block would then only account for a tiny fraction of the entire runtime and thus make only a small difference.
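A sketch of that approach (names are my own): launch only enough blocks to fill the device, and have each thread walk the input with a grid-stride loop, accumulating serially before any inter-thread reduction starts.

```cuda
// Each thread serially sums many elements with a grid-stride loop;
// this part needs no communication and keeps every thread busy.
// Launch with e.g. enough blocks to fill the SMs, not n/blockDim.x.
__global__ void reduceGridStride(const float *in, float *out, int n) {
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];   // serial, communication-free accumulation

    // ...then reduce 'sum' across the block by any of the methods
    // discussed above (e.g. shuffle within warps, shared staging,
    // and one atomicAdd per block into *out).
}
```

Since each thread might sum thousands of elements before the block-level reduction runs once, the choice of reduction technique becomes a second-order effect, as described above.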

I haven't checked the Parallel Forall blog in particular. I just remember the early NVIDIA presentations on parallel reduction based on Mark Harris' work. They went through all the different code variants for didactic reasons, so the speedup from each transformation could be recorded, and the above optimization was left as the last improvement step. Had they performed it as the first step, all the others would have made little difference.