Faster Parallel Reductions on Kepler

Originally published at: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/

Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but this 6-year-old document bears updating. In this post I will show you some features of the Kepler GPU architecture which make reductions even faster: the shuffle (SHFL)…

Nice post!

A "bit twiddling" variant of multiple reductions is to encode multiple values in a single word and then implicitly perform multiple reductions in one 32-bit SHFL reduction (or scan).

For example, if you categorize the values of a warp into 5 types (or 4 + "no match") and want to know the total for each type, you can encode this into five 6-bit fields and run your warp reduction (or scan). Each 6-bit field will total between 0 and 32.

I haven't tested whether this kind of reduction is much faster than 5 ballot()+popc() ops, but it's close to the same instruction count (10) in isolation. I'm pretty sure it's faster than a prefix sum scan, though, because of the extra instructions a scan requires.
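A minimal sketch of that packed-field trick, assuming Kepler-era __shfl_down and an illustrative layout of five 6-bit fields (the function names are hypothetical):

// Each thread sets a 1 in the 6-bit field matching its category; a single
// 32-bit warp sum then yields all five per-warp counts at once. No field
// can overflow, because each holds at most 32 (the warp size), which fits
// in 6 bits.
__inline__ __device__ int warpReduceSumPacked(int packed) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    packed += __shfl_down(packed, offset);
  return packed;
}

__device__ void warpCountCategories(int category /* 0..4 */, int counts[5]) {
  int packed = 1 << (6 * category);        // one-hot 6-bit field
  packed = warpReduceSumPacked(packed);    // all five totals in one word
  if ((threadIdx.x & (warpSize - 1)) == 0) // lane 0 unpacks the counts
    for (int c = 0; c < 5; ++c)
      counts[c] = (packed >> (6 * c)) & 0x3F;
}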

I've found this warp hack useful for quickly categorizing and binning values!

Great information at just the right time.
I was just planning to experiment with shuffle for our code.
I used the warp reduce to do some pre-reduction in the kernel that generates the data for the global sum, in addition to the pre-reduction in registers. Since the generating kernel now has fewer stores to global memory it is slightly faster, and the subsequent reduction is now only over an array that is 32x smaller.
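A rough sketch of that pattern, assuming the warpReduceSum from the post and a hypothetical generate() standing in for whatever computes each element:

// Every warp reduces its values with shuffles and writes a single partial
// sum, so the generating kernel performs 32x fewer global stores and the
// follow-up reduction runs over a 32x smaller array.
__global__ void generateAndPreReduce(int *partials, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int val = (i < N) ? generate(i) : 0;       // generate() is a placeholder
  val = warpReduceSum(val);                  // warp-level pre-reduction
  if ((threadIdx.x & (warpSize - 1)) == 0)   // lane 0 writes the warp's sum
    partials[i / warpSize] = val;            // one store per warp
}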

Another point: I was wondering about the return in 'warpReduceSumTriple' as well as the reinterpret_cast(&var); without template arguments. I have not tried, but I believe the compiler might not like that code.

Great catches. The warpReduceSumTriple error was a copy/paste mistake, and the missing reinterpret_cast template argument was due to the editor stripping '<' / '>' to avoid malicious html. I have fixed both errors!

Thanks.
In view of the increased ILP, I was wondering whether you tried to use the interleaved warpReduceSum by default for single reductions as well, by dividing the array into multiple sections, i.e. splitting it into 4 parts and then using a warpReduceSumMultiple which does the shuffle for an int4.
Of course this will result in a smaller number of threads, but the increased ILP may eventually yield better performance.

Yes, see the discussion of vector loads Justin made in the post (and the Pro Tip on the subject). No need to divide the array into sections -- just cast the array to an array of int2 or int4 and load one vector per thread.
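For example, something along these lines, assuming N is a multiple of 4, the input pointer is 16-byte aligned, *out is zero-initialized before launch, and warpReduceSum is the shuffle reduction from the post:

__global__ void deviceReduceVec4(const int *in, int *out, int N) {
  int sum = 0;
  const int4 *in4 = reinterpret_cast<const int4 *>(in);
  // grid-stride loop over int4 elements: one 128-bit load per iteration
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N / 4;
       i += blockDim.x * gridDim.x) {
    int4 v = in4[i];
    sum += (v.x + v.y) + (v.z + v.w);        // per-thread ILP before the shuffle
  }
  sum = warpReduceSum(sum);
  if ((threadIdx.x & (warpSize - 1)) == 0)
    atomicAdd(out, sum);                     // one atomic per warp
}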

I was playing with merging the blockReduceSum and using it together with atomics directly in a kernel which calculates the dot product of two vectors.

When my problem size is not a multiple of the blockSize - which happens when tuning block sizes - the last block will have fewer 'active' warps than blockDim/warpSize. I do bounds checking in the kernel with the typical idx (= blockDim.x*blockIdx.x + threadIdx.x) > number_of_elements.
Of course I can activate the missing warps and just set their values to zero. This, however, resulted in somewhat increased register usage and was slower. Is there a way to make blockReduceSum safe even if not all threads in the block are actually executing it, i.e., they have exited because they belong to the last block and their index is beyond the problem size?

Typically I will keep those threads around. I never put an early return in a kernel. If I'm doing a reduction across the entire block, I structure the main loop using the grid-stride loops discussed in an earlier blog post and then reduce across the block after exiting the loop. The for loop naturally handles the bounds checking you mentioned, and at the end of the loop all threads are still present for the reduction.

In addition, a grid-stride loop also divides the array into multiple sections so that each thread processes multiple elements.
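For the dot-product case above, a minimal sketch of that structure might look like this (the float blockReduceSum is just the post's int version adapted, and *result is assumed to be zeroed before launch):

__inline__ __device__ float warpReduceSum(float val) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset);
  return val;
}

__inline__ __device__ float blockReduceSum(float val) {
  static __shared__ float shared[32];        // one partial sum per warp
  int lane = threadIdx.x % warpSize;
  int wid = threadIdx.x / warpSize;
  val = warpReduceSum(val);
  if (lane == 0) shared[wid] = val;
  __syncthreads();
  // read from shared memory only if that warp existed
  val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0.0f;
  if (wid == 0) val = warpReduceSum(val);    // final reduce within warp 0
  return val;
}

__global__ void dotProduct(const float *a, const float *b, float *result, int N) {
  float sum = 0.0f;
  // grid-stride loop: out-of-range threads just skip iterations, so every
  // thread is still alive for the block reduction below
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
       i += blockDim.x * gridDim.x)
    sum += a[i] * b[i];
  sum = blockReduceSum(sum);
  if (threadIdx.x == 0)
    atomicAdd(result, sum);                  // one atomic per block
}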

A misprint in section "Atomic Power":
if (threadIdx.x & warpSize == 0)
should be
if (threadIdx.x % warpSize == 0)

Fixed, thanks! Actually it's more efficient to write:

if ((threadIdx.x & (warpSize - 1)) == 0)

Because the compiler doesn't know that warpSize is a power of 2.

In the blockReduceSum() function above, is the keyword 'static' in front of the shared memory declaration really needed? nvcc throws an error on this in my setup...

Also, isn't a bit-shift operation faster than the '/'? If so, why, in the shuffle reduce loop, do you use warpSize/2 instead of (warpSize>>1) (or, for that matter, offset /= 2 instead of offset >>= 1)?

Amazing,
I was just realizing that I needed a fast single-kernel reduction. This lets me easily launch up to 32 reductions, each one in a different CUDA stream, so the GPU stays fully utilized, and do the global sync just once at the end when all the kernels have finished.
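A rough sketch of that launch pattern, with reduceKernel standing in for whichever single-kernel reduction from the post is used (stream count, block sizes, and the host-side bookkeeping here are illustrative):

#include <algorithm>
#include <cuda_runtime.h>

__global__ void reduceKernel(const int *in, int *out, int N);  // from the post

void launchConcurrentReductions(const int *in[], int *out[], const int *sizes,
                                int numReductions /* up to 32 */) {
  cudaStream_t streams[32];
  for (int i = 0; i < numReductions; ++i) {
    cudaStreamCreate(&streams[i]);
    int threads = 256;
    int blocks = std::min((sizes[i] + threads - 1) / threads, 1024);
    // each reduction runs in its own stream so independent kernels can overlap
    reduceKernel<<<blocks, threads, 0, streams[i]>>>(in[i], out[i], sizes[i]);
  }
  cudaDeviceSynchronize();                   // single global sync at the end
  for (int i = 0; i < numReductions; ++i)
    cudaStreamDestroy(streams[i]);
}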

Thanks!

We need to discuss the shared memory bank conflicts in the previous document by Mark Harris, "reduction.pdf": why are there no shared memory bank conflicts in the Reduction #1 (Interleaved Addressing) example, whereas Reduction #2 suffers from bank conflicts?

Reduction without storing the array in shared memory is really great for my app!
It looks like this example of shuffle reduction works faster with this little tweak:
__inline__ __device__ int warpReduce(int val) {
  #pragma unroll // for "16", not "warpSize/2"
  for (int offset = 16; offset > 0; offset /= 2)
  { ... }
  return val;
}

Yes, if you hard-code the warpSize it will definitely be faster. Unfortunately warpSize is not a compile-time constant.

static is optional, as all shared memory allocations are static. I included it to be explicit. Older compilers did not support this keyword even though they applied its semantics. You are probably using an older compiler.

A shift is indeed faster but compilers should be able to make that transformation. I didn't use a shift because it makes the example a little more complicated.

Bank conflicts only occur between threads in the same warp. __shfl is handling all the exchanges within the warp and does not suffer from shared memory bank conflicts. Thus there is no reason to discuss bank conflicts in this example.

I think Munesh is referring to my "Optimizing Parallel Reduction in CUDA" document. The reason, Munesh, is that in reduction #1, (active) threads index shared memory with a stride of 1, while in #2 they index shared memory with a stride of increasing multiples of 2, meaning an increasing number of accesses map to the same bank.

Thanks Mark, I got it. I sketched the mapping on paper. Sorry I had to put the question here, as I could not find a forum for it elsewhere.