How to perform multiple small reduction efficiently?


Say I have an array a[N], I want to perform reduction such as sum/min/max for sub arrays of length 10, how can I do this efficiently? A big thread block or a small thread block 32?


A simple, but reasonably fast solution would be to use 1 warp for each sub array, and within each warp, use warp synchronous programming to eliminate the need of syncthreads() (you can find the example in the reduction algorithm documentation from the SDK).

It seems the essence of the problem is whether to exploit ||ism within each sub-array reduction, or ||ism across all sub arrays, or both. The solution above exploits both, but will have warp under utilization since your sub array size < warpSize. If you use 1 warp per sub array, the utilization will be very low:

5/32 on 0th iteration, (10 elements remaining)
2/32 on 1st iteration, (5 elements)
1/32 on 2nd iteration, (3 elements)
1/32 on 3rd iteration (2 elements)

There’s the possibility of using only 1 thread to reduce each sub array, but that would involve non-contiguous reads (wouldn’t be a problem if in shared memory and using padding to keep the ith element of each sub-array in separate banks) or some clever interleaved data format.

How are the sub-arrays held on memory? Are they stored as a matrix?
If so, you could use 1 thread/array quite efficiently
1’ * A reduces every column of A using sum. An equivalent measure could be defined for min,max as well obviously.
That requires shared memory. Easier still:
At * 1 if you have the transpose available. Such a kernel would generate coalesced reads and no shared mem usage at all.

For thread block size of 32:

device void reduce512( float smem512, unsigned short tID, unsigned short subBlock){ // reduce 32