How to perform multiple small reductions efficiently?

fatalme · May 22, 2013, 5:43pm

Hi:

Say I have an array a[N] and I want to perform a reduction such as sum/min/max over each sub-array of length 10. How can I do this efficiently? Should I use a big thread block or a small thread block of 32 threads?

Thanks.

Uncle_Joe · May 22, 2013, 7:17pm

A simple but reasonably fast solution would be to use one warp for each sub-array and, within each warp, use warp-synchronous programming to eliminate the need for __syncthreads() (you can find an example in the reduction algorithm documentation from the SDK).

It seems the essence of the problem is whether to exploit parallelism within each sub-array reduction, parallelism across all sub-arrays, or both. The solution above exploits both, but the warps will be underutilized because your sub-array size is smaller than warpSize. With one warp per sub-array, the utilization will be very low:

5/32 on the 0th iteration (10 elements remaining)
2/32 on the 1st iteration (5 elements)
1/32 on the 2nd iteration (3 elements)
1/32 on the 3rd iteration (2 elements)
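
The one-warp-per-sub-array idea can be sketched roughly as follows. This is only an illustration, not the SDK example: it assumes CUDA 9 or later and swaps the shared-memory warp-synchronous idiom for `__shfl_down_sync`, and the names `SUB_LEN`, `subarray_sum_warp`, and `run_warp_sums` are hypothetical. Lanes 10 to 31 only contribute zeros, which is the low utilization described above.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int SUB_LEN = 10;  // sub-array length from the question (assumed fixed)
constexpr int WARP    = 32;  // warp size on current NVIDIA GPUs

// Hypothetical kernel: one warp sums one sub-array of SUB_LEN floats.
// Assumes blockDim.x is a multiple of 32 so that whole warps exit together.
__global__ void subarray_sum_warp(const float *a, float *out, int numSub)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / WARP;
    int lane   = threadIdx.x % WARP;
    if (warpId >= numSub) return;

    // Lanes 0..9 load one element each; the remaining lanes contribute 0.
    float v = (lane < SUB_LEN) ? a[warpId * SUB_LEN + lane] : 0.0f;

    // Tree reduction over the low 16 lanes: offsets 8, 4, 2, 1.
    for (int offset = 8; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    if (lane == 0) out[warpId] = v;  // lane 0 now holds the sub-array sum
}

// Hypothetical host helper (error checking omitted for brevity).
std::vector<float> run_warp_sums(const std::vector<float> &h)
{
    int numSub = static_cast<int>(h.size()) / SUB_LEN;
    float *dA = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, h.size() * sizeof(float));
    cudaMalloc(&dOut, numSub * sizeof(float));
    cudaMemcpy(dA, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;                                    // 4 warps per block
    int blocks  = (numSub * WARP + threads - 1) / threads;
    subarray_sum_warp<<<blocks, threads>>>(dA, dOut, numSub);

    std::vector<float> r(numSub);
    cudaMemcpy(r.data(), dOut, numSub * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOut);
    return r;
}
```

For min or max, the same loop should work with `fminf`/`fmaxf` in place of `+=`, provided the idle lanes are filled with a neutral value (for example `INFINITY` for min) instead of 0.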

There’s also the possibility of using only one thread to reduce each sub-array, but that would involve non-contiguous reads (which wouldn’t be a problem if the data is in shared memory and padded to keep the ith element of each sub-array in separate banks) or some clever interleaved data format.
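
A minimal sketch of the one-thread-per-sub-array variant with padded shared memory, assuming 32 four-byte banks and sub-arrays of length 10 stored back to back. With a stride of 10, threads t and t+16 of a warp would land in the same bank (16 × 10 = 160 is a multiple of 32); padding each sub-array to 11 words makes the stride coprime with 32, so the 32 threads of a warp read from 32 distinct banks. All identifiers (`PAD_LEN`, `SUBS_PER_BLOCK`, `subarray_sum_padded`, `run_padded_sums`) are made up for the example.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int SUB_LEN        = 10;           // sub-array length from the question
constexpr int PAD_LEN        = SUB_LEN + 1;  // odd stride, conflict-free reads
constexpr int SUBS_PER_BLOCK = 64;           // hypothetical tile: one thread per sub-array

// Launch with blockDim.x == SUBS_PER_BLOCK.
__global__ void subarray_sum_padded(const float *a, float *out, int numSub)
{
    __shared__ float tile[SUBS_PER_BLOCK * PAD_LEN];

    int firstSub  = blockIdx.x * SUBS_PER_BLOCK;
    int subsHere  = min(SUBS_PER_BLOCK, numSub - firstSub);
    int elemsHere = subsHere * SUB_LEN;

    // Coalesced load: consecutive threads read consecutive global elements
    // and scatter them into the padded layout.
    for (int e = threadIdx.x; e < elemsHere; e += blockDim.x) {
        int s = e / SUB_LEN;  // sub-array index within this tile
        int i = e % SUB_LEN;  // element index within the sub-array
        tile[s * PAD_LEN + i] = a[firstSub * SUB_LEN + e];
    }
    __syncthreads();

    // Each thread reduces its own sub-array serially from shared memory.
    if (threadIdx.x < subsHere) {
        float sum = 0.0f;
        for (int i = 0; i < SUB_LEN; ++i)
            sum += tile[threadIdx.x * PAD_LEN + i];
        out[firstSub + threadIdx.x] = sum;
    }
}

// Hypothetical host helper (error checking omitted for brevity).
std::vector<float> run_padded_sums(const std::vector<float> &h)
{
    int numSub = static_cast<int>(h.size()) / SUB_LEN;
    float *dA = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, h.size() * sizeof(float));
    cudaMalloc(&dOut, numSub * sizeof(float));
    cudaMemcpy(dA, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (numSub + SUBS_PER_BLOCK - 1) / SUBS_PER_BLOCK;
    subarray_sum_padded<<<blocks, SUBS_PER_BLOCK>>>(dA, dOut, numSub);

    std::vector<float> r(numSub);
    cudaMemcpy(r.data(), dOut, numSub * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOut);
    return r;
}
```

The conflict-free claim applies to the per-thread reads in the second phase; the scattered stores in the load phase may still see an occasional two-way conflict.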

sBc-Random · May 23, 2013, 2:37am

How are the sub-arrays held in memory? Are they stored as a matrix?
If so, you could use one thread per sub-array quite efficiently.
Consider:
1’ * A (a row vector of ones times A) reduces every column of A using sum. An equivalent operation could obviously be defined for min and max as well.
That requires shared memory. Easier still:
At * 1 (the transpose of A times a column vector of ones), if you have the transpose available. Such a kernel would generate coalesced reads and use no shared memory at all.
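
The transposed variant might look like the sketch below. It assumes the data is already stored so that element i of sub-array s sits at `at[i * numSub + s]` (the interleaved layout that makes consecutive threads read consecutive addresses). The names `reduce_transposed`, `Reduced`, and `run_transposed` are illustrative, not from any library.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int SUB_LEN = 10;  // sub-array length from the question

// One thread per sub-array. Thread s reads at[i * numSub + s] for i = 0..9,
// so neighbouring threads touch neighbouring addresses (coalesced),
// and no shared memory is needed.
__global__ void reduce_transposed(const float *at, float *sum,
                                  float *mn, float *mx, int numSub)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSub) return;

    float v = at[s];
    float acc = v, lo = v, hi = v;
    for (int i = 1; i < SUB_LEN; ++i) {
        v = at[i * numSub + s];
        acc += v;
        lo = fminf(lo, v);
        hi = fmaxf(hi, v);
    }
    sum[s] = acc;
    mn[s]  = lo;
    mx[s]  = hi;
}

struct Reduced { std::vector<float> sum, mn, mx; };

// Hypothetical host helper (error checking omitted for brevity).
Reduced run_transposed(const std::vector<float> &at, int numSub)
{
    size_t inBytes  = at.size() * sizeof(float);
    size_t outBytes = numSub * sizeof(float);
    float *dAt = nullptr, *dSum = nullptr, *dMn = nullptr, *dMx = nullptr;
    cudaMalloc(&dAt, inBytes);
    cudaMalloc(&dSum, outBytes);
    cudaMalloc(&dMn, outBytes);
    cudaMalloc(&dMx, outBytes);
    cudaMemcpy(dAt, at.data(), inBytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks  = (numSub + threads - 1) / threads;
    reduce_transposed<<<blocks, threads>>>(dAt, dSum, dMn, dMx, numSub);

    Reduced r{std::vector<float>(numSub), std::vector<float>(numSub),
              std::vector<float>(numSub)};
    cudaMemcpy(r.sum.data(), dSum, outBytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r.mn.data(),  dMn,  outBytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r.mx.data(),  dMx,  outBytes, cudaMemcpyDeviceToHost);
    cudaFree(dAt); cudaFree(dSum); cudaFree(dMn); cudaFree(dMx);
    return r;
}
```

If the data only exists in the original back-to-back layout, the cost of producing the transpose has to be weighed against the simpler kernel.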

fatalme · May 24, 2013, 3:48pm

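For a thread block size of 32:

// Sums three 10-element sub-arrays held in smem512[0..29]; the results end up in
// smem512[0], smem512[10] and smem512[20]. Must be called by all 32 threads of the warp.
__device__ void reduce512(float *smem512, unsigned short tID, unsigned short subBlock){ // reduce 32

if(tID<25 && tID%10<5){
	// fold the upper half of each sub-array onto its lower half
	smem512[tID]+=smem512[tID+5];
}
__syncwarp(); // CUDA 9+: do not rely on implicit lockstep execution between the two steps
if(tID<25 && tID%10==0){
	// the first thread of each sub-array adds the four remaining partial sums
	smem512[tID]+=smem512[tID+1]+smem512[tID+2]+smem512[tID+3]+smem512[tID+4];
}

}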
