Efficient summing of a matrix

SRebhan · June 27, 2007, 4:53pm

Hello everybody!

I want to sum all the elements of a matrix stored in shared memory. The matrix is warp size by warp size.

Is there an efficient way to do this?

I was thinking of a tree-like structure: in the first pass, half of the threads each sum two elements and write their result; sync; then a quarter of the threads each sum two elements; sync; and so on until only one element is left. Is this a good idea?

Thanks in advance!

Sven

Simon_Green · June 27, 2007, 5:37pm

Yes, the most efficient way of doing a sum reduction in parallel is using a tree-like structure in shared memory as you describe.
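For readers who want to see the shape of it, here is a minimal sketch of the tree-like reduction described above. It is not the SDK sample's code. It assumes a warp size of 32 (so a 32 x 32 tile of 1024 floats), one thread per element, and a device that allows 1024 threads per block (compute capability 2.0 or later). The names `sumTile` and `sumOfOnes` are made up for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Assumption: warp size is 32, so the tile holds 32 x 32 = 1024 elements.
constexpr int WARP = 32;
constexpr int N = WARP * WARP;

// Hypothetical kernel: sums the N floats in `in` and writes the total to `*out`.
// Must be launched with exactly one block of N threads.
__global__ void sumTile(const float* in, float* out)
{
    __shared__ float tile[N];
    const int tid = threadIdx.x;

    // Stage the matrix in shared memory (in the question it is already there).
    tile[tid] = in[tid];
    __syncthreads();

    // Tree reduction: each pass halves the number of active threads.
    // Active thread `tid` adds the element `stride` positions away.
    for (int stride = N / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        // Kept outside the branch so that every thread in the block reaches it.
        __syncthreads();
    }

    // After the last pass the total sits in element 0.
    if (tid == 0) {
        *out = tile[0];
    }
}

// Hypothetical host helper: fills a tile with ones, runs the kernel,
// and returns the sum computed on the device.
float sumOfOnes()
{
    float h_in[N];
    for (int i = 0; i < N; ++i) {
        h_in[i] = 1.0f;
    }

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    sumTile<<<1, N>>>(d_in, d_out);

    float h_out = 0.0f;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With a tile of ones, `sumOfOnes()` should return 1024 on a suitable device. The loop takes log2(1024) = 10 passes, with 512 active threads in the first pass and one in the last. On hardware limited to 512 threads per block, one option is to launch 512 threads and have each thread add two elements while loading.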

See the “scalarProduct” sample in the SDK for a simple example of this.

We’re hoping to include a more optimized reduction example in the next release of the SDK.
