Efficient summing of a matrix

Hello everybody!

I want to sum up a matrix located in shared memory which has the size warp size times warp size.

Is there an efficient way to do this?

I was thinking of a tree-like structure: in the first pass, half of the threads each sum two elements and write their result; sync; then a quarter of the threads each sum two elements; sync; and so on, until only one element is left. Is this a good idea?

Thanks in advance!


Yes, a tree-like reduction in shared memory, as you describe, is the most efficient way to do a parallel sum reduction.

See the “scalarProduct” sample in the SDK for a simple example of this.

We’re hoping to include a more optimized reduction example in the next release of the SDK.
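For reference, here is a minimal sketch of the tree reduction described above. It assumes a warp size of 32 (so the matrix has 32 × 32 = 1024 elements) and that the kernel is launched with one block of 1024 threads; the names `sumMatrix` and `sdata` are illustrative, not from any SDK sample.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define N (WARP_SIZE * WARP_SIZE)  // 1024 matrix elements

// Sums a WARP_SIZE x WARP_SIZE matrix (stored row-major in 'in')
// using a tree reduction in shared memory. Launch with one block
// of N threads.
__global__ void sumMatrix(const float *in, float *out)
{
    __shared__ float sdata[N];
    unsigned int tid = threadIdx.x;

    // Each thread loads one matrix element into shared memory.
    sdata[tid] = in[tid];
    __syncthreads();

    // Tree reduction: at each step, the active half of the threads
    // adds in an element owned by the inactive half, then all
    // threads synchronize before the next step.
    for (unsigned int s = N / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the final sum.
    if (tid == 0)
        *out = sdata[0];
}
```

Note that halving the active thread count each step as you proposed is correct, but summing into the lower half of the array (as above, with stride `s`) keeps the active threads contiguous, which avoids divergence within warps and shared-memory bank conflicts.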