 # sum of all elements of a matrix

Hello,

I was wondering how to sum up all the elements of a matrix using CUDA. Unlike matrix multiplication/addition, where I basically get each thread to compute one element of the result matrix, here there is only a single final value, so I was wondering how one could set something like this up.

I could get each block to compute the sum of one row of the matrix, but since I cannot synchronize across blocks, I don't know how I would then add up these partial sums.

Check the parallel prefix sums sample that comes with CUDA; I believe it's called "scan." It does more than you need, in the sense that it computes all partial sums of a set of n elements, but it performs the same number of arithmetic operations as a plain sum.

Paulius

The scalarprod sample also does a 1D sum reduction to compute the scalar product of two large vectors.

Extending this to a 2D matrix should be straightforward, but there will be lots of room for optimization (even beyond what the scalarprod sample does). For example, loop unrolling can be very beneficial.

Mark

We’ll have a reduction example in the next release of the SDK.

Basically, to sum n values, you need n/2 threads and log(n) steps. At each step each thread sums the value at its position with the corresponding value from the other half of the array.

The code to do a reduction in shared memory looks something like this:

```
unsigned int tid = threadIdx.x;

// assumes n is a power of two and smem[] has already been loaded
for (unsigned int stride = n / 2; stride > 0; stride >>= 1) {
    if (tid < stride) {
        smem[tid] += smem[tid + stride];
    }
    __syncthreads();  // all threads must finish this stride before the next
}
```

Thanks for the code snippet. I implemented something similar, dividing the 512x512 matrix into 16 blocks and using two kernel launches: the first computes the per-block sums, using the reduction style shown in your code, and the second adds up the 16 partial sums.

Doing this, it took around 0.16 ms to sum up all the elements of a 512x512 matrix.

Would you mind trying larger matrices and listing the times here? The reason I'm asking is that I did a similar computation using Cg: I was computing the mean of an nxn image, where each pixel had 4 components (the mean was computed per component), so I was effectively summing an nxn matrix 4 times. I got the following times on a G80:

512x512: 0.127ms

1024x1024: 0.323ms

2048x2048: 1.099ms

4096x4096: 3.342ms

The images were textures and I did take advantage of linear filtering (each read returned the mean of a 2x2 pixel region), so that may be the reason for the time difference between our computations. You might be able to do the same with CUDA, loading your matrix into a texture and using CUDA’s texture fetches. You’d have to multiply the final sum by 4 at the end. Filtering may cause precision issues, depending on your data, but it’d be worth a try.

Paulius

I’m guessing loop indexing and addressing overhead is a big factor too.

I sped up my scan code by 40% (beyond what is in the SDK) or so by unrolling loops.

Mark

Does the compiler figure out the unrolling automatically, or do you need to set an option?

The compiler decides when unrolling is appropriate. There’s no compiler option to specifically control this.

Actually, the beta compiler does not have any loop unrolling.

Mark

Actually, this is the problem of correlating two arrays, which I'm now implementing with CUDA. The multiplications can be performed by fully parallel threads, but the sum across different blocks must use an efficient method. Can anyone give me some advice? Thanks a lot.