Hello,

I was wondering how to sum all the elements of a matrix using CUDA. In matrix multiplication or addition, I basically have each thread compute one element of the output matrix, but here there is only a single final result, so I am not sure how to set this up.

I could get each block to compute the sum of one row of the matrix, but since I cannot synchronize across blocks, I do not know how I would then add up these partial sums.
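To make the setup concrete, here is a rough sketch of what I have in mind. It assumes a row-major matrix of floats and one block per row; the names `rowSumKernel`, `sumMatrix`, and `THREADS` are just mine for illustration, and error checking is left out. The last step, where I copy the per-row sums back and add them on the host, is only a placeholder for the part I do not know how to do on the GPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Threads per block (my choice); must be a power of two for the reduction below.
constexpr int THREADS = 256;

// One block per row: each block reduces its own row into rowSums[blockIdx.x].
__global__ void rowSumKernel(const float* matrix, float* rowSums, int cols)
{
    __shared__ float partial[THREADS];

    const int row = blockIdx.x;
    float acc = 0.0f;
    // Each thread strides across the row, so cols may be larger than THREADS.
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        acc += matrix[row * cols + c];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory. __syncthreads() only synchronizes
    // the threads of this block, not the other blocks.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        rowSums[row] = partial[0];
}

// Hypothetical host helper: sums a rows x cols row-major matrix.
float sumMatrix(const float* hostMatrix, int rows, int cols)
{
    float *dMatrix = nullptr, *dRowSums = nullptr;
    cudaMalloc(&dMatrix, sizeof(float) * rows * cols);
    cudaMalloc(&dRowSums, sizeof(float) * rows);
    cudaMemcpy(dMatrix, hostMatrix, sizeof(float) * rows * cols,
               cudaMemcpyHostToDevice);

    rowSumKernel<<<rows, THREADS>>>(dMatrix, dRowSums, cols);

    // Placeholder for the step I am stuck on: the partial sums are copied
    // back and added on the CPU. The blocking cudaMemcpy waits for the kernel.
    std::vector<float> rowSums(rows);
    cudaMemcpy(rowSums.data(), dRowSums, sizeof(float) * rows,
               cudaMemcpyDeviceToHost);

    cudaFree(dMatrix);
    cudaFree(dRowSums);

    float total = 0.0f;
    for (float s : rowSums)
        total += s;
    return total;
}

int main()
{
    const int rows = 4, cols = 1000;
    std::vector<float> m(rows * cols, 1.0f);  // all ones, so the sum is rows * cols
    printf("sum = %f\n", sumMatrix(m.data(), rows, cols));
    return 0;
}
```

What I would like to avoid is that final loop on the host, since for a large number of rows it seems wasteful to copy all the partial sums back just to add them.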

Thanks in advance