Parallel reduction of nxn blocks in mxm matrix

Hi, I’m new to CUDA programming and would like some ideas on how to design this algorithm. I’d like to take an mxm matrix and sum each nxn block inside it, ending up with (assuming m is divisible by n) an (m/n)x(m/n) matrix of sums. I’ve seen parallel reduction algorithms that produce a single sum, but I’m not sure how to handle this particular case efficiently. I also realize that you can think of this as an nxn convolution with stride n and a kernel of all ones, but I’d like to try a direct approach rather than using cuDNN.


Just run (m/n)x(m/n) blocks, each computing the sum of its own submatrix, and look at
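To make the one-block-per-submatrix idea concrete, here is a minimal sketch of how it might look. The names `tileSumKernel`, `tileSums`, and `kThreads` are made up for illustration, the matrix is assumed to be row-major `float` with m divisible by n, and CUDA error checking is left out for brevity. Each block strides over its nxn tile, then does the usual shared-memory tree reduction.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// One thread block per n x n tile. blockIdx.x picks the tile column,
// blockIdx.y picks the tile row. Input is a row-major m x m matrix.
__global__ void tileSumKernel(const float* in, float* out, int m, int n) {
    __shared__ float partial[kThreads];
    const int tileRow = blockIdx.y;
    const int tileCol = blockIdx.x;

    // Each thread accumulates a strided subset of the tile's n*n elements.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n * n; i += blockDim.x) {
        const int r = tileRow * n + i / n;
        const int c = tileCol * n + i % n;
        acc += in[r * m + c];
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // Thread 0 writes this tile's sum into the (m/n) x (m/n) output.
    if (threadIdx.x == 0) out[tileRow * gridDim.x + tileCol] = partial[0];
}

// Hypothetical host wrapper: copies data over, launches one block per tile,
// and copies the sums back. Error checking omitted for brevity.
std::vector<float> tileSums(const std::vector<float>& in, int m, int n) {
    const int t = m / n;  // tiles per side; assumes m % n == 0
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, sizeof(float) * m * m);
    cudaMalloc(&dOut, sizeof(float) * t * t);
    cudaMemcpy(dIn, in.data(), sizeof(float) * m * m, cudaMemcpyHostToDevice);

    tileSumKernel<<<dim3(t, t), kThreads>>>(dIn, dOut, m, n);

    std::vector<float> out(t * t);
    cudaMemcpy(out.data(), dOut, sizeof(float) * t * t, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Calling `tileSums(in, 4, 2)` on a 4x4 matrix would launch a 2x2 grid and return four sums in row-major order. For very small n most of the 256 threads sit idle, so a smaller block size (or several tiles per block) may be worth trying.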

Oh crazy, I didn’t realize it could be this simple: just use the local thread ID within each block and do the same reduction as in the single-sum case… Could the threads inside a block also use the same shfl_down trick once fewer than 32 active threads remain (i.e. within a single warp)?
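For what it's worth, here is a rough sketch of how the warp-shuffle variant might look inside each block. It uses `__shfl_down_sync` (the `_sync` form that replaced `__shfl_down` from CUDA 9 onward) to reduce within a warp, then a small shared array to combine the per-warp totals. The names `warpSum`, `tileSumShuffleKernel`, `tileSumsShuffle`, and `kBlock` are hypothetical, the layout is assumed to be row-major `float` with m divisible by n, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;  // assumed block size; must be a multiple of 32

// Sum a value across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ float warpSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// One block per n x n tile, reduced with warp shuffles instead of a
// full shared-memory tree.
__global__ void tileSumShuffleKernel(const float* in, float* out, int m, int n) {
    __shared__ float warpTotals[kBlock / 32];
    const int lane = threadIdx.x % 32;
    const int warp = threadIdx.x / 32;

    // Strided accumulation over this block's tile.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n * n; i += blockDim.x) {
        const int r = blockIdx.y * n + i / n;
        const int c = blockIdx.x * n + i % n;
        acc += in[r * m + c];
    }

    // Stage 1: reduce within each warp, lane 0 stores the warp total.
    acc = warpSum(acc);
    if (lane == 0) warpTotals[warp] = acc;
    __syncthreads();

    // Stage 2: the first warp reduces the per-warp totals.
    if (warp == 0) {
        acc = (lane < kBlock / 32) ? warpTotals[lane] : 0.0f;
        acc = warpSum(acc);
        if (lane == 0) out[blockIdx.y * gridDim.x + blockIdx.x] = acc;
    }
}

// Hypothetical host wrapper, same shape as a plain shared-memory version.
std::vector<float> tileSumsShuffle(const std::vector<float>& in, int m, int n) {
    const int t = m / n;  // tiles per side; assumes m % n == 0
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, sizeof(float) * m * m);
    cudaMalloc(&dOut, sizeof(float) * t * t);
    cudaMemcpy(dIn, in.data(), sizeof(float) * m * m, cudaMemcpyHostToDevice);

    tileSumShuffleKernel<<<dim3(t, t), kBlock>>>(dIn, dOut, m, n);

    std::vector<float> out(t * t);
    cudaMemcpy(out.data(), dOut, sizeof(float) * t * t, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

The full mask `0xffffffff` is only valid here because every lane of the warp reaches each `warpSum` call; if the block size were not a multiple of 32, the mask would need adjusting.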