Hi, I’m new to CUDA programming and wanted to get some ideas on how to go about this algorithm design. I’d like to take an m x m matrix and sum up each n x n block inside it, to end up with (assuming m is divisible by n) an (m/n) x (m/n) matrix of sums. I’ve seen the parallel reduction algorithms that reduce an array to a single sum, but I’m not sure how to efficiently handle this particular case, where there are many independent small reductions. I also realize that you can think of this as an n x n convolution with stride n and a kernel of all ones, but I think I’d like to give a direct approach a try rather than using cuDNN.
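To make the goal concrete, here is a minimal serial sketch of the result I’m after, written as plain CPU code rather than a CUDA kernel. The function name `block_sums`, the `float` element type, and the row-major layout are just assumptions for illustration; the question is how to map this onto threads and blocks efficiently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial CPU reference (not a CUDA kernel): sums each n x n tile of a
// row-major m x m matrix and returns a row-major (m/n) x (m/n) matrix.
// Assumes m is divisible by n, as stated in the question.
std::vector<float> block_sums(const std::vector<float>& in,
                              std::size_t m, std::size_t n) {
    assert(n > 0 && m % n == 0);
    assert(in.size() == m * m);

    const std::size_t out_dim = m / n;
    std::vector<float> out(out_dim * out_dim, 0.0f);

    for (std::size_t row = 0; row < m; ++row) {
        for (std::size_t col = 0; col < m; ++col) {
            // Element (row, col) belongs to tile (row / n, col / n).
            out[(row / n) * out_dim + (col / n)] += in[row * m + col];
        }
    }
    return out;
}
```

For example, a 4 x 4 matrix holding 0..15 in row-major order with n = 2 should give the 2 x 2 result 10, 18, 42, 50.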

Thanks!