I need to sum a 2D matrix by columns. All of the reduction examples I have found reduce only a single vector. I have implemented the matrix reduction using 2D blocks, and it seems to work, but I’m not sure it is the best approach, and the pointer arithmetic inside the kernel gets complicated.

Are there examples of reducing a 2D matrix by columns (or by rows, for that matter)? Would it make more sense to launch a separate kernel for each column and reduce it as a vector, since kernel launches are asynchronous? I’m new to CUDA programming, so any suggestions or comments would be appreciated.
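To make the question concrete, here is a minimal sketch of the simplest column-sum baseline I can think of. It is not my actual 2D-block code. It assumes a row-major `float` matrix, and the names `colSumKernel` and `columnSums` are hypothetical, chosen only for illustration. Each thread owns one column and loops over the rows, so the only index arithmetic is `row * cols + col`.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <stdexcept>
#include <vector>

// Assumption: the matrix is stored row-major, so element (row, col)
// lives at in[row * cols + col].
// One thread per column; each thread walks down its own column.
// On every loop iteration, neighbouring threads read neighbouring
// addresses, so the global loads are coalesced.
__global__ void colSumKernel(const float* in, float* out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;  // guard the partially filled last block

    float sum = 0.0f;
    for (int row = 0; row < rows; ++row)
        sum += in[static_cast<size_t>(row) * cols + col];
    out[col] = sum;
}

// Throw on any CUDA runtime error so failures are not silently ignored.
static void check(cudaError_t err)
{
    if (err != cudaSuccess)
        throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical host wrapper: copies the matrix to the device, launches
// the kernel, and returns one sum per column.
std::vector<float> columnSums(const std::vector<float>& m, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    assert(m.size() == static_cast<size_t>(rows) * cols);

    const size_t inBytes  = m.size() * sizeof(float);
    const size_t outBytes = static_cast<size_t>(cols) * sizeof(float);

    float* dIn  = nullptr;
    float* dOut = nullptr;
    check(cudaMalloc(&dIn, inBytes));
    check(cudaMalloc(&dOut, outBytes));
    check(cudaMemcpy(dIn, m.data(), inBytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks  = (cols + threads - 1) / threads;  // round up
    colSumKernel<<<blocks, threads>>>(dIn, dOut, rows, cols);
    check(cudaGetLastError());

    std::vector<float> result(cols);
    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(result.data(), dOut, outBytes, cudaMemcpyDeviceToHost));

    check(cudaFree(dIn));
    check(cudaFree(dOut));
    return result;
}
```

As far as I can tell, this only keeps the GPU busy when there are many columns, because the number of threads equals the number of columns. For a tall, narrow matrix I would expect some form of per-column parallel reduction to be needed, which is where my indexing gets messy.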

Thanks