Matrix Reduction

I have a need to sum a 2D matrix by columns. All of the reduction examples that I have found just reduce a vector. I have implemented the matrix reduction using 2D blocks, and it seems to work OK, but I’m not sure if this is the best way. The pointer arithmetic inside the kernel gets complicated.

Are there examples of reducing a 2D matrix by columns, or rows for that matter? Would it make more sense to launch multiple kernels that each reduce a column as a vector since the kernels can launch asynchronously? I’m new to CUDA programming, so any suggestions or comments would be appreciated.


One approach would be to treat the matrix as a vector and perform a partial reduction on subvectors, where each subvector corresponds to a column. For example, if you had a 1024x1024 matrix, launch 1024 blocks, where each block performs a reduction on a 1024-element subvector. It should be straightforward to adapt the highly optimized SDK reduction kernel to do that.
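As a rough sketch of that idea, assuming column-major storage so each column is a contiguous subvector (the kernel name, launch parameters, and block size here are illustrative, not taken from the SDK sample):

```cuda
// Sketch: one block per column of an M-row matrix stored column-major,
// so each block reads a contiguous M-element subvector.
__global__ void reduceColumns(const float *in, float *out, int M)
{
    extern __shared__ float sdata[];
    const float *col = in + blockIdx.x * M;  // columns are contiguous

    // Each thread accumulates a strided partial sum down the column.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < M; i += blockDim.x)
        sum += col[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction (blockDim.x a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];  // one sum per column
}
```

A launch for N columns would then look something like `reduceColumns<<<N, 256, 256 * sizeof(float)>>>(d_in, d_out, M);`.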

This makes sense. So with this approach, the blockIdx.x value could be used to determine the column that the particular block is to reduce, right? This means that the number of columns would be limited to 65535, correct? That would be fine for my application. Please let me know if I misunderstood. Thanks for your help.

If you choose a 1D grid, then yes.

No, you could have 65535*65535 columns if you wanted; there is nothing stopping you from using a 2D grid. The kernel only has to know the length of each column (plus any padding) in words. You could construct the column number from both grid dimensions just as easily as from one. It makes no difference to the operation of the reduction kernel, which returns one value per block.

Pardon me if I am being dense, but isn’t one dimension of the grid going to be needed if the number of rows exceeds the number of threads that can fit in a block? I tried to post some sample code here, but the forum kept kicking me back to the main forum page. The sample was similar to what is found in one of the reduction example kernels in the SDK.

My data set has 26280 rows and 3648 columns. Using the reduction SDK example, I would use the grid.y dimension to march down the rows in the matrix and perform the full reduction using multiple kernel calls. The grid.x dimension could then be used to identify the appropriate column. Am I still missing the point? Thanks.

It might be helpful if you think of it this way.

Your “matrix” is really a big block of linear memory. If it is in row-major order, then the linear block contains all the rows laid out one after another. If it is in column-major order, then all the columns are laid out one after another. To read a column of an MxN “matrix”, a block only needs to read M words, either with a pitch of N if the storage is in row-major order, or contiguously if in column-major order (with the latter it is probably somewhat simpler to achieve fully coalesced reads).

To perform a column-wise reduction, the threads of any block only need to know which column they are supposed to read (calculated from the blockIdx values) and how the columns of the matrix are laid out in memory. Each block performs the reduction in a single kernel pass down to a single value and then writes the result back into global memory. There need be no correspondence between the grid layout and the shape of the matrix. The only requirement is that there be a block for every column in the matrix. Whether a one-dimensional or two-dimensional grid is used is irrelevant.
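One way to sketch that arrangement, assuming row-major storage and flattening a 2D grid into a column index so the column count is not limited to 65535 (all names and the flattening scheme here are illustrative):

```cuda
// Sketch: column-wise reduction of a row-major MxN matrix.
// One block per column; the 2D grid is flattened into a column index.
__global__ void reduceColumnsRowMajor(const float *in, float *out,
                                      int M, int N)
{
    extern __shared__ float sdata[];

    // Every thread in a block computes the same column index.
    int col = blockIdx.y * gridDim.x + blockIdx.x;
    if (col >= N) return;  // whole block exits together

    // Element (i, col) lives at in[i * N + col]: pitch of N words.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < M; i += blockDim.x)
        sum += in[i * N + col];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory; one value per block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[col] = sdata[0];
}
```

Note that the grid shape only has to supply enough blocks to cover N columns; it has nothing to do with the matrix shape itself.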

I see your point now. With this approach, I would need to loop 26280 times per kernel to fully reduce each column with a single kernel call.

I was going to try to reduce the columns using multiple blocks, with each block reducing part of the column in parallel. It is the multi-block approach that requires multiple kernel calls.

Thanks for helping me to think this through.

You need to read 26280 values, but you don’t need anything like that much iteration. More like 26280/(threads_per_block) iterations per block. But I am glad you get the idea, anyway.