Matrix Reduction

I have a need to sum a 2D matrix by columns. All of the reduction examples that I have found just reduce a vector. I have implemented the matrix reduction using 2D blocks, and it seems to work OK, but I’m not sure if this is the best way. The pointer arithmetic inside the kernel gets complicated.

Are there examples of reducing a 2D matrix by columns, or rows for that matter? Would it make more sense to launch multiple kernels that each reduce a column as a vector since the kernels can launch asynchronously? I’m new to CUDA programming, so any suggestions or comments would be appreciated.


One approach would be to treat the matrix as a vector and perform a partial reduction on subvectors, where each subvector corresponds to a column. For example, if you had a 1024x1024 matrix, launch 1024 blocks, where each block performs a reduction on a 1024-element subvector. It should be straightforward to adapt the highly optimized SDK reduction kernel to do that.
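As a rough sketch of that idea, assuming column-major storage so each column is a contiguous subvector (the kernel name, launch parameters, and block size here are illustrative, not taken from the SDK sample):

```cuda
// Sketch: one block per column of an M-row matrix stored column-major,
// so each block reads a contiguous M-element subvector.
__global__ void reduceColumns(const float *in, float *out, int M)
{
    extern __shared__ float sdata[];
    const float *col = in + blockIdx.x * M;  // columns are contiguous

    // Each thread accumulates a strided partial sum down the column.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < M; i += blockDim.x)
        sum += col[i];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction (blockDim.x a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];  // one sum per column
}
```

A launch for N columns would then look something like `reduceColumns<<<N, 256, 256 * sizeof(float)>>>(d_in, d_out, M);`.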

This makes sense. So with this approach, the blockIdx.x value could be used to determine the column that the particular block is to reduce, right? This means that the number of columns would be limited to 65535, correct? That would be fine for my application. Please let me know if I misunderstood. Thanks for your help.

If you choose a 1D grid, then yes.

No, you could have 65535*65535 columns if you wanted; there is nothing stopping you from using a 2D grid. The kernel only has to know the length of each column (plus any padding) in words. You could construct the column number from both grid dimensions just as easily as from one. It makes no difference to the operation of the reduction kernel, which returns one value per block.

Pardon me if I am being dense, but isn’t one dimension of the grid going to be needed if the number of rows exceeds the number of threads that can fit in a block? I tried to post some sample code here, but the forum kept kicking me back to the main forum page. The sample was similar to what is found in one of the reduction example kernels in the SDK.

My data set has 26280 rows and 3648 columns. Using the reduction SDK example, I would use the grid.y dimension to march down the rows in the matrix and perform the full reduction using multiple kernel calls. The grid.x dimension could then be used to identify the appropriate column. Am I still missing the point? Thanks.

It might be helpful if you think of it this way.

Your “matrix” is really a big block of linear memory. If it is in row-major order, then the linear block contains all the rows laid out one after another. If it is in column-major order, then all the columns are laid out one after another. To read a column of an MxN “matrix”, a block only needs to read M words, either with a pitch of N if the storage is in row-major order, or contiguously if in column-major order (with the latter it is probably somewhat simpler to achieve fully coalesced reads).

To perform a column-wise reduction, the threads of any block only need to know which column they are supposed to read (calculated from the blockIdx values) and how the columns of the matrix are laid out in memory. Each block performs the reduction in a single kernel pass down to a single value and then writes the result back into global memory. There need be no correspondence between the grid layout and the shape of the matrix. The only requirement is that there be a block for every column in the matrix. Whether a one-dimensional or two-dimensional grid is used is irrelevant.
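One way to sketch that arrangement, assuming row-major storage and flattening a 2D grid into a column index so the column count is not limited to 65535 (all names and the flattening scheme here are illustrative):

```cuda
// Sketch: column-wise reduction of a row-major MxN matrix.
// One block per column; the 2D grid is flattened into a column index.
__global__ void reduceColumnsRowMajor(const float *in, float *out,
                                      int M, int N)
{
    extern __shared__ float sdata[];

    // Every thread in a block computes the same column index.
    int col = blockIdx.y * gridDim.x + blockIdx.x;
    if (col >= N) return;  // whole block exits together

    // Element (i, col) lives at in[i * N + col]: pitch of N words.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < M; i += blockDim.x)
        sum += in[i * N + col];
    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory; one value per block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[col] = sdata[0];
}
```

Note that the grid shape only has to supply enough blocks to cover N columns; it has nothing to do with the matrix shape itself.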

I see your point now. With this approach, I would need to loop 26280 times per kernel to fully reduce each column with a single kernel call.

I was going to try to reduce the columns using multiple blocks, with each block reducing part of the column in parallel. It is the multi-block approach that requires multiple kernel calls.

Thanks for helping me to think this through.

You need to read 26280 values, but you don’t need anything like that much iteration. More like 26280/(threads_per_block) iterations per block. But I am glad you get the idea, anyway.