Multiplying arbitrarily sized matrices


I am having trouble handling 2D matrices whose dimensions are not a multiple of the block size.

My setup code looks like this:

[codebox]int wBlocks = P.width / BLOCK_SIZE + (P.width % BLOCK_SIZE ? 1 : 0);
int hBlocks = P.height / BLOCK_SIZE + (P.height % BLOCK_SIZE ? 1 : 0);

dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(wBlocks, hBlocks);

MatrixMulKernel<<<grid, threads>>>(Md, Nd, Pd);[/codebox]

I have tried all kinds of things in my kernel to keep the extra threads from computing anything or doing anything bad, like this:


[codebox]int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

if (idx < P.width && idy < P.height)
{
    // Compute stuff
}[/codebox]


I’m not sure what to do at this point. All the documentation I’ve found either glosses over the subject of multiplying matrices whose dimensions are not a multiple of the block size, or doesn’t address it at all.

You could use D/SGEMM in CUBLAS; it is faster than the MatrixMul example and has no restrictions on matrix sizes.

You can always zero-pad your matrices so that each dimension is an integer multiple of the block size.
For instance, if the block size is 4 × 4 and matrices A and B are 7 × 5 and 5 × 6,
simply pad them both to 8 × 8, multiply A*B normally, and your answer will be in the top-left 7 × 6 corner of the 8 × 8 result.
Everything else will be zero, of course.

I don’t think you even lose any performance by zero-padding, since you were going to launch those extra thread blocks anyway.
Hope that helps!

This should work. Make sure, however, that you do not call __syncthreads() anywhere inside the conditional code: __syncthreads() is a barrier that only works if every thread in the block reaches it.
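(For a tiled kernel, the usual way to satisfy that rule is to keep both __syncthreads() calls unconditional: every thread loads a tile element, substituting zero when its index is out of range, and hits the barrier; only the final store is guarded. A sketch, assuming the Matrix struct with width, height, and elements fields used by the SDK MatrixMul example:)

```cuda
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    float sum = 0.0f;

    // M.width == N.height is the shared dimension; walk it tile by tile.
    for (int t = 0; t < (M.width + BLOCK_SIZE - 1) / BLOCK_SIZE; ++t) {
        int mCol = t * BLOCK_SIZE + threadIdx.x;
        int nRow = t * BLOCK_SIZE + threadIdx.y;

        // Every thread loads; out-of-range threads load zeros instead of
        // returning early, so the barriers below are reached by all threads.
        Ms[threadIdx.y][threadIdx.x] = (row < M.height && mCol < M.width)
                                           ? M.elements[row * M.width + mCol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] = (nRow < N.height && col < N.width)
                                           ? N.elements[nRow * N.width + col] : 0.0f;

        __syncthreads();  // unconditional: all threads in the block arrive

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];

        __syncthreads();  // unconditional: before the next tile is loaded
    }

    // Only the store is guarded.
    if (row < P.height && col < P.width)
        P.elements[row * P.width + col] = sum;
}
```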