cuBLAS: padding matrices with zeros

I’ve noticed that cuBLAS DGEMM is noticeably faster when the dimensions of the input matrices are multiples of 32 (the magic number might be something else, e.g. 16 or 64, but I haven’t tried narrowing it down yet). Given that this is the case, is it a good idea to extend the matrix size by padding the entries with zeros? Are there routines that automatically take care of this process on the GPU?

EDIT: I’ve noticed that the MAGMA library optimizes well for square matrices, but the same can’t be said for rectangular ones.

Just use cudaMemset() to zero the padded device array first, then copy the “valid” data on top of the zeroed array.
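
A minimal sketch of that approach, assuming column-major storage and an m x n host matrix h_A whose padded dimensions are rounded up to multiples of 32 (the sizes and names here are placeholders, not from any particular codebase):

```c
/* Sketch: zero a padded device array, then overlay the valid data. */
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    int m = 1000, n = 500;             /* true matrix dimensions (column-major) */
    int mp = (m + 31) / 32 * 32;       /* rows rounded up to a multiple of 32   */
    int np = (n + 31) / 32 * 32;       /* cols rounded up to a multiple of 32   */

    double *h_A = (double*)calloc((size_t)m * n, sizeof(double));
    double *d_A;
    cudaMalloc((void**)&d_A, (size_t)mp * np * sizeof(double));

    /* zero the whole padded array first ... */
    cudaMemset(d_A, 0, (size_t)mp * np * sizeof(double));

    /* ... then copy the m x n valid block on top of it; cudaMemcpy2D
       handles the differing pitches (mp vs. m doubles per column). */
    cudaMemcpy2D(d_A, mp * sizeof(double),   /* dst, dst pitch in bytes     */
                 h_A, m  * sizeof(double),   /* src, src pitch in bytes     */
                 m * sizeof(double), n,      /* width in bytes, height      */
                 cudaMemcpyHostToDevice);

    cudaFree(d_A);
    free(h_A);
    return 0;
}
```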

Suppose you compute C = A * B. Have you tried setting lda (the leading dimension of A) and ldc to a multiple of 32, even when the dimensions of A and C themselves are not multiples of 32?
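
A minimal sketch of that idea, assuming the cuBLAS v2 API (cublas_v2.h) and placeholder dimensions. Because m, n, and k keep their true values and only the leading dimensions are rounded up, the padding rows are never read, so no zero-fill or result-extraction step is needed:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void)
{
    int m = 1000, n = 500, k = 700;    /* true problem size                  */
    int lda = (m + 31) / 32 * 32;      /* padded leading dimension of A, C   */
    int ldb = (k + 31) / 32 * 32;      /* padded leading dimension of B      */
    int ldc = lda;

    double *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, (size_t)lda * k * sizeof(double));
    cudaMalloc((void**)&d_B, (size_t)ldb * n * sizeof(double));
    cudaMalloc((void**)&d_C, (size_t)ldc * n * sizeof(double));
    /* ... fill d_A and d_B, e.g. with cudaMemcpy2D as in the sketch above ... */

    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    /* C = A * B with the true m, n, k; only lda/ldb/ldc are aligned to 32. */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, lda,
                        d_B, ldb,
                &beta,  d_C, ldc);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```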

CUBLAS DGEMM shows large performance variation when you sweep the dimension N.

You can use Volkov’s code instead; its performance is more uniform across sizes.