I’ve noticed that cuBLAS `dgemm` is noticeably faster when the dimensions of the input matrices are multiples of 32 (it might be some other number, e.g. 16 or 64, but I haven’t narrowed it down yet). Given this, is it a good idea to extend the matrices by padding the entries with zeroes? Are there routines that automatically take care of this process on the GPU?
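For reference, the padding idea itself is mathematically safe: if you zero-pad A and B up to the next multiple of 32, the true product sits in the top-left corner of the padded product, because the extra rows and columns contribute nothing. Here's a minimal host-side sketch in NumPy (the `pad_to_multiple` helper and the block size of 32 are my own illustration, not a cuBLAS routine); on the GPU you'd do the equivalent with `cudaMemset`/`cudaMemcpy2D` into a padded allocation before calling `cublasDgemm`:

```python
import numpy as np

def pad_to_multiple(a, block=32):
    """Zero-pad a 2-D matrix so both dimensions are multiples of `block`."""
    m, n = a.shape
    mp = -(-m // block) * block   # ceil(m / block) * block
    npad = -(-n // block) * block
    out = np.zeros((mp, npad), dtype=a.dtype)
    out[:m, :n] = a               # original data in the top-left corner
    return out

# The padded product contains the true product in its top-left block:
A = np.random.rand(100, 70)
B = np.random.rand(70, 50)
Ap, Bp = pad_to_multiple(A), pad_to_multiple(B)   # (128, 96) and (96, 64)
C = (Ap @ Bp)[:A.shape[0], :B.shape[1]]           # equals A @ B
```

Note that `cublasDgemm` already takes separate leading-dimension arguments (`lda`, `ldb`, `ldc`), so you may be able to get part of the alignment benefit just by allocating storage with a padded leading dimension while keeping the logical m/n/k unchanged, without copying into a fully padded matrix; whether that recovers the full speedup is something I'd benchmark.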

EDIT: I’ve noticed that the MAGMA library optimizes well for square matrices, but the same can’t be said for rectangular ones.