Arbitrary-size matrix multiplication

Does anyone know of a fast matrix multiplication algorithm or code for the GPU that works for arbitrary matrix sizes?

The matrix multiplication example from the SDK seems to work only when the input matrix dimensions are multiples of 16. For example, if the input matrix is 127x127, it returns wrong results.

As far as I remember, the SDK example uses shared memory to do the multiplication, hence the multiple-of-16 restriction. You could pad the matrices up to the nearest larger size that is a multiple of 16. For a 127x127 multiplication, define the matrices as 128x128, with row 127 and column 127 (indexing from zero) filled with zeros. Because each element of the product is a dot product between a row of the first matrix and a column of the second, the extra zeros add nothing to those sums, so the top-left 127x127 block of the result should be correct. But I don't remember exactly how the SDK example is constructed.
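
To make the padding idea concrete, here is a rough host-side sketch in plain C++. It is not taken from the SDK: the names kTile, roundUpToTile, padMatrix, cropMatrix and multiplyReference are made up for illustration, the matrices are assumed to be stored row-major, and the CPU loop in multiplyReference is only a stand-in for whatever GPU kernel you actually call on the padded buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tile width the kernel is assumed to require (16, as discussed in this thread).
constexpr std::size_t kTile = 16;

// Round n up to the next multiple of kTile, e.g. 127 -> 128, 128 -> 128.
std::size_t roundUpToTile(std::size_t n) {
    return (n + kTile - 1) / kTile * kTile;
}

// Copy a row-major rows x cols matrix into a zero-filled
// paddedRows x paddedCols buffer (original data in the top-left corner).
std::vector<float> padMatrix(const std::vector<float>& src,
                             std::size_t rows, std::size_t cols,
                             std::size_t paddedRows, std::size_t paddedCols) {
    assert(src.size() == rows * cols);
    assert(paddedRows >= rows && paddedCols >= cols);
    std::vector<float> dst(paddedRows * paddedCols, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[r * paddedCols + c] = src[r * cols + c];
    return dst;
}

// Extract the top-left rows x cols block of a padded row-major matrix.
std::vector<float> cropMatrix(const std::vector<float>& src,
                              std::size_t paddedCols,
                              std::size_t rows, std::size_t cols) {
    assert(paddedCols >= cols);
    assert(src.size() >= rows * paddedCols);
    std::vector<float> dst(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            dst[r * cols + c] = src[r * paddedCols + c];
    return dst;
}

// Plain CPU product C (m x n) = A (m x k) * B (k x n), row-major.
// Stand-in for the GPU kernel; replace with your own launch.
std::vector<float> multiplyReference(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t m, std::size_t k,
                                     std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = sum;
        }
    return c;
}
```

The intended flow would be: compute the padded sizes with roundUpToTile, pad both inputs with padMatrix, run the multiplication on the padded buffers, then cropMatrix the result back to the original size. The cost is some extra memory and a few wasted multiply-adds on zeros, which is small when the matrices are much larger than one tile.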