Does anyone know of a fast arbitrary-size matrix multiplication algorithm/code for the GPU?

The matrix multiplication example from the SDK seems to work only when the input matrix dimensions are a multiple of 16. For example, with a 127×127 input it returns wrong results.
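For context, the SDK kernel tiles the matrices into 16×16 blocks of shared memory and assumes every tile is full, so threads past the matrix edge read and write garbage when the size is not a multiple of 16. A common fix is to add bounds checks: out-of-range threads load zeros into shared memory and skip the final store. This is only a sketch of that idea (kernel name, `TILE` constant, and the square-matrix/row-major layout are my assumptions, not the SDK's actual code):

```cuda
#define TILE 16

// Tiled matrix multiply C = A * B for an arbitrary N x N size.
// Out-of-range threads load 0.0f into shared memory (harmless in
// the dot product) and skip the store, so N need not be a
// multiple of TILE.
__global__ void matMulBounded(const float *A, const float *B,
                              float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Round the tile count up so partial tiles at the edge are covered.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}

// Launch with a grid rounded up the same way, e.g. for N = 127:
//   dim3 block(TILE, TILE);
//   dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
//   matMulBounded<<<grid, block>>>(dA, dB, dC, N);
```

An alternative that keeps the SDK kernel unchanged is to zero-pad both inputs on the host up to the next multiple of 16 (128×128 here), multiply, and copy back only the 127×127 block; the extra zero rows and columns do not affect the product.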