Matrix Multiplication with Shared Memory

I have written matrix multiplication code that uses shared memory, based on the example in the CUDA Programming Guide. I would like to extend it to matrices of arbitrary size (not just multiples of the block size). How can I do this in the shared memory case, given that the multiplication is done block by block?

Thanks in advance.