Optimizing a kernel depending on problem size

Hey there,
let’s assume I have a problem like a matrix multiplication. When the problem size is quite small, keeping everything in shared memory would be the most efficient way to calculate the resulting matrix, but shared memory is limited in size. So when that limit is exceeded, one has to use global memory. Is there an easy way to handle both cases in ONE code? I think the code for calling the kernel (block size, grid size) also has to be adapted, right?
Which package could I use for things like matrix multiplication or vector reductions? Do such packages automatically pick the faster memory depending on the problem size?

Matrix multiplication can also be performed in tiles, which can be chosen small enough to still fit into shared memory regardless of the matrix size.
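
To make the tiling idea concrete, here is a minimal sketch (not tuned for performance). The names matmulTiled, matmulTiledHost and TILE are made up for illustration, and row-major storage is assumed. The block size stays fixed and only the grid size is derived from the matrix dimensions, so the same code covers small and large problems:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Tile edge (illustrative choice): 2 tiles * 16 * 16 floats = 2 KiB of shared memory per block.
constexpr int TILE = 16;

// C (M x N) = A (M x K) * B (K x N), all matrices row-major.
__global__ void matmulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk along the K dimension one tile at a time.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are loaded as zero so edge tiles stay correct.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the tiles are overwritten
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host wrapper: copies inputs to the device, launches the kernel,
// copies the result back. Returns the launch/copy-back status only; allocation
// errors are not checked, to keep the sketch short.
cudaError_t matmulTiledHost(const std::vector<float>& A, const std::vector<float>& B,
                            std::vector<float>& C, int M, int N, int K)
{
    C.resize(static_cast<size_t>(M) * N);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Fixed block size; the grid grows with the problem size.
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    matmulTiled<<<grid, block>>>(dA, dB, dC, M, N, K);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Shared memory use per block depends only on TILE, not on M, N or K, which is why the shared memory limit never forces a fallback to a global-memory-only kernel.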

Oh okay, true, I was just confused. So do you know a good package that does this efficiently? I don’t want to reinvent the wheel every time :)

CUDA ships with cuBLAS, which contains a pretty good gemm() implementation for matrix-matrix multiplication.
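
A rough sketch of a single-precision call, in case it helps. The wrapper name matmulCublas is made up, and the host data is assumed to be row-major. cuBLAS itself expects column-major storage, so the sketch computes C^T = B^T * A^T by swapping the operands, which yields row-major C without any explicit transpose:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical wrapper: C (M x N) = A (M x K) * B (K x N), row-major host data.
// Returns true if the gemm call and the copy back both succeeded;
// allocation errors are not checked, to keep the sketch short.
bool matmulCublas(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int M, int N, int K)
{
    C.resize(static_cast<size_t>(M) * N);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return false;

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    // Column-major view: row-major B is an N x K matrix with leading dimension N,
    // row-major A is a K x M matrix with leading dimension K, and the result is
    // an N x M matrix with leading dimension N, i.e. row-major C.
    cublasStatus_t st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    N, M, K,
                                    &alpha, dB, N, dA, K,
                                    &beta, dC, N);

    // This blocking copy also waits for the gemm on the default stream.
    cudaError_t err = cudaMemcpy(C.data(), dC, C.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cublasDestroy(handle);
    return st == CUBLAS_STATUS_SUCCESS && err == cudaSuccess;
}
```

This needs to be linked against cuBLAS, e.g. by passing -lcublas to nvcc. In real code you would create the handle once and reuse it rather than creating it per call.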

Okay, thanks so far. It is probably better to use it than to implement it myself. One would probably not reach the performance of an existing implementation from NVIDIA :)