Question about the matrixMul and convolution SDK examples

I have a question about those examples. In both of them, the dimensions of the matrices or of the filter kernel were fixed, so it was pretty straightforward how to use shared memory. My question is whether it is possible to use shared memory the same way with arbitrarily sized matrices or filter kernels.

I have been thinking about this problem quite a lot, because I have to implement functions with arbitrarily sized matrices and filter kernels, and I cannot find a feasible solution. What would the performance penalty be if I implemented everything using texture memory instead of shared memory? I know there is a graph comparing those two approaches in the convolutionSeparable SDK example, but does it still reflect current hardware? My solution is supposed to run exclusively on the Fermi architecture. What are your thoughts?

Thanks in advance


For small kernel sizes, you can try out the shared-memory approach.
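To illustrate what "small" buys you here, below is a rough sketch of a 1D convolution that stages an input tile plus halo in shared memory and keeps the filter coefficients in constant memory. This is not code from the SDK sample; the names (`TILE_W`, `MAX_KERNEL_RADIUS`, `d_Kernel`) and the border-clamping policy are my own assumptions. The key point is that the shared-memory tile must be sized at compile time, so the filter radius needs a known upper bound — which is exactly why this breaks down for arbitrarily sized kernels.

```cuda
#define TILE_W            256  // assumed block width
#define MAX_KERNEL_RADIUS 8    // assumed upper bound for the "small" case

// Filter coefficients in constant memory (copied from the host beforehand).
__constant__ float d_Kernel[2 * MAX_KERNEL_RADIUS + 1];

__global__ void conv1D_shared(const float *in, float *out, int n, int radius)
{
    // Tile plus halo on both sides; size must be fixed at compile time,
    // hence the hard cap on the filter radius.
    __shared__ float tile[TILE_W + 2 * MAX_KERNEL_RADIUS];

    const int gx        = blockIdx.x * blockDim.x + threadIdx.x;
    const int tileStart = blockIdx.x * blockDim.x - radius;

    // Cooperatively load tile + halo, clamping out-of-range reads to the border.
    for (int i = threadIdx.x; i < blockDim.x + 2 * radius; i += blockDim.x) {
        int src = tileStart + i;
        src = min(max(src, 0), n - 1);
        tile[i] = in[src];
    }
    __syncthreads();

    if (gx < n) {
        float sum = 0.0f;
        for (int k = -radius; k <= radius; ++k)
            sum += tile[threadIdx.x + radius + k] * d_Kernel[radius + k];
        out[gx] = sum;
    }
}
```

Launched with `blockDim.x == TILE_W` and `radius <= MAX_KERNEL_RADIUS`, each input element is read from global memory once per block instead of once per output element that touches it.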

You can’t fit arbitrarily sized matrices or filter kernels into the limited shared memory, so textures can help you there. But since Fermi has an L1/L2 cache as well, it is worth benchmarking a plain global-memory implementation against a texture implementation.
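For that benchmark, a minimal pair of kernels might look like the sketch below: one reads the input through ordinary global-memory loads (served by Fermi's L1/L2), the other through a `tex1Dfetch` on a bound texture reference (the pre-CUDA-5 texture API, which is what Fermi-era toolkits used). The kernel array `kernel`, the clamping policy, and all names here are assumptions for illustration, not the SDK sample's code.

```cuda
// Texture reference for the input array (old-style API, fine on Fermi toolkits).
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void conv1D_global(const float *in, float *out, int n, int radius,
                              const float *kernel)
{
    int gx = blockIdx.x * blockDim.x + threadIdx.x;
    if (gx >= n) return;
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        int src = min(max(gx + k, 0), n - 1);
        sum += in[src] * kernel[radius + k];   // goes through Fermi's L1/L2
    }
    out[gx] = sum;
}

__global__ void conv1D_texture(float *out, int n, int radius,
                               const float *kernel)
{
    int gx = blockIdx.x * blockDim.x + threadIdx.x;
    if (gx >= n) return;
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        int src = min(max(gx + k, 0), n - 1);
        sum += tex1Dfetch(texIn, src) * kernel[radius + k];  // texture cache path
    }
    out[gx] = sum;
}

// Host side, before launching the texture kernel:
//   cudaBindTexture(0, texIn, d_in, n * sizeof(float));
// then time both kernels with cudaEvent_t pairs and compare.
```

Note that neither version needs a compile-time bound on the filter radius, which is what makes these two paths attractive for arbitrarily sized kernels.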

Please post the results you obtain from this experiment, and the code as well if possible.