Declare __shared__ memory in __device__ function

Hello guys,

Can I declare and use __shared__ memory inside a __device__ function? Most of the examples I see declare shared memory directly in __global__ functions (kernels). In some cases, though, declaring it inside a __device__ function would make the solution much more convenient.

For my project, I need matrix multiplication inside every kernel. I originally thought about calling the CUBLAS library from each kernel, much like calling BLAS from MKL in each of my CPU threads, but that seems unrealistic, since I would have to create a CUBLAS handle in each kernel. That is why I am considering writing the matrix-matrix multiplication myself as a __device__ function. However, to get good performance from the multiplication, shared memory has to be used, which means declaring the shared memory inside the __device__ function.
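To make the question concrete, here is a minimal sketch of the kind of thing I have in mind. It is only an illustration under my own assumptions: square row-major matrices, a tile width of 16, a 16x16 thread block, and every thread of the block calling the function (it uses __syncthreads()). The names tiledDot, matmulKernel, maxAbsError and TILE are hypothetical, and error checking of the CUDA calls is left out for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; the block must be launched as (TILE, TILE)

// Hypothetical device function: returns element (row, col) of C = A * B for
// n x n row-major matrices. Every thread of the block must call it, including
// threads whose (row, col) is out of range, because it calls __syncthreads().
__device__ float tiledDot(const float* A, const float* B, int n, int row, int col)
{
    // Statically sized shared memory declared inside the __device__ function.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Load one tile of A and one tile of B, padding with zeros at the edges.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before anyone reads them
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    return acc;
}

__global__ void matmulKernel(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float v = tiledDot(A, B, n, row, col);  // called unconditionally by all threads
    if (row < n && col < n) C[row * n + col] = v;
}

// Host helper: runs the kernel on small test data and returns the largest
// absolute difference from a plain CPU triple loop.
float maxAbsError(int n)
{
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, 0.0f);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (i % 7) * 0.5f;
        hB[i] = (float)(i % 5) - 2.0f;
    }
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmulKernel<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    float err = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[i * n + k] * hB[k * n + j];
            err = fmaxf(err, fabsf(ref - hC[i * n + j]));
        }
    return err;
}

int main()
{
    printf("max abs error vs CPU: %g\n", maxAbsError(50));
    return 0;
}
```

In this sketch the shared arrays have a compile-time size, so nothing extra is passed at launch. What I am unsure about is whether this pattern is considered good practice, and how the shared memory adds up if a kernel calls several such __device__ functions.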

I wonder whether such a declaration is allowed and practical. Thanks for any advice!