If you are asking whether you can call CUBLAS from inside a kernel, then the answer is no.
That might sound restrictive, but in practice it doesn't have to be. My approach has been to use the host to coordinate the main stages of a given algorithm, with each stage being a separate kernel or CUBLAS call, while keeping as much data resident on the GPU as possible to minimize PCI-e bus traffic. If your algorithm needs intermediate results from the device inside the host loop, there are several techniques to hide PCI-e latency and bandwidth limitations: queue kernels asynchronously, overlap copying with kernel execution, use zero-copy memory access (if your GPU supports it), and perform host-side work in parallel with the GPU.
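As a rough illustration of that host-coordination pattern, here is a minimal sketch. The stage kernel `my_stage_kernel` and the convergence check are hypothetical placeholders; error checking is omitted for brevity. The point is that the working array stays on the device across iterations, and only a single scalar (read back by the CUBLAS reduction) crosses the bus each pass:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical per-element stage; stands in for one step of your algorithm.
__global__ void my_stage_kernel(float *d_x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_x[i] *= 0.5f;
}

void run_pipeline(float *h_x, int n, int iters)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cublasSetStream(handle, stream);

    // Move the data to the device once; it stays there for the whole loop.
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    for (int k = 0; k < iters; ++k) {
        // Stage 1: custom kernel operating on device-resident data.
        my_stage_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);

        // Stage 2: CUBLAS call on the same device data -- no round trip.
        float norm = 0.0f;
        cublasSasum(handle, n, d_x, 1, &norm);

        // Only this scalar crosses the PCI-e bus for host-side loop control.
        if (norm < 1.0e-6f) break;
    }

    // Copy the final result back once, after the loop.
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaStreamDestroy(stream);
    cublasDestroy(handle);
}
```

With the default host pointer mode, `cublasSasum` blocks until the result is available on the host, so the convergence test is safe; a fully asynchronous variant would use `cudaMemcpyAsync` and events to overlap copies with the next kernel launch.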
There are a couple of good papers (one on LAPACK-style dense matrix factorization by V. Volkov from UC Berkeley, and one from M. Fatica of NVIDIA on LINPACK benchmark acceleration) which show how effective this process can be.