What is the state of device-side libraries?

I’ve seen that calling cuBLAS from kernels has been deprecated since CUDA 10.0. Is calling library routines from device code still possible, and is there an ecosystem of device-function libraries?

For example, I have a problem where I run hundreds of thousands of independent estimations via an EM algorithm. Each estimation involves one LU decomposition of a matrix, followed by solving a linear system some number of times. I have written my own functions for these, but it seems odd that I wouldn’t have ready access to BLAS- or LAPACK-type libraries from device code.

I have tried searching, but it is difficult to distinguish between GPU-accelerated linear algebra libraries that are called from the host and libraries meant to be called from device code.
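To make the use case concrete, here is a minimal sketch of the kind of hand-written per-thread routine in question. It is only an illustration under stated assumptions: matrices are small, fixed-size, and row-major, and one thread handles one independent problem. The size `N` and the names `HD`, `lu_factor`, `lu_solve`, and `batched_solve` are hypothetical and do not come from any library.

```cpp
#include <assert.h>
#include <math.h>

// Under nvcc, make the routines callable from host and device code.
// Compiled as plain C++, the macro expands to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

constexpr int N = 4;  // assumed small compile-time matrix size (hypothetical)

// In-place LU factorization with partial pivoting of a row-major N x N matrix.
// piv[k] records which row was swapped with row k at step k.
// Returns false if an exactly zero pivot column is found (singular matrix).
HD inline bool lu_factor(double* a, int* piv) {
    for (int k = 0; k < N; ++k) {
        // Find the largest-magnitude entry in column k at or below the diagonal.
        int p = k;
        double best = fabs(a[k * N + k]);
        for (int i = k + 1; i < N; ++i) {
            double v = fabs(a[i * N + k]);
            if (v > best) { best = v; p = i; }
        }
        if (best == 0.0) return false;
        piv[k] = p;
        if (p != k) {
            // Swap full rows k and p, including the already computed part of L.
            for (int j = 0; j < N; ++j) {
                double t = a[k * N + j];
                a[k * N + j] = a[p * N + j];
                a[p * N + j] = t;
            }
        }
        // Store multipliers below the diagonal and update the trailing block.
        for (int i = k + 1; i < N; ++i) {
            a[i * N + k] /= a[k * N + k];
            for (int j = k + 1; j < N; ++j)
                a[i * N + j] -= a[i * N + k] * a[k * N + j];
        }
    }
    return true;
}

// Solve A x = b using the factors from a successful lu_factor call.
// b is overwritten with the solution x.
HD inline void lu_solve(const double* lu, const int* piv, double* b) {
    // Apply the recorded row swaps to b in the same order.
    for (int k = 0; k < N; ++k) {
        int p = piv[k];
        if (p != k) { double t = b[k]; b[k] = b[p]; b[p] = t; }
    }
    // Forward substitution with unit-diagonal L.
    for (int i = 1; i < N; ++i)
        for (int j = 0; j < i; ++j)
            b[i] -= lu[i * N + j] * b[j];
    // Back substitution with U.
    for (int i = N - 1; i >= 0; --i) {
        for (int j = i + 1; j < N; ++j)
            b[i] -= lu[i * N + j] * b[j];
        assert(lu[i * N + i] != 0.0);  // precondition: factorization succeeded
        b[i] /= lu[i * N + i];
    }
}

#ifdef __CUDACC__
// One thread per independent problem: factor once, then solve nrhs systems.
// mats holds count matrices of N*N doubles; rhs holds count*nrhs vectors of N doubles.
__global__ void batched_solve(double* mats, double* rhs, int nrhs, int count) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= count) return;
    double* a = mats + (long long)idx * N * N;
    int piv[N];
    if (!lu_factor(a, piv)) return;  // skip singular problems in this sketch
    for (int r = 0; r < nrhs; ++r)
        lu_solve(a, piv, rhs + ((long long)idx * nrhs + r) * N);
}
#endif
```

Because the factor and solve routines are plain `__host__ __device__` functions, they can be checked on the CPU before being launched from a kernel. A device-side library would ideally replace exactly these two functions.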