Matrix inverse from device code

Hi all,

I’m trying to parallelize an algorithm that needs to perform matrix inversion. Each thread has its own matrix, and the sizes vary from thread to thread, so I’m struggling to use batched processing and would like to call a matrix inversion function directly from my kernel. Is there a way to do this? So far my matrices were always Hermitian, so I just implemented a Cholesky decomposition and diagonal inversion to perform the inversion. Now I’m dealing with general non-singular matrices, and I’d rather not implement the whole inversion function…
I’m a bit confused about which external libraries I can use from device code… I heard about cuBLAS, but it seems to be deprecated since CUDA 10.0, right…?
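
For reference, this is roughly what I would otherwise have to write myself: a minimal sketch of a per-thread, in-place Gauss-Jordan inversion with partial pivoting. It assumes real double-precision matrices stored row-major, which is a simplification (complex matrices would need a complex type). The names HD, invert_in_place and invert_all, the eps threshold, and the offsets/sizes layout are all my own hypothetical choices, not part of any library.

```cpp
#include <math.h>

// HD expands to __host__ __device__ under nvcc and to nothing otherwise,
// so the same function can be called from a kernel or from host code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: inverts a row-major n x n matrix in place.
// a:    n*n entries, overwritten with the inverse on success
// perm: scratch space for n ints (records the row swaps)
// Returns false if a pivot is smaller than eps (matrix treated as singular).
HD inline bool invert_in_place(double* a, int* perm, int n, double eps = 1e-12)
{
    for (int k = 0; k < n; ++k) {
        // Partial pivoting: pick the largest entry in column k, rows k..n-1.
        int p = k;
        double best = fabs(a[k * n + k]);
        for (int i = k + 1; i < n; ++i) {
            double v = fabs(a[i * n + k]);
            if (v > best) { best = v; p = i; }
        }
        if (best < eps) return false;
        perm[k] = p;
        if (p != k) {
            for (int j = 0; j < n; ++j) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
        }
        // Scale the pivot row; column k now starts holding the inverse.
        double inv_piv = 1.0 / a[k * n + k];
        a[k * n + k] = 1.0;
        for (int j = 0; j < n; ++j) a[k * n + j] *= inv_piv;
        // Eliminate column k from every other row.
        for (int i = 0; i < n; ++i) {
            if (i == k) continue;
            double f = a[i * n + k];
            a[i * n + k] = 0.0;
            for (int j = 0; j < n; ++j) a[i * n + j] -= f * a[k * n + j];
        }
    }
    // Undo the row swaps as column swaps, in reverse order.
    for (int k = n - 1; k >= 0; --k) {
        int p = perm[k];
        if (p == k) continue;
        for (int i = 0; i < n; ++i) {
            double t = a[i * n + k];
            a[i * n + k] = a[i * n + p];
            a[i * n + p] = t;
        }
    }
    return true;
}

#ifdef __CUDACC__
// Hypothetical kernel: thread t inverts the matrix starting at mats + offsets[t].
// perm_scratch must hold count * max_n ints; ok[t] is 1 on success, 0 otherwise.
__global__ void invert_all(double* mats, const int* offsets, const int* sizes,
                           int* perm_scratch, int max_n, int count, int* ok)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= count) return;
    ok[t] = invert_in_place(mats + offsets[t], perm_scratch + t * max_n, sizes[t]);
}
#endif
```

This works for small matrices, but each thread does O(n^3) work with uncoalesced memory access, so I would much rather call a tested library routine if one can be used from device code.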

Thanks!