Current state of Device Extension libraries

Hi all - thanks in advance for your time.

My current application performs heavily parallelized matrix operations, and I want to keep all calculations and context on the device side: a single kernel in which the threads of each block carry out several consecutive small-ish matrix operations, with hundreds of instances/blocks running concurrently. I have had some success using cuFFTDx (see the NVIDIA cuFFTDx 1.1.0 documentation).
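As a minimal sketch of the pattern described above (illustrative only, not the actual application code): one block per problem instance, with two consecutive small matrix products chained inside a single kernel so the intermediate result never leaves the device. The 4x4 size, the kernel name `chained_matmul`, and the host helper `run_chained` are all assumptions made for the example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 4;  // illustrative matrix dimension (assumption)

// One block per problem instance, one thread per output element.
// Computes D = (A * B) * C entirely on the device; the intermediate
// product T = A * B lives only in shared memory.
__global__ void chained_matmul(const float* A, const float* B,
                               const float* C, float* D)
{
    __shared__ float sA[N][N], sB[N][N], sC[N][N], sT[N][N];

    const int r = threadIdx.y;
    const int c = threadIdx.x;
    const int base = blockIdx.x * N * N;  // offset of this block's instance

    // Stage this instance's inputs in shared memory.
    sA[r][c] = A[base + r * N + c];
    sB[r][c] = B[base + r * N + c];
    sC[r][c] = C[base + r * N + c];
    __syncthreads();

    // First operation: T = A * B
    float acc = 0.0f;
    for (int k = 0; k < N; ++k) acc += sA[r][k] * sB[k][c];
    sT[r][c] = acc;
    __syncthreads();  // T must be complete before the next step reads it

    // Second operation: D = T * C
    acc = 0.0f;
    for (int k = 0; k < N; ++k) acc += sT[r][k] * sC[k][c];
    D[base + r * N + c] = acc;
}

// Hypothetical host helper: copies the batch to the device, launches one
// block per instance, and copies the results back. Inputs are row-major,
// batch * N * N floats each. No error checking (sketch only).
std::vector<float> run_chained(const std::vector<float>& A,
                               const std::vector<float>& B,
                               const std::vector<float>& C, int batch)
{
    const size_t count = size_t(batch) * N * N;
    const size_t bytes = count * sizeof(float);
    float *dA, *dB, *dC, *dD;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMalloc(&dD, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), bytes, cudaMemcpyHostToDevice);

    chained_matmul<<<batch, dim3(N, N)>>>(dA, dB, dC, dD);

    std::vector<float> D(count);
    // Blocking copy on the default stream; waits for the kernel to finish.
    cudaMemcpy(D.data(), dD, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cudaFree(dD);
    return D;
}
```

The hand-written products here stand in for the steps where a device-callable library routine (an eigenvalue solve, a pseudoinverse, and so on) would ideally be used instead.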

What other device-side libraries are available? I am hoping for something like cuBLAS or cuSOLVER that can be called from within a device kernel to compute eigenvalues, pseudoinverses, etc. I know some device-side APIs of this kind were deprecated a few CUDA versions ago. Is there a roadmap for future development?

Thanks
Forrest