Multi-Thread or Batched cuSolver


I am using cuSolver to solve 3104 linear systems AX=B, where each A is 186x186 and each B is a vector of length 186.

I would like to just use a batched function, but I don't see any batched functions for cuSolverDN like there are for cuFFT.

Instead I am trying to launch a kernel that calls cusolverDnSpotrs() in each thread, but I am getting an error saying “calling a host function(“cusolverDnSpotrs”) from a global function(“kernelCall”) is not allowed”.

I have done this before with cuBLAS and it works wonderfully. Even the cuBLAS documentation says “the recommended programming model is to create one CUBLAS handle per thread and use that CUBLAS handle for the entire life of the thread”.

I am trying to create a cuSolver handle per thread, but it gives me the same kind of error as above: “calling a host function(“cusolverDnCreate”) from a global function(“kernelCall”) is not allowed”.

I really don’t want to have a 3104 long loop on the host that runs the solver for each matrix solution. I understand that I could use streams and stuff but there has got to be a better way.

Could someone point me in the right direction?

You cannot make cuSolver calls from a CUDA kernel. The API calls must be made from host code.

CUBLAS is different. The cublas library has call-from-device-code support.

Thanks txbob.

I wrote a quick test that uses streams to issue the 3104 cuSolver calls, but it is still too slow. A single matrix solve takes just under 1 ms, but solving all 3104 systems takes over 2.5 seconds. I only have 2 seconds to do all of my processing before I receive another 3104 systems to solve.
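As a rough sanity check on those numbers, the per-solve time budget can be worked out with a couple of lines of arithmetic. This is only a sketch: the helper names `per_solve_ms` and `required_speedup` are made up for illustration, and the 2500 ms figure is a lower bound, since the measured total was "over" 2.5 seconds.

```cpp
#include <cassert>

// Hypothetical helper: average time available (or spent) per system, in ms.
double per_solve_ms(double total_ms, int num_systems) {
    assert(num_systems > 0);
    return total_ms / num_systems;
}

// Hypothetical helper: minimum speedup needed to fit the measured time
// into the available budget.
double required_speedup(double measured_ms, double budget_ms) {
    assert(budget_ms > 0.0);
    return measured_ms / budget_ms;
}
```

With 3104 systems, `per_solve_ms(2000.0, 3104)` comes to roughly 0.644 ms allowed per solve, while `per_solve_ms(2500.0, 3104)` comes to roughly 0.805 ms actually spent, so `required_speedup(2500.0, 2000.0)` says the whole batch needs to run at least 1.25x faster. That is before leaving any headroom for the rest of the processing in the 2 second window.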

Are there any cuSolver API calls that do something similar to cusolverDnSpotrs but are batched? cusolverDnSpotrs takes advantage of the fact that A is Hermitian (and positive definite) when solving AX=B.
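For reference, a Cholesky-based solve of this kind boils down to a factorization A = L·Lᵀ followed by a forward and a back substitution, and a batch is just many independent solves of that shape. Below is a minimal CPU sketch of that idea, useful mainly for checking GPU results on a few systems. It is not cuSolver's implementation: the names `cholesky_solve` and `cholesky_solve_batch` are made up for illustration, and it assumes each A is real, symmetric positive definite, and stored column-major, with only the lower triangle read.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference only. Solves A x = b for one n x n symmetric positive
// definite A (column-major). A is overwritten with its Cholesky factor L
// in the lower triangle, and b is overwritten with the solution x.
// Returns false if A turns out not to be positive definite.
bool cholesky_solve(std::vector<float>& A, std::vector<float>& b, int n) {
    assert(A.size() == static_cast<std::size_t>(n) * n);
    assert(b.size() == static_cast<std::size_t>(n));
    auto a = [&](int r, int c) -> float& {
        return A[static_cast<std::size_t>(c) * n + r];
    };

    // Factorization step: A = L * L^T, column by column.
    for (int j = 0; j < n; ++j) {
        float d = a(j, j);
        for (int k = 0; k < j; ++k) d -= a(j, k) * a(j, k);
        if (d <= 0.0f) return false;  // not positive definite
        a(j, j) = std::sqrt(d);
        for (int i = j + 1; i < n; ++i) {
            float s = a(i, j);
            for (int k = 0; k < j; ++k) s -= a(i, k) * a(j, k);
            a(i, j) = s / a(j, j);
        }
    }

    // Forward substitution: solve L y = b (y stored in b).
    for (int i = 0; i < n; ++i) {
        float s = b[i];
        for (int k = 0; k < i; ++k) s -= a(i, k) * b[k];
        b[i] = s / a(i, i);
    }

    // Back substitution: solve L^T x = y (x stored in b).
    for (int i = n - 1; i >= 0; --i) {
        float s = b[i];
        for (int k = i + 1; k < n; ++k) s -= a(k, i) * b[k];
        b[i] = s / a(i, i);
    }
    return true;
}

// Solves every system in the batch independently and returns how many
// succeeded. Each As[i] and bs[i] is overwritten as in cholesky_solve.
int cholesky_solve_batch(std::vector<std::vector<float>>& As,
                         std::vector<std::vector<float>>& bs, int n) {
    assert(As.size() == bs.size());
    int ok = 0;
    for (std::size_t i = 0; i < As.size(); ++i) {
        if (cholesky_solve(As[i], bs[i], n)) ++ok;
    }
    return ok;
}
```

Because the systems do not depend on each other, the outer loop in `cholesky_solve_batch` is the part a batched GPU routine would run in parallel. For the case in this thread, `n` would be 186 and the batch would hold 3104 matrices and right-hand sides.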