Sparse cuSOLVER inside loop: factorization at every call?

Hi,

I am wondering whether there is a cuSOLVER routine that can be used as a replacement for Intel MKL PARDISO.

I am dealing with the problem Ax=b, where “A” is sparse, symmetric and positive definite, and x and b are vectors which can hold multiple right-hand sides/solutions. “A” is constant throughout the program, but “Ax=b” is solved in different parts of the program with different “x”'s and “b”'s.

So far I have used PARDISO, which has separate preparation, factorization and solve steps that can be called individually. Thus the preparation and the most expensive factorization can be done once at the beginning of the program, and the solve calls can be repeated anywhere throughout the program at little cost because the factors are reused.

With regard to CUDA, I have looked at “cusolverSpScsrlsvlu”, “cusolverSpScsrlsvqr” and “cusolverSpScsrlsvchol”, and it appears to me that these functions will (re)do the factorization each time they are called, which will produce a massive overhead. Is that correct? And if so, is there any way to circumvent it?

Thanks

Take a look at the sample codes cuSolverSp_LowlevelQR and cuSolverSp_LowlevelCholesky

Using the first one as an example, the whole solution process is first done using the host API, in steps 1 through 8.

After that, the process is repeated using the device API in steps 9-14. AFAIK steps 9-13 can be done once, and step 14 can be repeated as many times as needed for different RHSs if the A matrix doesn’t change.
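In outline, the factor-once / solve-many pattern in that sample looks roughly like the sketch below, using the low-level Cholesky functions from cusolverSp_LOWLEVEL_PREVIEW.h. This is a condensed sketch, not the full sample: error checking is omitted, and the `d_*` pointers are assumed to already hold the CSR matrix and vectors on the device.

```cpp
#include <cuda_runtime.h>
#include <cusolverSp.h>
#include <cusolverSp_LOWLEVEL_PREVIEW.h>

void factor_once_then_solve(int n, int nnz,
                            const int *d_csrRowPtr, const int *d_csrColInd,
                            const double *d_csrVal,
                            const double *d_b, double *d_x)
{
    cusolverSpHandle_t handle;
    cusparseMatDescr_t descrA;
    csrcholInfo_t      info;
    cusolverSpCreate(&handle);
    cusparseCreateMatDescr(&descrA);
    cusolverSpCreateCsrcholInfo(&info);

    // Done ONCE: symbolic analysis, workspace query, numeric factorization.
    cusolverSpXcsrcholAnalysis(handle, n, nnz, descrA,
                               d_csrRowPtr, d_csrColInd, info);
    size_t internalBytes = 0, workspaceBytes = 0;
    cusolverSpDcsrcholBufferInfo(handle, n, nnz, descrA, d_csrVal,
                                 d_csrRowPtr, d_csrColInd, info,
                                 &internalBytes, &workspaceBytes);
    void *buffer = nullptr;
    cudaMalloc(&buffer, workspaceBytes);
    cusolverSpDcsrcholFactor(handle, n, nnz, descrA, d_csrVal,
                             d_csrRowPtr, d_csrColInd, info, buffer);

    // Repeated as often as needed: triangular solves reusing the factor.
    cusolverSpDcsrcholSolve(handle, n, d_b, d_x, info, buffer);

    cudaFree(buffer);
    cusolverSpDestroyCsrcholInfo(info);
    cusparseDestroyMatDescr(descrA);
    cusolverSpDestroy(handle);
}
```

As long as the `info` structure and `buffer` are kept alive, only the final Solve call needs to be repeated for new right-hand sides.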

Hi, thanks for the quick response. I had a look at the example, and there is a function which I struggle to find in the cuda online documentation : “cusolverSpDcsrcholFactorHost”. I looked here: [url]https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-low-level-function-reference[/url] but I could not find anything. Of course after knowing what to look for I found file “cusolverSP_LOWLEVEL_PREVIEW.h”. So what am I doing wrong when not finding these functionality described in the online documentation (which is the reason for the thread)??

Thanks

Because they (various cuSolverSp functions used in those sample codes) are not in the online documentation, AFAIK.

The fact that these functions are in a special header file with PREVIEW in the name seems to be significant.

I have an internal inquiry about it, but the earliest the documentation could change would be the next CUDA release (and I have no guarantees about it changing then).

Thanks a lot.

Hi Robert,

so I got all that to work, but I noticed that there is no reordering step before the call to “cusolverSpDcsrcholFactor”. According to “cusolverSpDcsrcholBufferInfo”, the matrix I am using requires 6,581,070,336 bytes of memory for factorization. For the same matrix, MKL PARDISO reports a peak memory usage of 3.87 gigabytes, almost half of what CUDA wants. This is presumably due to CUDA not reordering the matrix to reduce fill-in. Is there any option (which I may have overlooked) to request a reordering before the factorization call?
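For reference, what I am hoping for is something like the following: the documented host helpers “cusolverSpXcsrsymamdHost” and “cusolverSpXcsrpermHost” chained to build a reordered matrix B = P·A·Pᵀ before factoring. This is only a sketch of my understanding from the docs (names and signatures as documented, but untested here), and note that csrpermHost expects the full pattern of a general matrix, not just one triangle:

```cpp
#include <vector>
#include <cusolverSp.h>

// Reorder a CSR matrix in place with SYMAMD to reduce fill-in.
// On return, csrRowPtr/csrColInd/csrVal describe B = P*A*P^T and
// perm holds P; one then factors B and solves B*y = P*b, x = P^T*y.
void reorder_symamd(cusolverSpHandle_t handle, cusparseMatDescr_t descrA,
                    int n, int nnz,
                    std::vector<int> &csrRowPtr,
                    std::vector<int> &csrColInd,
                    std::vector<double> &csrVal,
                    std::vector<int> &perm)
{
    perm.resize(n);
    // 1. Compute a fill-reducing symmetric permutation on the host.
    cusolverSpXcsrsymamdHost(handle, n, nnz, descrA,
                             csrRowPtr.data(), csrColInd.data(), perm.data());

    // 2. Apply the permutation to the CSR pattern; 'map' records where
    //    each original entry ends up.
    size_t bufSize = 0;
    cusolverSpXcsrperm_bufferSizeHost(handle, n, n, nnz, descrA,
                                      csrRowPtr.data(), csrColInd.data(),
                                      perm.data(), perm.data(), &bufSize);
    std::vector<char> buffer(bufSize);
    std::vector<int> map(nnz);
    for (int i = 0; i < nnz; ++i) map[i] = i;
    cusolverSpXcsrpermHost(handle, n, n, nnz, descrA,
                           csrRowPtr.data(), csrColInd.data(),
                           perm.data(), perm.data(),
                           map.data(), buffer.data());

    // 3. Permute the numeric values to match the new pattern.
    std::vector<double> valB(nnz);
    for (int i = 0; i < nnz; ++i) valB[i] = csrVal[map[i]];
    csrVal.swap(valB);
}
```

Is this the intended way to get the fill-in reduction, or is there something built into the csrchol path that I am missing?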

Further, is there any way to retrieve the factor from the device once the factorization is done? For a smaller example matrix where everything went successfully, I tried to copy values from the location of “pBuffer”, but all I got was zeros. My guess is that CUDA is not “giving away” the factor (similar to MKL), but I might be wrong.

Finally, is there any option for multiple right-hand sides (RHS) in a single call of “cusolverSpDcsrcholSolve”? My experience with MKL is that calling PARDISO in a loop n times for n RHSs takes much more time than calling it once with all RHSs delivered at once.
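At the moment I am looping over the columns myself, along these lines (a sketch: `d_B` and `d_X` are assumed to be device pointers to column-major n-by-k arrays, with `info` and `buffer` from an earlier cusolverSpDcsrcholFactor call):

```cpp
// One cusolverSpDcsrcholSolve call per right-hand-side column;
// the factorization cost is paid only once, but the solves are serialized.
for (int j = 0; j < k; ++j) {
    cusolverSpDcsrcholSolve(handle, n,
                            d_B + (size_t)j * n,   // j-th RHS column
                            d_X + (size_t)j * n,   // j-th solution column
                            info, buffer);
}
```

Is there a batched or blocked alternative that would amortize the kernel launches the way PARDISO amortizes them on the CPU?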

Thanks.

Hi, Robert! Firstly, thanks for reading my message. I have run into a problem similar to the others in this thread when solving a linear equation system with a CUDA library function. I used “cusolverSpDcsrlsvqr” from the “cuSolverSp_LinearSolver” sample to solve a system whose coefficient matrix is a 50,000 x 50,000 sparse matrix. However, I find that cuSOLVER takes much longer (68 sec per step) than PARDISO does in Fortran (2.3 sec per step). Since GPU parallelism is generally considered more efficient than CPU parallelism, could you provide some help? Thank you so much for your patience.

Hi

Please check out cuDSS, a direct sparse solver for NVIDIA GPUs.
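For the pattern discussed in this thread (factor once, solve many, multiple RHS in one call), cuDSS exposes the phases explicitly. A rough sketch of the flow, based on my reading of the cuDSS docs (check the documentation for the exact signatures in your cuDSS version; error checking and device allocation omitted):

```cpp
#include <cudss.h>

// A is n x n SPD in CSR (upper triangle stored), B/X are n x nrhs dense,
// all arrays already on the device.
void solve_with_cudss(int64_t n, int64_t nnz, int nrhs,
                      int *d_rowPtr, int *d_colInd, double *d_val,
                      double *d_b, double *d_x)
{
    cudssHandle_t handle;  cudssCreate(&handle);
    cudssConfig_t config;  cudssConfigCreate(&config);
    cudssData_t   data;    cudssDataCreate(handle, &data);

    cudssMatrix_t A, x, b;
    cudssMatrixCreateCsr(&A, n, n, nnz, d_rowPtr, nullptr, d_colInd, d_val,
                         CUDA_R_32I, CUDA_R_64F, CUDSS_MTYPE_SPD,
                         CUDSS_MVIEW_UPPER, CUDSS_BASE_ZERO);
    cudssMatrixCreateDn(&b, n, nrhs, n, d_b, CUDA_R_64F,
                        CUDSS_LAYOUT_COL_MAJOR);
    cudssMatrixCreateDn(&x, n, nrhs, n, d_x, CUDA_R_64F,
                        CUDSS_LAYOUT_COL_MAJOR);

    // Analysis (includes reordering) and factorization: done once.
    cudssExecute(handle, CUDSS_PHASE_ANALYSIS,      config, data, A, x, b);
    cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, config, data, A, x, b);
    // Solve: repeatable with new b/x, and handles all nrhs columns at once.
    cudssExecute(handle, CUDSS_PHASE_SOLVE,         config, data, A, x, b);

    cudssMatrixDestroy(A); cudssMatrixDestroy(b); cudssMatrixDestroy(x);
    cudssDataDestroy(handle, data);
    cudssConfigDestroy(config);
    cudssDestroy(handle);
}
```

The analysis and factorization phases can be skipped on subsequent solves as long as A is unchanged, which matches the PARDISO-style workflow asked about at the top of the thread.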


Hi asundaram! Thanks so much for your previous suggestion about solving a sparse linear system. After installing cuDSS, I ran into problems when trying cuDSS Example 1 (CUDALibrarySamples/cuDSS/simple at master · NVIDIA/CUDALibrarySamples · GitHub). Firstly, I set the paths for the cuDSS library and header files according to “Installation and Compilation”, but the cuDSS library functions were not recognized. Secondly, after I copied the cuDSS library and header files into the “include” and “lib” directories of CUDA 12.0, the cuDSS code could run. However, while the documented solution of cuDSS Example 1 is {1, 2, 3, 4, 5}, I find the solution is {0.825, 1.53, 3.7, 4, 6.5} when checking with other tools, such as “cusolverSpDcsrlsvqr” in the “cuSolverSp_LinearSolver” sample. So, is there any problem with how I am using the cuDSS sample and calling the cuDSS library functions? Thank you very much for your patience in reading this message.