Sparse cuSOLVER inside a loop: factorization at every call?

Hi,

I am wondering whether there is any cuSOLVER routine which can be used as a replacement for Intel MKL PARDISO.

I am dealing with the problem Ax=b, where “A” is sparse, symmetric and positive definite, and x and b are vectors which can hold multiple right-hand sides/solutions. “A” is constant throughout the program, but “Ax=b” is solved in different parts of the program with different “x”'s and “b”'s.

So far I have used PARDISO, which has preparation, factorization and solve steps that can be called individually. Thus, the preparation and the most expensive factorization can be done once at the beginning of the program, and the solve calls can be repeated anywhere throughout the program at little cost because the factors are reused.

With regard to CUDA, I have looked at “cusolverSpScsrlsvlu”, “cusolverSpScsrlsvqr” and “cusolverSpScsrlsvchol”, and it appears to me that these functions will redo the factorization each time they are called, which would produce a massive overhead. Is that correct? And if so, is there any way to circumvent it?

Thanks

Take a look at the sample codes cuSolverSp_LowlevelQR and cuSolverSp_LowlevelCholesky

Using the first one as an example, the whole solution process is first done using the host API in steps 1 through 8.

After that, the process is repeated using the device API in steps 9-14. AFAIK steps 9-13 can be done once, and step 14 can be repeated as many times as needed for different RHSs if the A matrix doesn’t change.
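To sketch that factor-once / solve-many split: the names below are the low-level Cholesky routines from the cusolverSp_LOWLEVEL_PREVIEW.h header shipped with those samples, so check them against your toolkit version; error checking and the CSR upload of A are omitted, and this is only an outline of the sample's flow, not a drop-in implementation.

```cpp
#include <cuda_runtime.h>
#include <cusolverSp.h>
#include <cusolverSp_LOWLEVEL_PREVIEW.h>

// One-time setup: analysis (steps ~9-12) and numeric factorization
// (step ~13). d_csrVal/d_csrRowPtr/d_csrColInd hold A in device CSR
// format (n rows, nnz nonzeros).
void factorOnce(cusolverSpHandle_t handle, cusparseMatDescr_t descrA,
                int n, int nnz,
                const double *d_csrVal, const int *d_csrRowPtr,
                const int *d_csrColInd,
                csrcholInfo_t *info, void **d_buffer)
{
    cusolverSpCreateCsrcholInfo(info);

    // Symbolic analysis: depends only on the sparsity pattern.
    cusolverSpXcsrcholAnalysis(handle, n, nnz, descrA,
                               d_csrRowPtr, d_csrColInd, *info);

    size_t internalBytes = 0, workspaceBytes = 0;
    cusolverSpDcsrcholBufferInfo(handle, n, nnz, descrA,
                                 d_csrVal, d_csrRowPtr, d_csrColInd,
                                 *info, &internalBytes, &workspaceBytes);
    cudaMalloc(d_buffer, workspaceBytes);

    // Numeric factorization: also done only once while A is constant.
    cusolverSpDcsrcholFactor(handle, n, nnz, descrA,
                             d_csrVal, d_csrRowPtr, d_csrColInd,
                             *info, *d_buffer);
}

// Step 14: called as often as needed with different RHS vectors d_b,
// reusing the factor held inside info/d_buffer.
void solveAgain(cusolverSpHandle_t handle, int n,
                const double *d_b, double *d_x,
                csrcholInfo_t info, void *d_buffer)
{
    cusolverSpDcsrcholSolve(handle, n, d_b, d_x, info, d_buffer);
}
```

The key point is that `info` and `d_buffer` must stay alive between calls; only `solveAgain` appears inside the program's loops.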

Hi, thanks for the quick response. I had a look at the example, and there is a function which I struggle to find in the CUDA online documentation: “cusolverSpDcsrcholFactorHost”. I looked here: https://docs.nvidia.com/cuda/cusolver/index.html#cusolver-low-level-function-reference but I could not find anything. Of course, after knowing what to look for, I found the file “cusolverSP_LOWLEVEL_PREVIEW.h”. So what am I doing wrong when I cannot find this functionality described in the online documentation (which is the reason for this thread)?

Thanks

Because they (various cuSolverSp functions used in those sample codes) are not in the online documentation, AFAIK.

The fact that these functions are in a special header file with PREVIEW in the name seems to be significant.

I have an internal inquiry about it, but the earliest the documentation could change would be the next CUDA release (and I have no guarantees about it changing then).

Thanks a lot.

Hi Robert,

so I got all that to work, but I noticed that there is no reordering step before the call to “cusolverSpDcsrcholFactor”. According to “cusolverSpDcsrcholBufferInfo”, the matrix I am using requires 6,581,070,336 bytes of memory for the factorization. Using the same matrix in MKL PARDISO, PARDISO reports a peak memory usage of 3.87 GB, almost half of what CUDA wants. This is presumably due to CUDA not reordering the matrix to reduce fill-in. Is there any option (which I may have overlooked) to apply a reordering before the factorization call?
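One workaround that appears possible with the documented host-side helpers is to compute a fill-reducing permutation myself and factorize the permuted matrix B = Q*A*Q^T. A minimal sketch of what I have in mind, assuming A is stored on the host with its full symmetric pattern (the permutation routines work on the whole pattern, not a triangle); whether this is the intended usage is exactly what I am unsure about:

```cpp
#include <cusolverSp.h>
#include <vector>

// Compute a symamd permutation Q on the host and apply B = Q*A*Q^T
// in place to the CSR arrays; B is then what gets factorized.
void reorderHost(cusolverSpHandle_t handle, cusparseMatDescr_t descrA,
                 int n, int nnz,
                 std::vector<int> &h_csrRowPtr,
                 std::vector<int> &h_csrColInd,
                 std::vector<double> &h_csrVal,
                 std::vector<int> &Q)            // out: permutation
{
    Q.resize(n);
    cusolverSpXcsrsymamdHost(handle, n, nnz, descrA,
                             h_csrRowPtr.data(), h_csrColInd.data(),
                             Q.data());

    size_t bufSize = 0;
    cusolverSpXcsrperm_bufferSizeHost(handle, n, n, nnz, descrA,
                                      h_csrRowPtr.data(),
                                      h_csrColInd.data(),
                                      Q.data(), Q.data(), &bufSize);
    std::vector<char> buffer(bufSize);

    // map starts as the identity; afterwards map[i] says which entry
    // of A supplies value i of B, so the values can be gathered.
    std::vector<int> map(nnz);
    for (int i = 0; i < nnz; ++i) map[i] = i;
    cusolverSpXcsrpermHost(handle, n, n, nnz, descrA,
                           h_csrRowPtr.data(), h_csrColInd.data(),
                           Q.data(), Q.data(), map.data(),
                           buffer.data());

    std::vector<double> newVal(nnz);
    for (int i = 0; i < nnz; ++i) newVal[i] = h_csrVal[map[i]];
    h_csrVal.swap(newVal);
}
```

If this is the right approach, I assume each RHS then has to be permuted before the solve (b_B[k] = b[Q[k]]) and the solution un-permuted afterwards (x[Q[k]] = x_B[k]).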

Further, is there any way to retrieve the factor from the device once the factorization is done? For a smaller example matrix, where everything went successfully, I tried to copy values from the location of “pBuffer”, but all I got was zeros. My guess is that CUDA is not “giving away” the factor (similar to MKL), but I might be wrong.

Finally, is there any option for multiple right-hand sides (RHSs) in a single call of “cusolverSpDcsrcholSolve”? My experience from MKL is that calling PARDISO in a loop n times for n RHSs takes much more time than calling it once with all RHSs delivered together.
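For now I am simply looping over the columns, since “cusolverSpDcsrcholSolve” takes one vector per call and I have not found a batched variant in the preview header. A sketch of what I am doing, with the k RHSs stored column-major in a dense n-by-k device array d_B (the expensive factorization is still done only once):

```cpp
#include <cusolverSp.h>
#include <cusolverSp_LOWLEVEL_PREVIEW.h>

// Solve A*X = B column by column, reusing the Cholesky factor kept
// in info/d_buffer. d_B and d_X are n-by-k, column-major, on device.
void solveMultiRhs(cusolverSpHandle_t handle, int n, int k,
                   const double *d_B, double *d_X,
                   csrcholInfo_t info, void *d_buffer)
{
    for (int j = 0; j < k; ++j)
        cusolverSpDcsrcholSolve(handle, n,
                                d_B + (size_t)j * n,
                                d_X + (size_t)j * n,
                                info, d_buffer);
}
```

Is there anything better than this loop, or is the per-call overhead on the GPU small enough that it does not matter the way it does with PARDISO?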

Thanks.