Triangular solver for ILU1 with cuSparse.

I’m currently involved in a project where we intend to use the triangular solvers implemented in cuSparse as a preconditioning technique in iterative solvers. The intention is to generate the ILU0 and ILU1 (and maybe ILUT) factors on the CPU and then move them to the accelerator, where the triangular systems are solved by the routine cusparseDcsrsv2_solve.

Following the cuSparse documentation, I’ve reproduced the steps to use this routine for arbitrary factors L and/or U. So far we have managed to use these routines to solve with the incomplete factors L and U from the ILU0 factorization on matrices with around 3 000 to 1 200 000 equations.

Our problem is that when we generate the ILU1 factors for the system with 3 000 equations, the library fails in the analysis phase (cusparseDcsrsv2_analysis) of the U factor with the status CUSPARSE_STATUS_INTERNAL_ERROR. We have run several validations on this factor and everything seems OK. It is an upper triangular matrix with a non-unit diagonal, as indicated in the opaque data structure cusparseMatDescr_t.
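To illustrate the kind of host-side validation meant here, the following is a minimal sketch rather than our exact code. The helper name checkUpperCsr is hypothetical, the CSR arrays are assumed to be zero-based, and treating unsorted or duplicate column indices as errors is an assumption of the sketch, not something I have confirmed as a cuSparse requirement.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns an empty string if the zero-based CSR arrays
// describe an m x m upper triangular matrix with an explicit non-zero
// diagonal, otherwise a short description of the first problem found.
std::string checkUpperCsr(int m,
                          const std::vector<int>& rowPtr,
                          const std::vector<int>& colInd,
                          const std::vector<double>& val)
{
    if (m < 0 || rowPtr.size() != static_cast<std::size_t>(m) + 1)
        return "rowPtr must have m + 1 entries";
    if (colInd.size() != val.size())
        return "colInd and val must have the same length";
    const int nnz = static_cast<int>(val.size());
    if (rowPtr[0] != 0 || rowPtr[m] != nnz)
        return "rowPtr must start at 0 and end at nnz";

    for (int i = 0; i < m; ++i) {
        const std::string row = " in row " + std::to_string(i);
        // Row pointers must be non-decreasing and stay inside the arrays.
        if (rowPtr[i + 1] < rowPtr[i] || rowPtr[i + 1] > nnz)
            return "invalid row pointer" + row;
        bool hasDiag = false;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            const int col = colInd[k];
            if (col < 0 || col >= m)
                return "column index out of range" + row;
            if (col < i)
                return "entry below the diagonal" + row;
            // Assumption of this sketch: columns strictly increase per row.
            if (k > rowPtr[i] && col <= colInd[k - 1])
                return "unsorted or duplicate column index" + row;
            if (col == i && val[k] != 0.0)
                hasDiag = true;
        }
        if (!hasDiag)
            return "missing or zero diagonal" + row;
    }
    return "";
}
```

Calling checkUpperCsr(m, rowPtr, colInd, val) on the host copy of U before it is moved to the accelerator returns an empty string when all checks pass.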

We are really looking forward to using this library on a regular basis for our distributed-memory solver on the available NVIDIA GPUs. Is there a way to know what could be producing this error in the analysis phase? Are there any limitations to be considered when using this cuSparse function?

More details about our experiments: the cusparseSolvePolicy_t is CUSPARSE_SOLVE_POLICY_USE_LEVEL for both factors, L and U. When using CUSPARSE_SOLVE_POLICY_NO_LEVEL the analysis phase does not crash, but calling cusparseDcsrsv2_solve for the U factor raises the error

call to cuMemFreeHost returned error 700: Illegal address during kernel execution

If I understand correctly, this error is associated with an out-of-bounds memory access on the accelerator, but I don’t have any clue what could be causing it.
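One possible source of out-of-bounds accesses that can be ruled out on the host is a mismatch between the index base of the CSR arrays and the one declared in the matrix descriptor. The sketch below is only an illustration under that assumption. The helper name indicesMatchBase is hypothetical, and base is 0 for CUSPARSE_INDEX_BASE_ZERO or 1 for CUSPARSE_INDEX_BASE_ONE.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if rowPtr and colInd of an m x m CSR matrix are
// consistent with the given index base (0 or 1).
bool indicesMatchBase(int m,
                      const std::vector<int>& rowPtr,
                      const std::vector<int>& colInd,
                      int base)
{
    if (m < 0 || rowPtr.size() != static_cast<std::size_t>(m) + 1)
        return false;
    // The first row pointer must equal the base, and the last one must
    // point one past the final stored entry.
    if (rowPtr.front() != base)
        return false;
    if (rowPtr.back() - base != static_cast<int>(colInd.size()))
        return false;
    // Row pointers must not decrease.
    if (!std::is_sorted(rowPtr.begin(), rowPtr.end()))
        return false;
    // Every column index must lie in [base, base + m - 1].
    return std::all_of(colInd.begin(), colInd.end(),
                       [=](int c) { return c >= base && c < base + m; });
}
```

If this returns false for the base set in the descriptor, the arrays and the descriptor disagree, which would be one candidate explanation for an illegal address, though certainly not the only one.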

If needed, the matrix factors to reproduce these errors can be supplied.

Since I manage memory allocation and deallocation and basic BLAS operations with OpenACC directives, the code is compiled with the following line.

pgc++ -w -std=c++11 -acc -ta=tesla:cc35 -ta=tesla:cuda9.1 -Mcuda -Mcudalib=cusparse -Minfo -o main main.cpp