performance of cusparseDcsrsv_analysis

Can anybody help me with this weird phenomenon?

I wrote a conjugate-gradient library for solving linear systems of equations. I use an LU factorization, so in the residual update step I need to perform two triangular solves. However, the analysis step of the triangular solver (cusparseDcsrsv_analysis) takes a lot of time! For instance, if the whole solver needs 360 ms to converge, the two analysis calls account for 330 ms of that!

/* Matrix descriptors for the lower (L) and upper (U) triangular factors */
cusparseMatDescr_t descrL = 0 ;
cusparseMatDescr_t descrU = 0 ;
cusparseStatus = cusparseCreateMatDescr(&descrL) ;
cusparseStatus = cusparseCreateMatDescr(&descrU) ;

/* L: lower triangular, one-based indices, implicit unit diagonal */
cusparseSetMatType(descrL,CUSPARSE_MATRIX_TYPE_TRIANGULAR) ;
cusparseSetMatIndexBase(descrL,CUSPARSE_INDEX_BASE_ONE) ;
cusparseSetMatDiagType(descrL,CUSPARSE_DIAG_TYPE_UNIT) ;
cusparseSetMatFillMode(descrL,CUSPARSE_FILL_MODE_LOWER) ;

/* U: upper triangular, one-based indices, stored (non-unit) diagonal */
cusparseSetMatType(descrU,CUSPARSE_MATRIX_TYPE_TRIANGULAR) ;
cusparseSetMatIndexBase(descrU,CUSPARSE_INDEX_BASE_ONE) ;
cusparseSetMatDiagType(descrU,CUSPARSE_DIAG_TYPE_NON_UNIT) ;
cusparseSetMatFillMode(descrU,CUSPARSE_FILL_MODE_UPPER) ;

/* Analysis info objects, one per factor */
cusparseSolveAnalysisInfo_t inforL = 0 ;
cusparseSolveAnalysisInfo_t inforU = 0 ;
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforL) ;
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforU) ;

/* Time only the two analysis calls */
startSP = omp_get_wtime() ;
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, NZLT, descrL, matrixLT, iRowLT, jColLT, inforL) ;
if(cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("%s \n\n","cusparseDcsrsv_analysis1 Error !") ;
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, NZUT, descrU, matrixUT, iRowUT, jColUT, inforU) ;
if(cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("%s \n\n","cusparseDcsrsv_analysis2 Error !") ;
finishSP = omp_get_wtime() ;

/* Forward solve with L (r -> t), then backward solve with U (t -> z) */
cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, &c2, descrL, matrixLT, iRowLT, jColLT, inforL, r, t) ;
if(cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("%s \n\n","cusparseDcsrsv_solve1 Error !") ;
cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, &c2, descrU, matrixUT, iRowUT, jColUT, inforU, t, z) ;
if(cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("%s \n\n","cusparseDcsrsv_solve2 Error !") ;
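
To sanity-check what the two solve calls are expected to return, a plain CPU reference can be handy. The sketch below is not part of cuSPARSE: the function names csr_lower_unit_solve and csr_upper_solve are made up for illustration, and it assumes the same conventions as the descriptors in the post (CSR storage, one-based indices, an implicit unit diagonal for L, a stored non-zero diagonal for U). It also assumes that entries outside the relevant triangle are simply skipped.

```c
#include <assert.h>

/* Hypothetical CPU reference: solve L * y = alpha * x by forward substitution.
   L is unit lower triangular in CSR with one-based rowPtr/colInd.
   Stored diagonal or upper entries, if any, are ignored. */
void csr_lower_unit_solve(int n, double alpha, const double *val,
                          const int *rowPtr, const int *colInd,
                          const double *x, double *y)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        double sum = alpha * x[i];
        for (int k = rowPtr[i] - 1; k < rowPtr[i + 1] - 1; ++k) {
            int j = colInd[k] - 1;          /* convert to zero-based column */
            if (j < i) sum -= val[k] * y[j];
        }
        y[i] = sum;                          /* diagonal is implicitly 1 */
    }
}

/* Hypothetical CPU reference: solve U * y = alpha * x by backward substitution.
   U is upper triangular in CSR with one-based indices and a stored diagonal. */
void csr_upper_solve(int n, double alpha, const double *val,
                     const int *rowPtr, const int *colInd,
                     const double *x, double *y)
{
    assert(n >= 0);
    for (int i = n - 1; i >= 0; --i) {
        double sum = alpha * x[i];
        double diag = 0.0;
        for (int k = rowPtr[i] - 1; k < rowPtr[i + 1] - 1; ++k) {
            int j = colInd[k] - 1;          /* convert to zero-based column */
            if (j > i) sum -= val[k] * y[j];
            else if (j == i) diag = val[k];
        }
        assert(diag != 0.0);                 /* a zero pivot cannot be solved */
        y[i] = sum / diag;
    }
}
```

Running both functions on host copies of the factors, with alpha set to the same value as c2, and comparing the output with z copied back from the device is one way to check the GPU solves independently of their timing.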

I appreciate your help very much. Thanks in advance.

It looks like you are using some sort of hybrid approach since you mention both CG (an iterative method) and LU factorization (a direct method)? I asked the CUSPARSE developers for comments. Their response (with some re-phrasing by me):

[1] There is a technical report at http://research.nvidia.com/publication/parallel-solution-sparse-triangular-linear-systems-preconditioned-iterative-methods-gpu that explains that the algorithm is designed to be used in an iterative setting, where the slower analysis phase is performed only once, while the faster solve phase is performed multiple times.

[2] When using an iterative method, the analysis should be performed once outside of the main loop, while the solve is performed multiple times inside the loop. There is a white paper at http://developer.nvidia.com/content/incomplete-lu-and-cholesky-preconditioned-iterative-methods-using-cusparse-and-cublas on how to set this up.

[3] When solving linear systems using a direct method, with the need to perform a single lower and upper triangular solve, this might not be the right algorithm, because the cost of the analysis phase is not amortized across multiple iterations.
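
For intuition about what the analysis phase pays for: as far as the technical report in [1] describes it, the analysis inspects the sparsity pattern and groups rows into levels, so that every row in a level depends only on rows in earlier levels and the rows within one level can be solved in parallel. The sketch below is a CPU illustration of that level computation for a lower triangular CSR matrix with one-based indices. It illustrates the general idea under that assumption and is not the cuSPARSE implementation; csr_lower_levels is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical sketch: assign each row of a lower triangular CSR matrix
   (one-based rowPtr/colInd) to a dependency level.
   level[i] = 0 if row i has no off-diagonal entries, otherwise
   1 + the largest level among the rows it depends on.
   Returns the total number of levels. */
int csr_lower_levels(int n, const int *rowPtr, const int *colInd, int *level)
{
    assert(n >= 0);
    int numLevels = 0;
    for (int i = 0; i < n; ++i) {
        int lvl = 0;
        for (int k = rowPtr[i] - 1; k < rowPtr[i + 1] - 1; ++k) {
            int j = colInd[k] - 1;           /* convert to zero-based column */
            if (j < i && level[j] + 1 > lvl)
                lvl = level[j] + 1;          /* row i must wait for row j */
        }
        level[i] = lvl;
        if (lvl + 1 > numLevels)
            numLevels = lvl + 1;
    }
    return numLevels;
}
```

The result depends only on the sparsity pattern and not on the right-hand side, which is why it can be computed once and then reused by every solve inside the iteration loop.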

Thanks for your prompt reply. I am indeed using an iterative method (CG), and the LU factorization serves only as a preconditioner for the coefficient matrix, i.e. it is applied when updating the residuals in each iteration. The issue is this: if the solution converges in a given number of iterations (say 2), the analysis step, executed only once, takes ~330 ms, while the 2 iterations together take ~30 ms. I’ve read one of the papers you posted, and I know the analysis is expected to take longer than a solve, but is it normal for it to take 11x (330/30) as long?!

In most of my trials, the analysis accounts for around 85% of the overall solution time, including all iterations! And I repeat, the analysis is done only once!

Would it be better to move this post to “CUDA Programming and Development”? Is it against the forum rules if I re-post it there myself?
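
The numbers in this thread can be plugged into a small amortization calculation. The sketch below assumes that the ~30 ms for 2 iterations splits evenly into 15 ms per iteration and that the analysis cost stays fixed at 330 ms. Both function names are hypothetical.

```c
#include <assert.h>

/* Fraction of total time spent in a one-time analysis when followed by
   `iters` iterations that each cost per_iter_ms. */
double analysis_share(double analysis_ms, double per_iter_ms, int iters)
{
    assert(iters > 0 && analysis_ms >= 0.0 && per_iter_ms > 0.0);
    return analysis_ms / (analysis_ms + per_iter_ms * iters);
}

/* Smallest iteration count at which the analysis share drops to
   target_share or below. */
int iters_for_share(double analysis_ms, double per_iter_ms, double target_share)
{
    assert(target_share > 0.0 && target_share < 1.0);
    int iters = 1;
    while (analysis_share(analysis_ms, per_iter_ms, iters) > target_share)
        ++iters;
    return iters;
}
```

With those assumed figures, analysis_share(330.0, 15.0, 2) is about 0.92, and iters_for_share(330.0, 15.0, 0.5) gives 22: the analysis stops dominating only once the solver runs for a couple of dozen iterations.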