Problem with "cusparseDcsrsv"

OmidKar · May 21, 2013, 2:27am

I coded a Preconditioned Conjugate Gradient solver (for linear system of equations) with Incomplete Cholesky preconditioning using CUBLAS and CUSPARSE libraries. I used Dr. Maxim Naumov’s papers on nvidia’s research community as a guideline. I have the matrix in CSR format and the lower triangular Cholesky factor is also obtained on CPU and is sent and present in GPU before the PCG procedure.
The related code is as follows:

/* Create info objects for the IC0 preconditioner */
    cusparseSolveAnalysisInfo_t info_L;
    cusparseCreateSolveAnalysisInfo(&info_L);

	cusparseSolveAnalysisInfo_t info_Lt;
    cusparseCreateSolveAnalysisInfo(&info_Lt);

    cusparseMatDescr_t descrL = 0;
    cusparseStatus = cusparseCreateMatDescr(&descrL);
    cusparseSetMatType(descrL,CUSPARSE_MATRIX_TYPE_TRIANGULAR);
    cusparseSetMatIndexBase(descrL,CUSPARSE_INDEX_BASE_ONE);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_NON_UNIT);

    cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, 
					         lnz,descrL, d_valsIC0, d_rowIC0, d_colIC0, info_L);

    cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N, 
						lnz, descrL, d_valsIC0, d_rowIC0, d_colIC0, info_Lt);

    iter = 0;
    cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
    tol = Rn;
    while (r1 > tol*tol && iter <= max_iter)
    {
        // Forward Solve, we can re-use infoA since the sparsity pattern of A matches that of L
        cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, 
				      &done, descrL,d_valsIC0, d_rowIC0, d_colIC0, info_L, d_r, d_y);

        if (checkCudaErrors(cusparseStatus))
        {
            exit(EXIT_FAILURE);
        }

        // Back Substitution
        cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N, &done, 
					   descrL,d_valsIC0, d_rowIC0, d_colIC0, info_Lt, d_y, d_zm1);

        if (checkCudaErrors(cusparseStatus))
        {
            exit(EXIT_FAILURE);
        }

        iter++;

        if (iter == 1)
        {
            cublasDcopy(cublasHandle, N, d_zm1, 1, d_p, 1);
        }
        else
        {
            cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
            cublasDdot(cublasHandle, N, d_rm2, 1, d_zm2, 1, &denominator);
            beta = numerator/denominator;
            cublasDscal(cublasHandle, N, &beta, d_p, 1);
            cublasDaxpy(cublasHandle, N, &done, d_zm1, 1, d_p, 1) ;
        }

        cusparseDcsrmv(cusparseHandle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &done, descr, d_val, 
                                        d_row, d_col, d_p, &dzero, d_omega);

        cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
        cublasDdot(cublasHandle, N, d_p, 1, d_omega, 1, &denominator);
        alpha = numerator / denominator;
        cublasDaxpy(cublasHandle, N, &alpha, d_p, 1, d_x, 1);
        cublasDcopy(cublasHandle, N, d_r, 1, d_rm2, 1);
        cublasDcopy(cublasHandle, N, d_zm1, 1, d_zm2, 1);
        nalpha = -alpha;
        cublasDaxpy(cublasHandle, N, &nalpha, d_omega, 1, d_r, 1);
        cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
    }

    printf("  iteration = %3d, residual = %e \n", iter, sqrt(r1));

The matrix is stored on GPU in: d_val, d_row(), d_col, with dimension N and number of nonzeros nz.
The lower-triangular Cholesky factor is stored on GPU in: d_valIC0, d_rowIC0, d_colIC0 with the number of nonzeros to be lnz.

When I run the program, I get the following at the very first iteration: “residual = 1.#QNAN0e+000”.

I tested the data with stand-alone CG and it worked well. Also, it works well on CPU using MKL.
my guess is that there is a problem with “cusparseDcsrsv_solve”, yet I am not sure. I have also tried other configs for the matrix description, but it failed with the same error all the time. could anybody help me with this issue?

Note: my platform is windows 7-64bit - I am using CUDA 5.0 and my GPU is Tesla k20.

Many thanks,

Omid

njuffa · May 21, 2013, 7:15am

Have you considered the possibility of a mismatch between the matrix descriptor settings and the actual matrix? What is the smallest matrix that reproduces the problem? Could you show the matrix or a least the first 10 rows of it?

OmidKar · May 22, 2013, 6:57am

Hi njuffa,

Thanks for your reply.
Yes, I have double checked the matrix and also the IC factor. I noticed that it is mentioned in the cusparse manual that
“Sparse matrices in CSR format are assumed to be stored in row-major CSR format, in other words, the index arrays are first sorted by row indices and then within the same row by column indices.”
My matrices (both the coefficient matrix and the IC factor) do not exactly match these criteria (the order of column indices in each row is not increasing).
would this be the reason of my problem?
If so, how come the Cg without preconditioning is working fine in spite of my wrong matrix storage?

Thank you,

Omid

njuffa · May 22, 2013, 11:52pm

Your matrix must follow the ordering requirements spelled out by the documentation exactly. Components of the solver code that come into play when using the preconditioner rely on this ordering. Violating the ordering requirements results in the NaNs you are seeing. Without the preconditioner, the solver code may “happen to work” when the ordering isn’t quite following the specification since some solver components that rely on it are not used.

In general it is a good idea to follow the requirements spelled out in documentation. The fact that some code may work by chance if some requirements are not met does not change this basic fact.

OmidKar · May 23, 2013, 3:48am

Thanks for your help. I got the program working fine. As you mentioned, the problem was with the storage format of the matrices.
In my previous program, I computed the cholesky factor on the cpu and then send it to the gpu to be used with cusparse PCG algorithm.

No, I would like to use the ic0 preconditioning routine available in cusparse 5.0, ie cusparseDcsric.
I coded the program as follows:

/* create the analysis info object for the A matrix */
    cusparseSolveAnalysisInfo_t infoA = 0;
    cusparseStatus = cusparseCreateSolveAnalysisInfo(&infoA);

    if (checkCudaErrors(cusparseStatus)) return EXIT_FAILURE;

    /* Perform the analysis for the Non-Transpose case */
    cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                             N, nz, descr, d_val, d_row, d_col, infoA);

    if (checkCudaErrors(cusparseStatus))
    {
        exit(EXIT_FAILURE);
    }

    /* Copy A data to IC0 vals as input*/
    cudaMemcpy(d_valsIC0, d_val, nz*sizeof(double), cudaMemcpyDeviceToDevice);

    /* generate the Incomplete LU factor H for the matrix A using cudsparseScsrilu0 */
    cusparseStatus = cusparseDcsric0(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, descr, 
                                     d_valsIC0, d_row, d_col, infoA);

    if (checkCudaErrors(cusparseStatus))
    {
        exit(EXIT_FAILURE);
    }

    /* Create info objects for the IC0 preconditioner */
    cusparseSolveAnalysisInfo_t info_Lt;
    cusparseCreateSolveAnalysisInfo(&info_Lt);

    cusparseMatDescr_t descrL = 0;
    cusparseStatus = cusparseCreateMatDescr(&descrL);
    cusparseSetMatType(descrL,CUSPARSE_MATRIX_TYPE_TRIANGULAR);
    cusparseSetMatIndexBase(descrL,CUSPARSE_INDEX_BASE_ONE);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_NON_UNIT);

    cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, 
                                              nz,descrL, d_valsIC0, d_row, d_col, info_Lt);

    iter = 0;
    cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
	tol = Rn;
    while (r1 > tol*tol && iter <= max_iter)
    {
        // Forward Solve, we can re-use infoA since the sparsity pattern of A matches that of L
        cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, 
					      &done,descrL,d_valsIC0, d_row, d_col, infoA, d_r, d_y);

        if (checkCudaErrors(cusparseStatus))
        {
            exit(EXIT_FAILURE);
        }

        // Back Substitution
        cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N, &done, 
					      descrL,d_valsIC0, d_row, d_col, info_Lt, d_y, d_zm1);

        if (checkCudaErrors(cusparseStatus))
        {
            exit(EXIT_FAILURE);
        }

        iter++;

        if (iter == 1)
        {
            cublasDcopy(cublasHandle, N, d_zm1, 1, d_p, 1);
        }
        else
        {
            cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
            cublasDdot(cublasHandle, N, d_rm2, 1, d_zm2, 1, &denominator);
            beta = numerator/denominator;
            cublasDscal(cublasHandle, N, &beta, d_p, 1);
            cublasDaxpy(cublasHandle, N, &done, d_zm1, 1, d_p, 1) ;
        }

        cusparseDcsrmv(cusparseHandle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &done, descr, d_val, 
                                  d_row, d_col, d_p, &dzero, d_omega);

        cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
        cublasDdot(cublasHandle, N, d_p, 1, d_omega, 1, &denominator);
        alpha = numerator / denominator;
        cublasDaxpy(cublasHandle, N, &alpha, d_p, 1, d_x, 1);
        cublasDcopy(cublasHandle, N, d_r, 1, d_rm2, 1);
        cublasDcopy(cublasHandle, N, d_zm1, 1, d_zm2, 1);
        nalpha = -alpha;
        cublasDaxpy(cublasHandle, N, &nalpha, d_omega, 1, d_r, 1);
        cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
    }

    printf("  iteration = %3d, residual = %e \n", iter, sqrt(r1));

With the modified storage format as required by CUSPARSE library, I tried this new program for several matrices. It is working well for small matrices. I have a SPD matrix of size 422 with 7504 entries for which the program works fine. but then, it fails with the NaN values again for a larger problem (SPD matrix of size 6487 with 117039 nonzeros).
Do you have any idea why this happens?

Many thanks

Omid

njuffa · May 23, 2013, 7:52am

Sorry, I am not an expert on CUSPARSE, but in general it does not produce NaNs from valid input, so NaN results would suggest that there may still be a problem with the inputs data.

If, after a thorough check of the all the inputs against the documented requirements, you believe there is a bug in CUSPARSE, please file a bug report using the bug reporting form linked from the registered developer website.

Topic		Replies	Views
cusparse Incomplete Cholesky CG - incorrect results GPU-Accelerated Libraries	9	3772	June 16, 2013
cuSparse incomplete LU decomposition as preconditioner GPU-Accelerated Libraries	9	2900	September 9, 2016
Problem using cusparseScsric0 function GPU-Accelerated Libraries	22	6908	November 29, 2012
CUSPARSE conversion routines not working... cusparseSnnz and cusparseSdense2csr misbehaving... CUDA Programming and Performance	11	4158	February 28, 2011
Problem with cusparseXcoo2csr and cuda-gdb CUDA Programming and Performance	9	1426	January 12, 2012
Preconditioning with cusparseSpSV CSR GPU-Accelerated Libraries cusparse	10	1528	October 12, 2021
Problem with cusparseDcsrsv_solve() function GPU-Accelerated Libraries	2	1219	July 30, 2015
cusparseScsrmv transpose mode is not working CUDA Programming and Performance	17	1500	July 9, 2018
cusparseScsrilu02 breaks with large matrices GPU-Accelerated Libraries cublas , cusolver , cusparse	10	1154	June 1, 2022
Sparse LU decomposition got wrong results for large matrices CUDA Programming and Performance	2	632	July 20, 2018

Problem with "cusparseDcsrsv"

Related topics