Problem with "cusparseDcsrsv"

I have coded a Preconditioned Conjugate Gradient (PCG) solver for linear systems of equations with Incomplete Cholesky (IC0) preconditioning, using the CUBLAS and CUSPARSE libraries. I used Dr. Maxim Naumov's papers from NVIDIA research as a guideline. The matrix is stored in CSR format, and the lower triangular Cholesky factor is computed on the CPU and copied to the GPU before the PCG procedure starts.
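For context, the factor arrays are copied to the device in the usual way before the solver runs; roughly like this (a minimal sketch, with hypothetical host array names h_valsIC0, h_rowIC0, h_colIC0):

    double *d_valsIC0 = 0; int *d_rowIC0 = 0, *d_colIC0 = 0;
    cudaMalloc((void **)&d_valsIC0, lnz * sizeof(double));      // lnz = nonzeros in the factor
    cudaMalloc((void **)&d_rowIC0,  (N + 1) * sizeof(int));     // CSR row pointers
    cudaMalloc((void **)&d_colIC0,  lnz * sizeof(int));         // CSR column indices
    cudaMemcpy(d_valsIC0, h_valsIC0, lnz * sizeof(double),  cudaMemcpyHostToDevice);
    cudaMemcpy(d_rowIC0,  h_rowIC0,  (N + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_colIC0,  h_colIC0,  lnz * sizeof(int),     cudaMemcpyHostToDevice);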
The PCG code itself is as follows:

/* Create info objects for the IC0 preconditioner */
    cusparseSolveAnalysisInfo_t info_L;
    cusparseCreateSolveAnalysisInfo(&info_L);

    cusparseSolveAnalysisInfo_t info_Lt;
    cusparseCreateSolveAnalysisInfo(&info_Lt);

    cusparseMatDescr_t descrL = 0;
    cusparseStatus = cusparseCreateMatDescr(&descrL);
    cusparseSetMatType(descrL,CUSPARSE_MATRIX_TYPE_TRIANGULAR);
    cusparseSetMatIndexBase(descrL,CUSPARSE_INDEX_BASE_ONE);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_NON_UNIT);

    cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N,
                                             lnz, descrL, d_valsIC0, d_rowIC0, d_colIC0, info_L);

    cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N,
                                             lnz, descrL, d_valsIC0, d_rowIC0, d_colIC0, info_Lt);

    iter = 0;
    cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
    tol = Rn;
    while (r1 > tol*tol && iter <= max_iter)
    {
        // Forward solve: L * y = r (lower triangular factor, analysis info info_L)
        cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N,
                                              &done, descrL, d_valsIC0, d_rowIC0, d_colIC0, info_L, d_r, d_y);

        if (checkCudaErrors(cusparseStatus))
        {
            exit(EXIT_FAILURE);
        }

        // Backward solve: L^T * z = y (transpose of the same factor, analysis info info_Lt)
        cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N, &done,
                                              descrL, d_valsIC0, d_rowIC0, d_colIC0, info_Lt, d_y, d_zm1);

        if (checkCudaErrors(cusparseStatus))
        {
            exit(EXIT_FAILURE);
        }

        iter++;

        if (iter == 1)
        {
            cublasDcopy(cublasHandle, N, d_zm1, 1, d_p, 1);
        }
        else
        {
            cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
            cublasDdot(cublasHandle, N, d_rm2, 1, d_zm2, 1, &denominator);
            beta = numerator/denominator;
            cublasDscal(cublasHandle, N, &beta, d_p, 1);
            cublasDaxpy(cublasHandle, N, &done, d_zm1, 1, d_p, 1) ;
        }

        cusparseDcsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &done, descr, d_val,
                       d_row, d_col, d_p, &dzero, d_omega);

        cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
        cublasDdot(cublasHandle, N, d_p, 1, d_omega, 1, &denominator);
        alpha = numerator / denominator;
        cublasDaxpy(cublasHandle, N, &alpha, d_p, 1, d_x, 1);
        cublasDcopy(cublasHandle, N, d_r, 1, d_rm2, 1);
        cublasDcopy(cublasHandle, N, d_zm1, 1, d_zm2, 1);
        nalpha = -alpha;
        cublasDaxpy(cublasHandle, N, &nalpha, d_omega, 1, d_r, 1);
        cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
    }

    printf("  iteration = %3d, residual = %e \n", iter, sqrt(r1));

The matrix is stored on the GPU in d_val, d_row, and d_col, with dimension N and nz nonzeros.
The lower triangular Cholesky factor is stored on the GPU in d_valsIC0, d_rowIC0, and d_colIC0, with lnz nonzeros.
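For reference, a tiny made-up example of the 1-based CSR layout these arrays use (not my actual data):

    /* Hypothetical 3x3 SPD matrix (N = 3, nz = 7), 1-based CSR:
           [ 4 1 0 ]
           [ 1 5 2 ]
           [ 0 2 6 ]
       d_val = { 4, 1,   1, 5, 2,   2, 6 }
       d_col = { 1, 2,   1, 2, 3,   2, 3 }   // column indices, row by row
       d_row = { 1, 3, 6, 8 }                // N+1 row pointers into d_val/d_col
    */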

When I run the program, I get the following at the very first iteration: “residual = 1.#QNAN0e+000”.

I tested the same data with a stand-alone (unpreconditioned) CG solver and it worked well. It also works on the CPU using MKL.
My guess is that there is a problem with “cusparseDcsrsv_solve”, but I am not sure. I have also tried other configurations for the matrix descriptor, but it failed with the same error every time. Could anybody help me with this issue?

Note: my platform is Windows 7 64-bit, I am using CUDA 5.0, and my GPU is a Tesla K20.

Many thanks,

Omid

Have you considered the possibility of a mismatch between the matrix descriptor settings and the actual matrix? What is the smallest matrix that reproduces the problem? Could you show the matrix, or at least its first 10 rows?

Hi njuffa,

Thanks for your reply.
Yes, I have double-checked both the matrix and the IC factor. I noticed that the CUSPARSE manual states:
“Sparse matrices in CSR format are assumed to be stored in row-major CSR format, in other words, the index arrays are first sorted by row indices and then within the same row by column indices.”
My matrices (both the coefficient matrix and the IC factor) do not exactly meet these criteria: the column indices within each row are not in increasing order.
Could this be the reason for my problem?
If so, how come CG without preconditioning works fine despite the incorrect matrix storage?

Thank you,

Omid

Your matrix must follow the ordering requirements spelled out by the documentation exactly. The components of the solver code that come into play when using the preconditioner rely on this ordering, and violating it produces the NaNs you are seeing. Without the preconditioner, the solver may “happen to work” even though the ordering does not follow the specification, because the components that rely on it are not used.

In general it is a good idea to follow the requirements spelled out in documentation. The fact that some code may work by chance if some requirements are not met does not change this basic fact.
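If it helps, the fix can be as simple as sorting each CSR row by column index on the host before uploading the arrays; something along these lines (an untested sketch, not tied to your variable names, with the index base passed in):

    /* Sort the column indices within each CSR row in increasing order,
       permuting the values alongside. 'base' is the index base (0 or 1). */
    void sort_csr_rows(int n, const int *rowPtr, int *colInd, double *val, int base)
    {
        for (int i = 0; i < n; i++) {
            int start = rowPtr[i] - base;
            int end   = rowPtr[i + 1] - base;
            for (int j = start + 1; j < end; j++) {        /* insertion sort per row */
                int    c = colInd[j];
                double v = val[j];
                int    k = j - 1;
                while (k >= start && colInd[k] > c) {
                    colInd[k + 1] = colInd[k];
                    val[k + 1]    = val[k];
                    k--;
                }
                colInd[k + 1] = c;
                val[k + 1]    = v;
            }
        }
    }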

Thanks for your help. I got the program working fine; as you mentioned, the problem was with the storage format of the matrices.
In my previous program, I computed the Cholesky factor on the CPU and then sent it to the GPU to be used in the CUSPARSE PCG algorithm.

Now, I would like to use the IC0 preconditioning routine available in CUSPARSE 5.0, i.e. cusparseDcsric0.
I coded the program as follows:

/* create the analysis info object for the A matrix */
    cusparseSolveAnalysisInfo_t infoA = 0;
    cusparseStatus = cusparseCreateSolveAnalysisInfo(&infoA);

    if (checkCudaErrors(cusparseStatus)) return EXIT_FAILURE;

    /* Perform the analysis for the Non-Transpose case */
    cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                             N, nz, descr, d_val, d_row, d_col, infoA);

    if (checkCudaErrors(cusparseStatus))
    {
        exit(EXIT_FAILURE);
    }

    /* Copy A data to IC0 vals as input*/
    cudaMemcpy(d_valsIC0, d_val, nz*sizeof(double), cudaMemcpyDeviceToDevice);

    /* Generate the incomplete Cholesky factor for the matrix A (in place) using cusparseDcsric0 */
    cusparseStatus = cusparseDcsric0(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, descr,
                                     d_valsIC0, d_row, d_col, infoA);

    if (checkCudaErrors(cusparseStatus))
    {
        exit(EXIT_FAILURE);
    }

    /* Create info objects for the IC0 preconditioner */
    cusparseSolveAnalysisInfo_t info_Lt;
    cusparseCreateSolveAnalysisInfo(&info_Lt);

    cusparseMatDescr_t descrL = 0;
    cusparseStatus = cusparseCreateMatDescr(&descrL);
    cusparseSetMatType(descrL,CUSPARSE_MATRIX_TYPE_TRIANGULAR);
    cusparseSetMatIndexBase(descrL,CUSPARSE_INDEX_BASE_ONE);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_NON_UNIT);

    cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N,
                                             nz, descrL, d_valsIC0, d_row, d_col, info_Lt);

    iter = 0;
    cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
    tol = Rn;
    while (r1 > tol*tol && iter <= max_iter)
    {
        // Forward solve: L * y = r (we can re-use infoA since the sparsity pattern of A matches that of L)
        cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N,
                                              &done, descrL, d_valsIC0, d_row, d_col, infoA, d_r, d_y);

        if (checkCudaErrors(cusparseStatus))
        {
            exit(EXIT_FAILURE);
        }

        // Backward solve: L^T * z = y
        cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N, &done,
                                              descrL, d_valsIC0, d_row, d_col, info_Lt, d_y, d_zm1);

        if (checkCudaErrors(cusparseStatus))
        {
            exit(EXIT_FAILURE);
        }

        iter++;

        if (iter == 1)
        {
            cublasDcopy(cublasHandle, N, d_zm1, 1, d_p, 1);
        }
        else
        {
            cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
            cublasDdot(cublasHandle, N, d_rm2, 1, d_zm2, 1, &denominator);
            beta = numerator/denominator;
            cublasDscal(cublasHandle, N, &beta, d_p, 1);
            cublasDaxpy(cublasHandle, N, &done, d_zm1, 1, d_p, 1) ;
        }

        cusparseDcsrmv(cusparseHandle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &done, descr, d_val, 
                                  d_row, d_col, d_p, &dzero, d_omega);

        cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
        cublasDdot(cublasHandle, N, d_p, 1, d_omega, 1, &denominator);
        alpha = numerator / denominator;
        cublasDaxpy(cublasHandle, N, &alpha, d_p, 1, d_x, 1);
        cublasDcopy(cublasHandle, N, d_r, 1, d_rm2, 1);
        cublasDcopy(cublasHandle, N, d_zm1, 1, d_zm2, 1);
        nalpha = -alpha;
        cublasDaxpy(cublasHandle, N, &nalpha, d_omega, 1, d_r, 1);
        cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
    }

    printf("  iteration = %3d, residual = %e \n", iter, sqrt(r1));

With the storage format modified as required by the CUSPARSE library, I tried this new program on several matrices. It works well for small problems; for example, for an SPD matrix of size 422 with 7504 nonzeros the program runs fine. However, it fails with NaN values again on a larger problem (an SPD matrix of size 6487 with 117039 nonzeros).
Do you have any idea why this happens?

Many thanks

Omid

Sorry, I am not an expert on CUSPARSE, but in general it does not produce NaNs from valid input, so NaN results suggest that there may still be a problem with the input data.

If, after a thorough check of all the inputs against the documented requirements, you believe there is a bug in CUSPARSE, please file a bug report using the bug reporting form linked from the registered developer website.
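For that check, something like the following host-side routine may help (a rough, untested sketch assuming 1-based CSR indexing, as in your descriptors). It catches the most common problems: non-monotone row pointers, out-of-range or unsorted column indices, and a missing, zero, or NaN diagonal entry, which the non-unit triangular solve cannot tolerate:

    /* Rough sketch (1-based CSR): returns 0 if the data passes basic checks. */
    int check_csr(int n, const int *rowPtr, const int *colInd, const double *val)
    {
        for (int i = 0; i < n; i++) {
            int start = rowPtr[i] - 1;
            int end   = rowPtr[i + 1] - 1;
            int has_diag = 0;
            if (end < start) return 1;                        /* row pointers must be non-decreasing */
            for (int j = start; j < end; j++) {
                if (colInd[j] < 1 || colInd[j] > n) return 2;           /* column index out of range */
                if (j > start && colInd[j] <= colInd[j - 1]) return 3;  /* columns must increase within a row */
                if (colInd[j] == i + 1) {
                    has_diag = 1;
                    if (val[j] == 0.0 || val[j] != val[j]) return 4;    /* zero or NaN on the diagonal */
                }
            }
            if (!has_diag) return 5;                          /* no diagonal entry in this row */
        }
        return 0;
    }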