I coded a Preconditioned Conjugate Gradient solver (for linear system of equations) with Incomplete Cholesky preconditioning using CUBLAS and CUSPARSE libraries. I used Dr. Maxim Naumov’s papers on nvidia’s research community as a guideline. I have the matrix in CSR format and the lower triangular Cholesky factor is also obtained on CPU and is sent and present in GPU before the PCG procedure.

The related code is as follows:

```
/* Create info objects for the IC0 preconditioner */
cusparseSolveAnalysisInfo_t info_L;
cusparseCreateSolveAnalysisInfo(&info_L);
cusparseSolveAnalysisInfo_t info_Lt;
cusparseCreateSolveAnalysisInfo(&info_Lt);
cusparseMatDescr_t descrL = 0;
cusparseStatus = cusparseCreateMatDescr(&descrL);
cusparseSetMatType(descrL,CUSPARSE_MATRIX_TYPE_TRIANGULAR);
cusparseSetMatIndexBase(descrL,CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_NON_UNIT);
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N,
lnz,descrL, d_valsIC0, d_rowIC0, d_colIC0, info_L);
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N,
lnz, descrL, d_valsIC0, d_rowIC0, d_colIC0, info_Lt);
iter = 0;
cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
tol = Rn;
while (r1 > tol*tol && iter <= max_iter)
{
// Forward Solve, we can re-use infoA since the sparsity pattern of A matches that of L
cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N,
&done, descrL,d_valsIC0, d_rowIC0, d_colIC0, info_L, d_r, d_y);
if (checkCudaErrors(cusparseStatus))
{
exit(EXIT_FAILURE);
}
// Back Substitution
cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_TRANSPOSE, N, &done,
descrL,d_valsIC0, d_rowIC0, d_colIC0, info_Lt, d_y, d_zm1);
if (checkCudaErrors(cusparseStatus))
{
exit(EXIT_FAILURE);
}
iter++;
if (iter == 1)
{
cublasDcopy(cublasHandle, N, d_zm1, 1, d_p, 1);
}
else
{
cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
cublasDdot(cublasHandle, N, d_rm2, 1, d_zm2, 1, &denominator);
beta = numerator/denominator;
cublasDscal(cublasHandle, N, &beta, d_p, 1);
cublasDaxpy(cublasHandle, N, &done, d_zm1, 1, d_p, 1) ;
}
cusparseDcsrmv(cusparseHandle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &done, descr, d_val,
d_row, d_col, d_p, &dzero, d_omega);
cublasDdot(cublasHandle, N, d_r, 1, d_zm1, 1, &numerator);
cublasDdot(cublasHandle, N, d_p, 1, d_omega, 1, &denominator);
alpha = numerator / denominator;
cublasDaxpy(cublasHandle, N, &alpha, d_p, 1, d_x, 1);
cublasDcopy(cublasHandle, N, d_r, 1, d_rm2, 1);
cublasDcopy(cublasHandle, N, d_zm1, 1, d_zm2, 1);
nalpha = -alpha;
cublasDaxpy(cublasHandle, N, &nalpha, d_omega, 1, d_r, 1);
cublasDdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
}
printf(" iteration = %3d, residual = %e \n", iter, sqrt(r1));
```

The matrix is stored on GPU in: d_val, d_row(), d_col, with dimension N and number of nonzeros nz.

The lower-triangular Cholesky factor is stored on GPU in: d_valIC0, d_rowIC0, d_colIC0 with the number of nonzeros to be lnz.

When I run the program, I get the following at the very first iteration: “residual = 1.#QNAN0e+000”.

I tested the data with stand-alone CG and it worked well. Also, it works well on CPU using MKL.

my guess is that there is a problem with “cusparseDcsrsv_solve”, yet I am not sure. I have also tried other configs for the matrix description, but it failed with the same error all the time. could anybody help me with this issue?

Note: my platform is windows 7-64bit - I am using CUDA 5.0 and my GPU is Tesla k20.

Many thanks,

Omid