solving linear system with cusparse

Hey,

I try to solve a linear equation system coming from FEM algorithm with cuSparse. I use the example from the cuSparse documentation with LU decomposition (my matrix is non-symmetric) and solve the system with cusparseDcsrsm2_solve.
The problem is: I compare the solution from cuSpase with the solution calculated on CPU (Host), but the solution differs from the host solution (calculated with intel pardiso from the mkl lib).

Can anybody help me? Is there a bug in the code?

Here is my code:

cusparseStatus_t status;
	cusparseHandle_t handle 	= 0;

	cusparseMatDescr_t descr_M = 0;
	cusparseMatDescr_t descr_L = 0;
	cusparseMatDescr_t descr_U = 0;
	csrilu02Info_t info_M  = 0;
	csrsv2Info_t  info_L  = 0;
	csrsv2Info_t  info_U  = 0;
	int pBufferSize_M;
	int pBufferSize_L;
	int pBufferSize_U;
	int pBufferSize;
	void *pBuffer = 0;
	int structural_zero;
	int numerical_zero;
	const double alpha = 1.;
	const cusparseSolvePolicy_t policy_M = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
	const cusparseSolvePolicy_t policy_L = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
	const cusparseSolvePolicy_t policy_U = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
	const cusparseOperation_t trans_L  = CUSPARSE_OPERATION_NON_TRANSPOSE;
	const cusparseOperation_t trans_U  = CUSPARSE_OPERATION_NON_TRANSPOSE;

	// -- init cuSPARSE
	cusparseCreate(&handle);

	/******************************************************/
	/* SETUP DESCRIPTOR FOR MATRIX A AND LU DECOMPOSITION */
	/******************************************************/
	// - matrix M is base-1
	// - matrix L is base-1
	// - matrix L is lower triangular
	// - matrix L has unit diagonal 
	// - matrix U is base-1
	// - matrix U is upper triangular
	// - matrix U has non-unit diagonal
	cusparseCreateMatDescr(&descr_M);
	cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ZERO);
	cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

	cusparseCreateMatDescr(&descr_L);
	cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ZERO);
	cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
	cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
	cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_UNIT);

	cusparseCreateMatDescr(&descr_U);
	cusparseSetMatIndexBase(descr_U, CUSPARSE_INDEX_BASE_ZERO);
	cusparseSetMatType(descr_U, CUSPARSE_MATRIX_TYPE_GENERAL);
	cusparseSetMatFillMode(descr_U, CUSPARSE_FILL_MODE_UPPER);
	cusparseSetMatDiagType(descr_U, CUSPARSE_DIAG_TYPE_NON_UNIT);

	/*****************************************/
	/* STEP 2: create a empty info structure */
	/*****************************************/
	// we need one info for csrilu02 and two info's for csrsv2
	cusparseCreateCsrilu02Info(&info_M);
	cusparseCreateCsrsv2Info(&info_L);
	cusparseCreateCsrsv2Info(&info_U);

	/**************************************************************************************/
	/* STEP 3: query how much memory used in csrilu02 and csrsv2, and allocate the buffer */
	/**************************************************************************************/

	cusparseDcsrilu02_bufferSize(handle, ldrsd, nnz,
	    descr_M, d_acsr, d_iacsr, d_jacsr, info_M, &pBufferSize_M);
	cusparseDcsrsv2_bufferSize(handle, trans_L, ldrsd, nnz, 
	    descr_L, d_acsr, d_iacsr, d_jacsr, info_L, &pBufferSize_L);
	cusparseDcsrsv2_bufferSize(handle, trans_U, ldrsd, nnz, 
	    descr_U, d_acsr, d_iacsr, d_jacsr, info_U, &pBufferSize_U);

	pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_U));

	// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
	cudaMalloc((void**)&pBuffer, pBufferSize);

	/*******************************************************/
	// STEP 4: perform analysis of incomplete Cholesky on M
	//         perform analysis of triangular solve on L
	//         perform analysis of triangular solve on U 
	/*******************************************************/
	// The lower(upper) triangular part of M has the same sparsity pattern as L(U), 
	// we can do analysis of csrilu0 and csrsv2 simultaneously.

	cusparseDcsrilu02_analysis(handle, ldrsd, nnz, descr_M,
	    d_acsr, d_iacsr, d_jacsr, info_M, 
	    policy_M, pBuffer);
	status = cusparseXcsrilu02_zeroPivot(handle, info_M, &structural_zero);
	if (CUSPARSE_STATUS_ZERO_PIVOT == status){
	   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
	}

	cusparseDcsrsv2_analysis(handle, trans_L, ldrsd, nnz, descr_L, 
	    d_acsr, d_iacsr, d_jacsr,
	    info_L, policy_L, pBuffer);

	cusparseDcsrsv2_analysis(handle, trans_U, ldrsd, nnz, descr_U, 
	    d_acsr, d_iacsr, d_jacsr,
	    info_U, policy_U, pBuffer);

	/********************/
	// step 5: M = L * U /
	/********************/
	
	cusparseDcsrilu02(handle, ldrsd, nnz, descr_M,
	    d_acsr, d_iacsr, d_jacsr, info_M, policy_M, pBuffer);
	status = cusparseXcsrilu02_zeroPivot(handle, info_M, &numerical_zero);
	if (CUSPARSE_STATUS_ZERO_PIVOT == status){
	   printf("U(%d,%d) is zero\n", numerical_zero, numerical_zero);
	}

	/************************/
	// step 6: solve L*z = x
	/************************/
	cusparseDcsrsv2_solve(handle, trans_L, ldrsd, nnz, &alpha, descr_L,
	   d_acsr, d_iacsr, d_jacsr, info_L,
	   d_rsd, d_X, policy_L, pBuffer);

	/************************/
	// step 7: solve U*y = z
	/************************/
        double *d_z;        cudaMalloc(&d_z, ldrsd * sizeof(double));

	cusparseDcsrsv2_solve(handle, trans_U, ldrsd, nnz, &alpha, descr_U,
	   d_acsr, d_iacsr, d_jacsr, info_U,
	   d_X, d_z, policy_U, pBuffer);

	cudaFree(pBuffer);
	cusparseDestroyMatDescr(descr_M);
	cusparseDestroyMatDescr(descr_L);
	cusparseDestroyMatDescr(descr_U);
	cusparseDestroyCsrilu02Info(info_M);
	cusparseDestroyCsrsv2Info(info_L);
	cusparseDestroyCsrsv2Info(info_U);
	cusparseDestroy(handle);

except that your code does not use sm2 it is using sv2

anyway, If you want help I suggest providing a complete code, with the actual input provided that produces the erroneous result, along with what you think the correct result should be.

Furthermore, the code you have posted displays no evidence of proper error checking of any kind.

I’m sorry, that was my mistake. I would say I use cusparseDcsrsv2 routine.

After a few testing problems, I found a small problem with a 4x4 matrix, which I can also easily proof with Matlab.

The (full) matrix and the rhs have the following form (matlab syntax):

A = [395,024206111111,	39,6212291666667,	-7,99360577730113e-15,	-149,535844027778;
39,6212291666667,	87,0185660185185,	-48,7418524074074,	0;
1,77635683940025e-15,	-48,7418524074074,	174,037132037037,	6,97197916666666;
-149,535844027778,	0,	6,97197916666667,	197,512103055555 ];

b = [1,96395463125000;
-0,119086614814817;
0,178524088425921;
-2,23569157916667];

My sparse arrays have the form:

iacsr = [0, 4, 7, 11, 14];
jacsr = [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 0, 2, 3];
acsr  = [395.02420611111091, 39.621229166666652, -7.9936057773011271e-15, -149.5358440277777, 39.621229166666652, 87.018566018518499, -48.741852407407414, 1.7763568394002505e-15, -48.741852407407414, 174.03713203703703, 6.971979166666662, -149.53584402777773, 6.9719791666666655, 197.51210305555549];

The right solution, calculated with Matlab and intel pardiso solver is:

x = [0,00112045865736888;
-0,00126705805562414;
0,00109193611565514;
-0,0105095119330718];

The solution calculated with cuSparse is:

x = [0.0013087668620991859;
-0.0035356414822175848;
0.00046073524827001814;
-0.010613151062885269]

If I solve a full system (without any zero), cuSparse calculate the same solution as Matlab and intel pardiso. Is there an error with indexing?

Ok, after proofing every Step with Matlab, I understand the problem. The incomplete LU decomposition is right and equal to Matlab. But I need a full LU decomposition because the incomplete LU-decomposition is not suitable for direct solving. Instead, it is suitable for an iterative solver.

So I’m searching for a direct solver for linear equation system for sparse arrays. I found cuSolverSp, but the problem is, that cuSolverSp expect the arrays on host memory. But my arrays are already on device memory.
Is there a direct solver available which works with arrays on device memory?

Take a look at the sample codes cuSolverSp_LowlevelQR and cuSolverSp_LowlevelCholesky

see the discussion here:

https://devtalk.nvidia.com/default/topic/1048452/gpu-accelerated-libraries/sparse-cusolver-inside-loop-factorization-at-every-call-/

For example, steps 9-14 in the QR sample code start with arrays on device memory.