solving linear system with cusparse

bischof.rob · January 14, 2019, 5:43pm

Hey,

I try to solve a linear equation system coming from FEM algorithm with cuSparse. I use the example from the cuSparse documentation with LU decomposition (my matrix is non-symmetric) and solve the system with cusparseDcsrsm2_solve.
The problem is: I compare the solution from cuSpase with the solution calculated on CPU (Host), but the solution differs from the host solution (calculated with intel pardiso from the mkl lib).

Can anybody help me? Is there a bug in the code?

Here is my code:

cusparseStatus_t status;
	cusparseHandle_t handle 	= 0;

	cusparseMatDescr_t descr_M = 0;
	cusparseMatDescr_t descr_L = 0;
	cusparseMatDescr_t descr_U = 0;
	csrilu02Info_t info_M  = 0;
	csrsv2Info_t  info_L  = 0;
	csrsv2Info_t  info_U  = 0;
	int pBufferSize_M;
	int pBufferSize_L;
	int pBufferSize_U;
	int pBufferSize;
	void *pBuffer = 0;
	int structural_zero;
	int numerical_zero;
	const double alpha = 1.;
	const cusparseSolvePolicy_t policy_M = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
	const cusparseSolvePolicy_t policy_L = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
	const cusparseSolvePolicy_t policy_U = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
	const cusparseOperation_t trans_L  = CUSPARSE_OPERATION_NON_TRANSPOSE;
	const cusparseOperation_t trans_U  = CUSPARSE_OPERATION_NON_TRANSPOSE;

	// -- init cuSPARSE
	cusparseCreate(&handle);

	/******************************************************/
	/* SETUP DESCRIPTOR FOR MATRIX A AND LU DECOMPOSITION */
	/******************************************************/
	// - matrix M is base-1
	// - matrix L is base-1
	// - matrix L is lower triangular
	// - matrix L has unit diagonal 
	// - matrix U is base-1
	// - matrix U is upper triangular
	// - matrix U has non-unit diagonal
	cusparseCreateMatDescr(&descr_M);
	cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ZERO);
	cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

	cusparseCreateMatDescr(&descr_L);
	cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ZERO);
	cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
	cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
	cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_UNIT);

	cusparseCreateMatDescr(&descr_U);
	cusparseSetMatIndexBase(descr_U, CUSPARSE_INDEX_BASE_ZERO);
	cusparseSetMatType(descr_U, CUSPARSE_MATRIX_TYPE_GENERAL);
	cusparseSetMatFillMode(descr_U, CUSPARSE_FILL_MODE_UPPER);
	cusparseSetMatDiagType(descr_U, CUSPARSE_DIAG_TYPE_NON_UNIT);

	/*****************************************/
	/* STEP 2: create a empty info structure */
	/*****************************************/
	// we need one info for csrilu02 and two info's for csrsv2
	cusparseCreateCsrilu02Info(&info_M);
	cusparseCreateCsrsv2Info(&info_L);
	cusparseCreateCsrsv2Info(&info_U);

	/**************************************************************************************/
	/* STEP 3: query how much memory used in csrilu02 and csrsv2, and allocate the buffer */
	/**************************************************************************************/

	cusparseDcsrilu02_bufferSize(handle, ldrsd, nnz,
	    descr_M, d_acsr, d_iacsr, d_jacsr, info_M, &pBufferSize_M);
	cusparseDcsrsv2_bufferSize(handle, trans_L, ldrsd, nnz, 
	    descr_L, d_acsr, d_iacsr, d_jacsr, info_L, &pBufferSize_L);
	cusparseDcsrsv2_bufferSize(handle, trans_U, ldrsd, nnz, 
	    descr_U, d_acsr, d_iacsr, d_jacsr, info_U, &pBufferSize_U);

	pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_U));

	// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
	cudaMalloc((void**)&pBuffer, pBufferSize);

	/*******************************************************/
	// STEP 4: perform analysis of incomplete Cholesky on M
	//         perform analysis of triangular solve on L
	//         perform analysis of triangular solve on U 
	/*******************************************************/
	// The lower(upper) triangular part of M has the same sparsity pattern as L(U), 
	// we can do analysis of csrilu0 and csrsv2 simultaneously.

	cusparseDcsrilu02_analysis(handle, ldrsd, nnz, descr_M,
	    d_acsr, d_iacsr, d_jacsr, info_M, 
	    policy_M, pBuffer);
	status = cusparseXcsrilu02_zeroPivot(handle, info_M, &structural_zero);
	if (CUSPARSE_STATUS_ZERO_PIVOT == status){
	   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
	}

	cusparseDcsrsv2_analysis(handle, trans_L, ldrsd, nnz, descr_L, 
	    d_acsr, d_iacsr, d_jacsr,
	    info_L, policy_L, pBuffer);

	cusparseDcsrsv2_analysis(handle, trans_U, ldrsd, nnz, descr_U, 
	    d_acsr, d_iacsr, d_jacsr,
	    info_U, policy_U, pBuffer);

	/********************/
	// step 5: M = L * U /
	/********************/
	
	cusparseDcsrilu02(handle, ldrsd, nnz, descr_M,
	    d_acsr, d_iacsr, d_jacsr, info_M, policy_M, pBuffer);
	status = cusparseXcsrilu02_zeroPivot(handle, info_M, &numerical_zero);
	if (CUSPARSE_STATUS_ZERO_PIVOT == status){
	   printf("U(%d,%d) is zero\n", numerical_zero, numerical_zero);
	}

	/************************/
	// step 6: solve L*z = x
	/************************/
	cusparseDcsrsv2_solve(handle, trans_L, ldrsd, nnz, &alpha, descr_L,
	   d_acsr, d_iacsr, d_jacsr, info_L,
	   d_rsd, d_X, policy_L, pBuffer);

	/************************/
	// step 7: solve U*y = z
	/************************/
        double *d_z;        cudaMalloc(&d_z, ldrsd * sizeof(double));

	cusparseDcsrsv2_solve(handle, trans_U, ldrsd, nnz, &alpha, descr_U,
	   d_acsr, d_iacsr, d_jacsr, info_U,
	   d_X, d_z, policy_U, pBuffer);

	cudaFree(pBuffer);
	cusparseDestroyMatDescr(descr_M);
	cusparseDestroyMatDescr(descr_L);
	cusparseDestroyMatDescr(descr_U);
	cusparseDestroyCsrilu02Info(info_M);
	cusparseDestroyCsrsv2Info(info_L);
	cusparseDestroyCsrsv2Info(info_U);
	cusparseDestroy(handle);

Robert_Crovella · January 14, 2019, 5:54pm

except that your code does not use sm2 it is using sv2

anyway, If you want help I suggest providing a complete code, with the actual input provided that produces the erroneous result, along with what you think the correct result should be.

Furthermore, the code you have posted displays no evidence of proper error checking of any kind.

bischof.rob · January 14, 2019, 10:35pm

I’m sorry, that was my mistake. I would say I use cusparseDcsrsv2 routine.

After a few testing problems, I found a small problem with a 4x4 matrix, which I can also easily proof with Matlab.

The (full) matrix and the rhs have the following form (matlab syntax):

A = [395,024206111111,	39,6212291666667,	-7,99360577730113e-15,	-149,535844027778;
39,6212291666667,	87,0185660185185,	-48,7418524074074,	0;
1,77635683940025e-15,	-48,7418524074074,	174,037132037037,	6,97197916666666;
-149,535844027778,	0,	6,97197916666667,	197,512103055555 ];

b = [1,96395463125000;
-0,119086614814817;
0,178524088425921;
-2,23569157916667];

My sparse arrays have the form:

iacsr = [0, 4, 7, 11, 14];
jacsr = [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 0, 2, 3];
acsr  = [395.02420611111091, 39.621229166666652, -7.9936057773011271e-15, -149.5358440277777, 39.621229166666652, 87.018566018518499, -48.741852407407414, 1.7763568394002505e-15, -48.741852407407414, 174.03713203703703, 6.971979166666662, -149.53584402777773, 6.9719791666666655, 197.51210305555549];

The right solution, calculated with Matlab and intel pardiso solver is:

x = [0,00112045865736888;
-0,00126705805562414;
0,00109193611565514;
-0,0105095119330718];

The solution calculated with cuSparse is:

x = [0.0013087668620991859;
-0.0035356414822175848;
0.00046073524827001814;
-0.010613151062885269]

If I solve a full system (without any zero), cuSparse calculate the same solution as Matlab and intel pardiso. Is there an error with indexing?

bischof.rob · January 15, 2019, 8:16am

Ok, after proofing every Step with Matlab, I understand the problem. The incomplete LU decomposition is right and equal to Matlab. But I need a full LU decomposition because the incomplete LU-decomposition is not suitable for direct solving. Instead, it is suitable for an iterative solver.

So I’m searching for a direct solver for linear equation system for sparse arrays. I found cuSolverSp, but the problem is, that cuSolverSp expect the arrays on host memory. But my arrays are already on device memory.
Is there a direct solver available which works with arrays on device memory?

Robert_Crovella · March 13, 2019, 3:03am

Take a look at the sample codes cuSolverSp_LowlevelQR and cuSolverSp_LowlevelCholesky

see the discussion here:

[url]https://devtalk.nvidia.com/default/topic/1048452/gpu-accelerated-libraries/sparse-cusolver-inside-loop-factorization-at-every-call-/[/url]

For example, steps 9-14 in the QR sample code start with arrays on device memory.

Topic		Replies	Views
cusparseXcsrilu02_zeroPivot returns cusparse_status_internal_error everytime GPU-Accelerated Libraries cusparse	3	1415	November 17, 2021
CuSPARSE: Solver with LU decomposition returns wrong answer, maybe fill-in problem? GPU-Accelerated Libraries	3	2410	August 6, 2014
cusparseScsrilu02 breaks with large matrices GPU-Accelerated Libraries cublas , cusolver , cusparse	10	1158	June 1, 2022
Sparse LU decomposition got wrong results for large matrices CUDA Programming and Performance	2	634	July 20, 2018
cuSOLVER solving a overdetermined linear equation system GPU-Accelerated Libraries cusolver	8	1266	September 19, 2021
cuSPARSE BSR Matrix Solver GPU-Accelerated Libraries cuda , cusolver , cusparse	2	66	October 23, 2024
cuSparse incomplete LU decomposition as preconditioner GPU-Accelerated Libraries	9	2907	September 9, 2016
cuSPARSE LU decomposition error GPU-Accelerated Libraries	0	429	July 8, 2019
Problem with cusparseDcsrsv_solve() function GPU-Accelerated Libraries	2	1224	July 30, 2015
example of sparse LL^t or QR functionality GPU-Accelerated Libraries	2	1264	April 3, 2015

solving linear system with cusparse

Related topics