Some questions about the function "cusparseDcsrsv2_solve"

Hello,
I want to solve a linear system of equations whose matrix is dense. I note that some library functions can be used, such as “cusparseDcsrsv2_solve”, listed in the CUDA Fortran Programming Guide Version 24.7 for ARM, OpenPower, x86 (nvidia.com). However, in the other manual, the cuSPARSE Library documentation (nvidia.com), I do not find this function “cusparseDcsrsv2_solve”. I think both manuals are published by NVIDIA, but I do not know how to use them together: the cuSPARSE manual gives detailed descriptions of the functions, while the CUDA Fortran manual only has simplified ones.
Now, I am still trying to use “Dcsrsv2_solve” correctly. Unfortunately, I get errors such as “NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure cusparsedcsrsv2_solve”.
Here is part of the code I used:
!Copyright (c) 2022 by LEEE under the guidance of Huaifeng Sun (sunhuaifeng@gmail.com)
!written by Shangbin Liu(lsbin87@126.com)

! --------------------------------Subroutine part---------------------------------------------!
subroutine ITERATION_CDM_CUSPARSE_LU
USE CONSTANTPARAMETERS
USE ELECTROMAGNETIC_VARIABLES
USE RES_MODEL_PARAMETER
USE TIME_PARAMETERS
use openacc
use cudafor
USE CUSPARSE
IMPLICIT NONE

TYPE(cusparseHandle) :: handle5
TYPE(cusparseCsrsv2Info) :: info
integer(4) :: transA, m, nnz,policy
TYPE(cusparseMatDescr):: descrA
REAL(KIND=8)::csrValA(14)
INTEGER(KIND=4) :: csrRowPtrA(5), csrColIndA(14)
INTEGER(KIND=4):: istat
INTEGER(KIND=4) :: pBuffer 
INTEGER(1), pointer  :: buffer(:)
real(8) :: x(4),y(4), alpha
INTEGER(4)::i

istat = cusparseCreate(handle5)
istat = cusparseCreateMatDescr(descrA)
istat = cusparseSetMatType(descrA,CUSPARSE_MATRIX_TYPE_GENERAL)
istat = cusparseSetMatIndexBase(descrA,CUSPARSE_INDEX_BASE_ONE)
istat = cusparseSetStream(handle5,acc_get_cuda_stream(33)) 

m=4
nnz=14
alpha=1.0D0
transA=0
policy=0

!$acc data create(handle5, transA, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info,policy, pBuffer)

istat = cusparseCreateCsrsv2Info(info)
istat = cusparseDcsrsv2_bufferSize(handle5, transA, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, pBuffer)
istat = cusparseDcsrsv2_analysis(handle5, transA, m, nnz, descrA, csrValA, csrRowPtrA, csrColIndA, info, policy, pBuffer)
ALLOCATE(buffer(pBuffer ) )

!$acc end data

!=========================The test Region1=========================!

!$acc data copyin(handle5, transA, m, nnz, alpha, descrA, csrValA, csrRowPtrA, csrColIndA, info, x, y, policy, pBuffer)
write(*,*)"++++"
data csrValA / 1, 8, 13, 5, 2, 9, 14, 11, 6, 3, 10, 12, 7, 4/
data csrRowPtrA / 1, 4, 8, 12, 15/ 
data csrColIndA / 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4/
data y /1,2,3,4/
istat = cusparseDcsrsv2_solve(handle5, transA, m, nnz, alpha, descrA, csrValA, csrRowPtrA, csrColIndA, info, x, y, policy, pBuffer)

!$acc update host(x)
do i=1, 4
	write(*,*)x(i)
enddo

pause

!=========================The test Region1=========================!

!$acc end data

end subroutine ITERATION_CDM_CUSPARSE_LU

Hi zhaoqi_326326,

The Fortran CUDA Interfaces guide simply provides the documentation on the Fortran interfaces to the CUDA libraries. It’s used in combination with the cuSPARSE documentation, which provides information about what each routine does.

Given “cusparseDcsrsv2_solve” was deprecated in the CUDA 11 version of cuSPARSE and removed in the CUDA 12 version, you may consider moving to the cuSPARSE Generic API instead.

As the Fortran interface guide can be used with multiple versions of cuSPARSE, the interface to “cusparseDcsrsv2_solve” is still available provided you link with CUDA 11.8 or earlier (i.e. add the flag “-gpu=cuda11.8”).
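For example, a compile line along these lines should work (a sketch; “myprog.f90” is a placeholder, and “-cudalib=cusparse” links the cuSPARSE library):

nvfortran -acc -cuda -gpu=cuda11.8 -cudalib=cusparse myprog.f90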

For the errors you’re getting, I can’t be sure since the example code is incomplete due to the missing modules, so I can’t compile it. But it’s likely because you’re passing host data for the variables that are expecting device data (those with the “device” attribute as shown in the interface guide).

To pass device data with OpenACC-managed data, you’ll want to use the “host_data use_device” directive, which says to use the device copy of the data within host code. Something like:

!$acc host_data use_device(alpha, csrValA, csrRowPtrA, csrColIndA, x, y, pBuffer)
istat = cusparseDcsrsv2_solve(handle5, transA, m, nnz, alpha, descrA, csrValA, csrRowPtrA, csrColIndA, info, x, y, policy, pBuffer)
!$acc end host_data

Again, since I can’t compile your code, I haven’t tested this, so I could be off a bit, but hopefully it points you in the right direction.

-Mat

Thanks for your suggestions, I will reference the section “Chapter 14. cuSPARSE Generic API Reference” instead of the function Dcsrsv2_solve. But I have a question about the point we are discussing: must the variables, like alpha, csrValA, csrRowPtrA, csrColIndA and pBuffer, be on the device? With OpenACC, I usually define the variables on the host, and then the “copyin”, “copy”, or “create” clauses transfer them from the host to the device. For example, I have successfully applied the function “cusparseDgpsvInterleavedBatch” to solve penta-diagonal systems, as follows:

INTEGER(KIND=4):: m_test,batchcount_test
REAL(KIND=8)::A(2,4,-1:4+2),B(2,4,4),A_test(8),B_test(8),C_test(8),D_test(8),E_test(8),F_test(8)

write(*,*) "*******************This Is Subroutine Iteration*******************************"


istat = cusparseCreate(handle4)
istat = cusparseSetStream(handle4,acc_get_cuda_stream(22)) 
m_test=4;batchcount_test=2

!$acc data create(A_test,B_test,C_test,D_test,E_test,F_test)

istat = cusparseDgpsvInterleavedBatch_bufferSizeExt(handle4,algo,m_test,A_test,B_test,C_test,D_test,E_test,F_test,batchcount_test,bufferSizeInBytes_test)
ALLOCATE(buffer_test(bufferSizeInBytes_test ) )
!$acc end data

!$acc data copyin(A_test,B_test,C_test,D_test,E_test,F_test)

!=========================The test Region=========================!
! float h_S = {0, 0, 11, 12, 0, 0, 25, 26};
! float h_L = {0, 5, 6, 7, 0, 19, 20, 21};
! float h_M = {1, 2, 3, 4, 15, 16, 17, 18};
! float h_U = {8, 9, 10, 0, 22, 23, 24, 0};
! float h_W = {13, 14, 0, 0, 27, 28, 0, 0};
! float h_B = {1, 2, 3, 4, 5, 6, 7, 8};
A(1,1,-1)=0; A(1,1,0 )=0; A(1,1,1)=1; A(1,1,2)=8; A(1,1,3)=13; B(1,1,1)=1
A(1,2,0 )=0; A(1,2,1)=5; A(1,2,2)=2; A(1,2,3)=9; A(1,2,4)=14; B(1,2,2)=2
A(1,3,1 )=11; A(1,3,2)=6; A(1,3,3)=3; A(1,3,4)=10;A(1,3,5)=0; B(1,3,3)=3
A(1,4,2 )=12; A(1,4,3)=7; A(1,4,4)=4; A(1,4,5)=0; A(1,4,6)=0; B(1,4,4)=4

A(2,1,-1)=0;  A(2,1,0 )=0;  A(2,1,1)=15; A(2,1,2)=22; A(2,1,3)=27; B(2,1,1)=5
A(2,2,0 )=0;  A(2,2,1)=19;  A(2,2,2)=16; A(2,2,3)=23; A(2,2,4)=28; B(2,2,2)=6
A(2,3,1 )=25; A(2,3,2)=20;  A(2,3,3)=17; A(2,3,4)=24; A(2,3,5)=0;  B(2,3,3)=7
A(2,4,2 )=26; A(2,4,3)=21;  A(2,4,4)=18; A(2,4,5)=0;  A(2,4,6)=0;  B(2,4,4)=8

do k=1,4
	do I=1,2

		Coeff_AZ_1_Num = i
		Coeff_EX_1_Num=2*(k-1)+i
		
		A_test(Coeff_EX_1_Num) = A(Coeff_AZ_1_Num, k, k-2)
		B_test(Coeff_EX_1_Num) = A(Coeff_AZ_1_Num, k, k-1)
		C_test(Coeff_EX_1_Num) = A(Coeff_AZ_1_Num, k, k+0)
		D_test(Coeff_EX_1_Num) = A(Coeff_AZ_1_Num, k, k+1)
		E_test(Coeff_EX_1_Num) = A(Coeff_AZ_1_Num, k, k+2)
		F_test(Coeff_EX_1_Num) = B(Coeff_AZ_1_Num, k,k)		

	enddo
enddo
write(*,"(8f)")A_test(1),A_test(2),A_test(3),A_test(4),A_test(5),A_test(6),A_test(7),A_test(8)
write(*,"(8f)")B_test(1),B_test(2),B_test(3),B_test(4),B_test(5),B_test(6),B_test(7),B_test(8)
write(*,"(8f)")C_test(1),C_test(2),C_test(3),C_test(4),C_test(5),C_test(6),C_test(7),C_test(8)
write(*,"(8f)")D_test(1),D_test(2),D_test(3),D_test(4),D_test(5),D_test(6),D_test(7),D_test(8)
write(*,"(8f)")E_test(1),E_test(2),E_test(3),E_test(4),E_test(5),E_test(6),E_test(7),E_test(8)
write(*,"(8f)")F_test(1),F_test(2),F_test(3),F_test(4),F_test(5),F_test(6),F_test(7),F_test(8)

! data A_test / 0, 0, 0, 0,11,25,12,26/
! data B_test / 0, 0, 5,19, 6,20, 7,21/
! data C_test / 1,15, 2,16, 3,17, 4,18/
! data D_test / 8,22, 9,23,10,24, 0, 0/
! data E_test /13,27,14,28, 0, 0, 0, 0/
! data F_test / 1, 5, 2, 6, 3, 7, 4, 8/

!$acc update device(A_test,B_test,C_test,D_test,E_test,F_test)

istat = cusparseDgpsvInterleavedBatch(handle4,algo,m_test,A_test,B_test,C_test,D_test,E_test,F_test,batchcount_test,buffer_test)
!$acc update host(F_test)
do i=1, 8
	write(*,*)F_test(i)
enddo
pause

!=========================The test Region=========================!

I use copyin to handle the variables I use on the device. Of course, you provided a nice example on the website “CUDALibrarySamples/cuSPARSE/gpsvInterleavedBatch at master · NVIDIA/CUDALibrarySamples · GitHub”.
While we are at it, I will mention some problems I met along the way. In the guideline, you say that the data layout is the same as gtsvStridedBatch. However, I tested it and found that the data layout is actually the same as that of the function “cusparseDgtsvInterleavedBatch”, whose documentation says “The data layout is different from gtsvStridedBatch which aggregates all matrices one after another”.
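In other words, for batchcount_test systems of size m_test, the entry for row i of system j is stored at index (i-1)*batchcount_test + j. This is a sketch of my understanding (diag is a hypothetical array holding the main diagonal of each system):

do i = 1, m_test
	do j = 1, batchcount_test
		! the row-i entries of all systems are stored consecutively (interleaved),
		! rather than one whole matrix after another as in gtsvStridedBatch
		C_test((i-1)*batchcount_test + j) = diag(j, i)
	enddo
enddo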

I found the function I need; maybe it is “cusparseSpSV()”. The routine supports arbitrary sparsity for the input matrix, but only the upper or lower triangular part is taken into account in the computation. Does this mean that I cannot input a general sparse matrix directly, and that I need an LU factorization from other functions first? I am confused by the following:

‣ Set matrix data layouts, number of batches, and storage formats (for example, CSR, COO, and so on)
‣ Set input/output/compute data types. This also allows mixed data-type computation
‣ Set types of sparse matrix indices
‣ Choose the algorithm for the computation
‣ Provide external device memory for internal operations
‣ Provide extensive consistency checks across input matrices and vectors for a given routine. This includes the validation of matrix sizes, data types, layout, allowed operations, etc.
‣ Provide constant descriptors for vector and matrix inputs to support const-safe interface and guarantee that the APIs do not modify their inputs.

I really hope you can give me a workflow or an example that can guide me to solve a linear system of equations.

I’m not sure these examples have what you want, and they are written in CUDA C so you’d need to translate them, but take a look through the CUDA Libraries examples at: cuda-samples/Samples/4_CUDA_Libraries at master · NVIDIA/cuda-samples · GitHub

There’s also the CUDA Libraries Forum if you have specific questions: GPU-Accelerated Libraries - NVIDIA Developer Forums

Thank you for your help again.

I achieved the solution of the linear system of equations; we can obtain the answers [1,2,3,4] with the following code:

! !=========================The initial parameters=========================!
	use openacc
	use cudafor
	USE CUSPARSE
	IMPLICIT NONE

	type(cusparseHandle):: handle
	type(cusparseSpMatDescr):: matA
	type(cusparseDnVecDescr):: vecX, vecY
	type(cusparseSpSVDescr):: spsvDescr
	INTEGER(KIND=8)::rows 				!Number of rows of the sparse matrix
	INTEGER(KIND=8)::cols 				!Number of columns of the sparse matrix
	INTEGER(KIND=8)::nnz  				!Number of non-zero entries of the sparse matrix
	INTEGER(KIND=4), DIMENSION(:), ALLOCATABLE :: csrRowOffsets !Row offsets of the sparse matrix. Array of size rows + 1
	INTEGER(KIND=4), DIMENSION(:), ALLOCATABLE :: csrColInd		!Column indices of the sparse matrix. Array of size nnz
	REAL(KIND=4), DIMENSION(:), ALLOCATABLE :: csrValues		!Values of the sparse matrix. Array of size nnz
	REAL(KIND=4), DIMENSION(:), ALLOCATABLE :: dX				!Values of the dense vector. Array of size cols
	REAL(KIND=4), DIMENSION(:), ALLOCATABLE :: dY				!Values of the dense vector. Array of size rows
	INTEGER(KIND=8)::pBuffer
	INTEGER(KIND=4), DIMENSION(:), ALLOCATABLE :: buffer
	! REAL(KIND=4), pointer :: buffer(:)
	type(c_devptr) :: bufferPtr
	REAL(KIND=4)::alpha
	INTEGER(KIND=4):: istat,batchCount,i
	! TYPE(cusparseMatDescr):: descrA

	! istat = cusparseCreateMatDescr(descrA)
	! istat = cusparseSetMatType(descrA,CUSPARSE_MATRIX_TYPE_GENERAL)
	! istat = cusparseSetMatIndexBase(descrA,CUSPARSE_INDEX_BASE_ONE)
	
	rows = 4
	cols = 4
	nnz  = 9
	batchCount=1
	alpha=1
	ALLOCATE(csrRowOffsets(rows+1),csrColInd(nnz),csrValues(nnz),dX(cols),dY(rows))
	csrRowOffsets	= (/0, 3, 4, 7, 9/)
	csrColInd 	 	= (/0, 2, 3, 1, 0, 2, 3, 1, 3/)
	csrValues 		= (/1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0/)
	dX				= (/1.0, 8.0, 23.0, 52.0/)
	dY				= (/0.0, 0.0, 0.0, 100.0/)
	ALLOCATE(buffer(1344))

    !$acc data copyin(csrRowOffsets,csrColInd,csrValues,dX,dY,alpha,buffer)
	
	istat = cusparseCreate(handle)
	istat = cusparseSetStream(handle,acc_get_cuda_stream(33)) 

	istat = cusparseCreateCsr(matA,rows,cols,nnz,csrRowOffsets,csrColInd,csrValues,CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I,CUSPARSE_INDEX_BASE_ZERO,CUDA_R_32F)
	istat = cusparseCreateDnVec(vecX, cols, dX, CUDA_R_32F)
	istat = cusparseCreateDnVec(vecY, rows, dY, CUDA_R_32F)
	istat = cusparseSpSV_createDescr(spsvDescr)
	istat = cusparseSpMatSetAttribute(matA,CUSPARSE_SPMAT_FILL_MODE,CUSPARSE_FILL_MODE_LOWER,c_sizeof(CUSPARSE_SPMAT_FILL_MODE))
	istat = cusparseSpMatSetAttribute(matA,CUSPARSE_SPMAT_DIAG_TYPE,CUSPARSE_DIAG_TYPE_NON_UNIT,c_sizeof(CUSPARSE_SPMAT_DIAG_TYPE))
	istat = cusparseSpSV_bufferSize(handle,CUSPARSE_OPERATION_NON_TRANSPOSE,alpha,matA,vecX,vecY,CUDA_R_32F,CUSPARSE_SPSV_ALG_DEFAULT,spsvDescr,pBuffer)

	! ALLOCATE(buffer(pBuffer))
	! buffer = acc_malloc(1344)
! !=========================The test Region1=========================!
	
	istat = cusparseSpSV_analysis(handle,CUSPARSE_OPERATION_NON_TRANSPOSE,alpha,matA,vecX,vecY,CUDA_R_32F,CUSPARSE_SPSV_ALG_DEFAULT,spsvDescr,buffer)
	istat = cusparseSpSV_solve(handle,CUSPARSE_OPERATION_NON_TRANSPOSE,alpha,matA,vecX,vecY,CUDA_R_32F,CUSPARSE_SPSV_ALG_DEFAULT,spsvDescr)

	istat = cusparseDestroySpMat(matA)
    istat = cusparseDestroyDnVec(vecX)
    istat = cusparseDestroyDnVec(vecY)
    istat = cusparseSpSV_destroyDescr(spsvDescr)
    istat = cusparseDestroy(handle)

	!$acc update host(dX,dY)

	!$acc end data
	do i=1, 4
		write(*,*)dX(i),dY(i)
	enddo
!=========================The test Region1=========================!

However, we can see some unreasonable code, such as “ALLOCATE(buffer(1344))”. The 1344 is the size of the workspace, which is calculated by “cusparseSpSV_bufferSize”. So it should be placed after “cusparseSpSV_bufferSize”, i.e. the commented-out “ALLOCATE(buffer(pBuffer))”. But between “!$acc data copyin” and “!$acc end data”, I cannot transfer the array “buffer” from the host to the device. I find that the “acc_malloc” function can return a device pointer, in a variable of type(c_devptr), to newly allocated memory on the device, according to the OpenACC Getting Started Guide Version 24.7 for ARM, OpenPower, x86 (nvidia.com). However, it provides little information beyond “type(c_devptr) function acc_malloc (bytes)”. Could you provide more help?

Thanks.

Data regions can be nested, so you can add another region after “buffer” has been allocated with the known size.
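A sketch of that approach with your variable names (untested; handle and descriptor setup as in your code):

!$acc data copyin(csrRowOffsets, csrColInd, csrValues, dX, dY, alpha)

istat = cusparseSpSV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, vecX, vecY, CUDA_R_32F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, pBuffer)
ALLOCATE(buffer(pBuffer))

! inner data region, created once the workspace size is known
!$acc data create(buffer)
istat = cusparseSpSV_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, vecX, vecY, CUDA_R_32F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, buffer)
istat = cusparseSpSV_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, vecX, vecY, CUDA_R_32F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr)
!$acc end data

!$acc end data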

Alternatively, you can use the CUDA Fortran “device” attribute on “buffer” to make it a device-only array. If you do this, then add the “-cuda” flag to enable CUDA Fortran.
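For example (untested; note that cusparseSpSV_bufferSize returns a size in bytes, so a 1-byte element type makes the element count match exactly):

! device-only work buffer; requires compiling with -cuda
INTEGER(1), DIMENSION(:), ALLOCATABLE, DEVICE :: buffer

istat = cusparseSpSV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, vecX, vecY, CUDA_R_32F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, pBuffer)
ALLOCATE(buffer(pBuffer))	! ALLOCATE on a device array allocates GPU memory directly
istat = cusparseSpSV_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, alpha, matA, vecX, vecY, CUDA_R_32F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, buffer)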

Personally, I would avoid using acc_malloc here since it returns a C-style pointer, which you then need to convert into a Fortran pointer array using the C_F_POINTER intrinsic from iso_c_binding. Possible, but the two previous methods are simpler.
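If you did want to go that route, it would look something like this (untested sketch; cudafor provides a c_f_pointer overload that accepts a type(c_devptr) and a device pointer array):

INTEGER(1), DEVICE, POINTER :: buffer(:)
TYPE(c_devptr) :: bufferPtr

bufferPtr = acc_malloc(pBuffer)			! pBuffer is the byte count from cusparseSpSV_bufferSize
call c_f_pointer(bufferPtr, buffer, [pBuffer])	! wrap the device memory as a Fortran device pointer
! ... pass "buffer" to cusparseSpSV_analysis as before ...
call acc_free(bufferPtr)			! free the device memory when done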

Thanks for your help. That addresses the problem I was confused about.

  • Zhao

Now, I am applying this function to my algorithm. In my algorithm, I use the “kernels” directive for the parallel calculation. There are three loops in a kernels region, where the innermost loop only generates the matrix and therefore must be marked “!$acc loop seq”. So I should put the SOLVER after it, i.e. after the innermost loop. The number of linear systems of equations is then NX*NY (shown in the following code), and they should be solved in parallel.

istat = cusparseCreate(handle)
istat = cusparseSetStream(handle,acc_get_cuda_stream(33))

!============= Create dense matrix A in CSR format =============
istat = cusparseCreateDnMat(EX_1_MATA,EX_1_ROW,EX_1_COL,EX_1_LD,EX_1_DENSE,CUDA_R_32F,CUSPARSE_ORDER_ROW)
!============= Create sparse matrix B in CSR format =============
istat = cusparseCreateCsr(EX_1_MATB,EX_1_ROW,EX_1_COL,0,csrRowOffsets,0,0,CUSPARSE_INDEX_32I,CUSPARSE_INDEX_32I,CUSPARSE_INDEX_BASE_ONE,CUDA_R_32F)
!============= Allocate an external buffer if needed =============
istat = cusparseDenseToSparse_bufferSize(handle, EX_1_MATA, EX_1_MATB, CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, EX_1_pBuffer)
ALLOCATE(EX_1_buffer(EX_1_pBuffer))

!$acc kernels async(33)
!$acc loop collapse(2)
	do I=1,NX
		do J=1,NY
			!$acc loop seq
			do k=2,NZB-1

				!Obtaining the EX_1_MATA and EX_1_MATA
				
			enddo

			!============= Execute Analysis =============
			istat = cusparseDenseToSparse_analysis(handle, EX_1_MATA, EX_1_MATB, CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, EX_1_pBuffer)
			!============= Get number of non-zero elements =============
			istat = cusparseSpMatGetSize( EX_1_MATB,EX_1_ROW,EX_1_COL, EX_1_NNZ)
			allocate( csrColInd(EX_1_NNZ),csrValues(EX_1_NNZ))
			!============= Allocate CSR column indices and values =============
			istat = cusparseCsrSetPointers(EX_1_MATB, csrRowOffsets,csrColInd,csrValues)
			!============= Execute Sparse to Dense conversion =============
			istat = cusparseDenseToSparse_convert(handle, EX_1_MATA, EX_1_MATB,CUSPARSE_DENSETOSPARSE_ALG_DEFAULT,EX_1_pBuffer)

			!============= Create dense vector X and Y =============
			istat = cusparseCreateDnVec(EX_1_VECX, EX_1_COL, EX_1_dX, CUDA_R_32F)
			istat = cusparseCreateDnVec(EX_1_VECY, EX_1_ROW, EX_1_dY, CUDA_R_32F)
			!============= Create opaque data structure, that holds analysis data between calls =============
			istat = cusparseSpSV_createDescr(spsvDescr)
			!============= Specify Lower|Upper fill mode =============
			istat = cusparseSpMatSetAttribute(EX_1_MATB,CUSPARSE_SPMAT_FILL_MODE,CUSPARSE_FILL_MODE_LOWER,c_sizeof(CUSPARSE_SPMAT_FILL_MODE))
			!============= Specify Unit|Non-Unit diagonal type =============
			istat = cusparseSpMatSetAttribute(EX_1_MATB,CUSPARSE_SPMAT_DIAG_TYPE,CUSPARSE_DIAG_TYPE_NON_UNIT,c_sizeof(CUSPARSE_SPMAT_DIAG_TYPE))
			!============= Allocate an external buffer for analysis =============
			istat = cusparseSpSV_bufferSize(handle,CUSPARSE_OPERATION_NON_TRANSPOSE,alpha,EX_1_MATB,EX_1_VECX,EX_1_VECY,CUDA_R_32F,CUSPARSE_SPSV_ALG_DEFAULT,spsvDescr,EX_1_pBuffer)
			
			!============= Execute analysis and SpSV =============
			istat = cusparseSpSV_analysis(handle,CUSPARSE_OPERATION_NON_TRANSPOSE,alpha,EX_1_MATB,EX_1_VECX,EX_1_VECY,CUDA_R_32F,CUSPARSE_SPSV_ALG_DEFAULT,spsvDescr,EX_1_buffer)
			istat = cusparseSpSV_solve(handle,CUSPARSE_OPERATION_NON_TRANSPOSE,alpha,EX_1_MATB,EX_1_VECX,EX_1_VECY,CUDA_R_32F,CUSPARSE_SPSV_ALG_DEFAULT,spsvDescr)
			!============= Destroy matrix/vector descriptors =============

			!$acc loop seq
			do k=2,NZB-1
				EX_1(I,J,K) = EX_1_dY(k-1)
			enddo
		enddo
	enddo
    
	!$acc end kernels

However, the compiler says: NVFORTRAN-S-1061-Procedures called in a compute region must have acc routine information - cusparsedensetosparse_analysis. Maybe that means I cannot put the “analysis” inside the kernels region.
Can I first generate all the matrices and then solve them with a SOLVER, such as the function cusparseSgpsvInterleavedBatch? I didn’t find the examples I need at CUDALibrarySamples/cuSPARSE at master · NVIDIA/CUDALibrarySamples (github.com).

cuSPARSE routines can only be called from the host. You can’t call them from within OpenACC compute regions.
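The general restructuring would look something like this (just a sketch using your names; note you’d need storage for all NX*NY systems rather than a single EX_1_DENSE):

! 1) build all the systems on the device first
!$acc kernels async(33)
!$acc loop collapse(2)
do I=1,NX
	do J=1,NY
		!$acc loop seq
		do k=2,NZB-1
			! fill the dense matrix and right-hand side for system (I,J)
		enddo
	enddo
enddo
!$acc end kernels
!$acc wait(33)

! 2) then loop on the host, making the cuSPARSE calls for each system
do I=1,NX
	do J=1,NY
		istat = cusparseDenseToSparse_analysis(handle, EX_1_MATA, EX_1_MATB, CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, EX_1_buffer)
		! ... remaining conversion, analysis, and SpSV calls as in your code ...
	enddo
enddo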

If I need to solve several linear systems of equations in parallel, how should I do it? Can I use the function cusparseSpSV_solve?

You can use cusparseSpSV_solve, you just need to call it from host code.

My idea is that this function only solves the systems one by one, but I need to solve several systems at once.