Solving multiple linear systems using the function cusolverSpDcsrqrsvBatched

Hello,

I am trying to solve multiple linear systems using the function cusolverSpDcsrqrsvBatched, but I get an error when I compile the code with Fortran.
My code is:

use openacc
! use cusolverDn
use cusolverSp
use CUDAFOR
IMPLICIT NONE

The error message is: NVFORTRAN-F-0004-Unable to open MODULE file cusolversp.mod

I get this error even if I don’t write any other code, such as declarations (e.g., REAL) or function calls.

Hi SkyCool,

We don’t provide an interface module for cuSolverSP, just cuSolverDN and cuSolverMP.

Please see NVIDIA Fortran CUDA Library Interfaces Version 24.7 for ARM, OpenPower, x86. The same guide might be helpful if you want to write your own interface.
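For reference, a hand-written interface for the batched sparse QR solve might look something like the sketch below. This is only a sketch based on the C prototypes in cusolverSp.h and assumes CUDA Fortran device arrays; the module name, argument kinds, and attributes are my assumptions and should be checked against the header and the guide above. The setup routines (cusolverSpCreateCsrqrInfo, cusolverSpXcsrqrAnalysisBatched, cusolverSpDcsrqrBufferInfoBatched) would need interfaces in the same style.

module cusolversp_iface
   use iso_c_binding
   implicit none

   interface

      ! cusolverStatus_t cusolverSpCreate(cusolverSpHandle_t *handle)
      integer(c_int) function cusolverSpCreate(handle) &
            bind(C, name='cusolverSpCreate')
         import c_int, c_ptr
         type(c_ptr) :: handle   ! opaque cusolverSpHandle_t, set by the library
      end function

      ! cusolverStatus_t cusolverSpDcsrqrsvBatched(handle, m, n, nnzA, descrA,
      !     csrValA, csrRowPtrA, csrColIndA, b, x, batchSize, info, pBuffer)
      integer(c_int) function cusolverSpDcsrqrsvBatched(handle, m, n, nnzA,   &
            descrA, csrValA, csrRowPtrA, csrColIndA, b, x, batchSize, info,   &
            pBuffer) bind(C, name='cusolverSpDcsrqrsvBatched')
         import c_int, c_ptr, c_double
         type(c_ptr), value        :: handle             ! cusolverSpHandle_t
         integer(c_int), value     :: m, n, nnzA, batchSize
         type(c_ptr), value        :: descrA             ! cusparseMatDescr_t
         real(c_double), device    :: csrValA(*), b(*), x(*)
         integer(c_int), device    :: csrRowPtrA(*), csrColIndA(*)
         type(c_ptr), value        :: info               ! csrqrInfo_t
         integer(1), device        :: pBuffer(*)         ! device workspace
      end function

   end interface
end module cusolversp_iface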

-Mat

That’s really bad news for me. If I use the cuSolverDN functions, I have to assemble one large linear system, and my GPU doesn’t have enough memory for that.

My linear systems are all sparse. Can I factor the coefficient matrices in parallel using some function, such as an LU factorization? Then I would compute the solutions manually with the steps A = LU, Ly = b, and Ux = y. Of course, I would also need functions that perform those triangular solves in parallel.

Thanks in advance.

Again, it’s not that you can’t use cuSolverSP, it’s just that you’ll need to write your own interface. For better or worse, cuSolverSP isn’t as popular as the others.

Can I factor the coefficient matrices in parallel using some function, such as an LU factorization?

That’s outside my area of expertise, so I won’t be much help here. I’d suggest asking this on the cuSolver forum or consulting the cuSolver documentation.

I noticed the function cusolverDnSetStream. Can I use multiple CUDA streams to solve the linear systems in parallel? Here is my code:
! create one handle per (I,J) system and bind it to its own OpenACC queue
do n=1, NX*(NYB-2)

	istat = cusolverDnCreate(handle(n))
	istat = cusolverDnSetStream(handle(n), acc_get_cuda_stream(n))
	istat = cusolverDnCreateParams(param(n))
	istat = cusolverDnSetAdvOptions(param(n), CUSOLVERDN_GETRF, CUSOLVER_ALG_0)
end do
! query the getrf workspace sizes once (every system has the same size)
! and allocate the device/host work buffers
istat = cusolverDnXgetrf_bufferSize( handle(1), param(1), NZ-1, NZ-1,            &
        cudaDataType(CUDA_R_64F), MatVal_EX_1, NZ-1, cudaDataType(CUDA_R_64F),   &
        workspaceInBytesOnDevice_EX_1, workspaceInBytesOnHost_EX_1 )
ALLOCATE( bufferOnDevice_EX_1(workspaceInBytesOnDevice_EX_1), &
          bufferOnHost_EX_1(workspaceInBytesOnHost_EX_1) )


do I=1,NX
	do J=2,NYB-1

		! assemble the (I,J) system asynchronously on its own queue
		!$acc kernels async( (i-1)*(NYB-2)+j-1 )
		AZ_1 = 0

		!$acc loop independent private(Collabel)
		do k=2,NZB-1

			! build row k-1 of AZ_1 and copy it into the flat buffer
			! passed to cuSOLVER
			AZ_1(k-1, k-1) = parameter1
			!$acc loop seq
			do Collabel=1, NZB-2
				MatVal_EX_1( (NZ-1)*(k-2) + Collabel ) = AZ_1(k-1, Collabel)
			enddo

			RHSVal_EX_1(k-1) = parameter2

		enddo
		!$acc end kernels

		! factor the system and solve it on this queue's stream
		istat = cusolverDnXgetrf( handle((i-1)*(NYB-2)+j-1), param((i-1)*(NYB-2)+j-1),    &
		        ROW_EX_1, COL_EX_1, cudaDataType(CUDA_R_64F), MatVal_EX_1, LDA_EX_1,      &
		        ipiv_EX_1, cudaDataType(CUDA_R_64F), bufferOnDevice_EX_1,                 &
		        workspaceInBytesOnDevice_EX_1, bufferOnHost_EX_1,                         &
		        workspaceInBytesOnHost_EX_1, devinfoX )
		istat = cusolverDnXgetrs( handle((i-1)*(NYB-2)+j-1), param((i-1)*(NYB-2)+j-1),    &
		        CUBLAS_OP_T, ROW_EX_1, 1, cudaDataType(CUDA_R_64F), MatVal_EX_1,          &
		        LDA_EX_1, ipiv_EX_1, cudaDataType(CUDA_R_64F), RHSVal_EX_1, LDB_EX_1,     &
		        devinfoX )

		! copy the solution for this (I,J) system back, on the same queue
		!$acc kernels async( (i-1)*(NYB-2)+j-1 )
		!$acc loop
		do k=2,NZB-1
			EX_1(I,J,k) = RHSVal_EX_1(k-1)
		enddo
		!$acc end kernels

	enddo
enddo

Unfortunately, these calculations produce incorrect results and require a large amount of memory.

these calculations produce incorrect results

That might be due to RHSVal_EX_1 and bufferOnDevice_EX_1 being shared across all of the streams, so the concurrent factorizations and solves overwrite each other's data.
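For example (a rough sketch; nqueues and the extra array dimension are placeholders, and the arrays should carry whatever device/managed attributes your originals have), give each queue its own right-hand side and workspace and index them by queue number:

! one RHS vector and one device workspace per queue, so concurrent
! factorizations/solves never touch the same memory
real(8),    allocatable :: RHSVal_EX_1(:,:)          ! (NZ-1, nqueues)
integer(1), allocatable :: bufferOnDevice_EX_1(:,:)  ! (bytes, nqueues)

allocate( RHSVal_EX_1(NZ-1, nqueues) )
allocate( bufferOnDevice_EX_1(workspaceInBytesOnDevice_EX_1, nqueues) )

! then pass RHSVal_EX_1(:,q) and bufferOnDevice_EX_1(:,q) to the
! cusolverDnXgetrf / cusolverDnXgetrs calls issued with handle(q);
! MatVal_EX_1 and ipiv_EX_1 would need the same per-queue treatment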

require a large amount of memory.

Not unexpected. Consider reducing the number of handles down to the number of kernels that can actually execute concurrently on the device, then process the systems in batches, reusing the handles.
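To illustrate, here is a rough sketch of that batching pattern. NBATCH and nsys are placeholders, not values from your code, and the assembly and cusolverDnXgetrf/cusolverDnXgetrs calls are the same ones you show above, just indexed by queue number:

program batch_sketch
   use openacc
   use cusolverDn
   implicit none
   integer, parameter :: NBATCH = 8       ! roughly what the GPU can overlap
   type(cusolverDnHandle) :: handle(NBATCH)
   integer :: istat, n, q, first, nsys

   ! create a small, fixed pool of handles, one per OpenACC queue/stream
   do n = 1, NBATCH
      istat = cusolverDnCreate(handle(n))
      istat = cusolverDnSetStream(handle(n), acc_get_cuda_stream(n))
   end do

   nsys = 1000                            ! total number of (I,J) systems
   do first = 1, nsys, NBATCH
      ! launch up to NBATCH systems, each on its own queue with its own
      ! per-queue matrix/RHS/workspace buffers
      do n = first, min(first + NBATCH - 1, nsys)
         q = n - first + 1
         ! assemble system n with "!$acc kernels async(q)", then call
         ! cusolverDnXgetrf / cusolverDnXgetrs with handle(q)/param(q)
         ! and the q-th slice of the work buffers
      end do
      ! wait for the whole batch to finish before the handles and buffers
      ! are reused by the next batch
      do q = 1, min(NBATCH, nsys - first + 1)
         !$acc wait(q)
      end do
   end do

   do n = 1, NBATCH
      istat = cusolverDnDestroy(handle(n))
   end do
end program batch_sketch

This way the handle count and the work-buffer memory scale with NBATCH rather than with the total number of systems.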