Solving multiple linear systems using the function cusolverSpDcsrqrsvBatched

Hello,

I am trying to solve multiple linear systems using the function cusolverSpDcsrqrsvBatched, but I get an error when I compile the code with Fortran.
My code is:

use openacc
! use cusolverDn
use cusolverSp
use CUDAFOR
IMPLICIT NONE

The error message is: NVFORTRAN-F-0004-Unable to open MODULE file cusolversp.mod

I get this error even if I don’t write any other code, such as declarations (e.g., REAL) or function calls.

Hi SkyCool,

We don’t provide an interface module for cuSolverSP, just cuSolverDN and cuSolverMP.

Please see NVIDIA Fortran CUDA Library Interfaces Version 24.7 for ARM, OpenPower, x86. The same guide might be helpful if you want to write your own interface.
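For reference, a hand-written interface for the batched sparse QR solve might look something like the sketch below. This is only a sketch based on the C prototypes in cusolverSp.h and assumes CUDA Fortran device arrays; the module name, argument kinds, and attributes are my assumptions and should be checked against the header and the guide above. The setup routines (cusolverSpCreateCsrqrInfo, cusolverSpXcsrqrAnalysisBatched, cusolverSpDcsrqrBufferInfoBatched) would need interfaces in the same style.

module cusolversp_iface
   use iso_c_binding
   implicit none

   interface

      ! cusolverStatus_t cusolverSpCreate(cusolverSpHandle_t *handle)
      integer(c_int) function cusolverSpCreate(handle) &
            bind(C, name='cusolverSpCreate')
         import c_int, c_ptr
         type(c_ptr) :: handle   ! opaque cusolverSpHandle_t, set by the library
      end function

      ! cusolverStatus_t cusolverSpDcsrqrsvBatched(handle, m, n, nnzA, descrA,
      !     csrValA, csrRowPtrA, csrColIndA, b, x, batchSize, info, pBuffer)
      integer(c_int) function cusolverSpDcsrqrsvBatched(handle, m, n, nnzA,   &
            descrA, csrValA, csrRowPtrA, csrColIndA, b, x, batchSize, info,   &
            pBuffer) bind(C, name='cusolverSpDcsrqrsvBatched')
         import c_int, c_ptr, c_double
         type(c_ptr), value        :: handle             ! cusolverSpHandle_t
         integer(c_int), value     :: m, n, nnzA, batchSize
         type(c_ptr), value        :: descrA             ! cusparseMatDescr_t
         real(c_double), device    :: csrValA(*), b(*), x(*)
         integer(c_int), device    :: csrRowPtrA(*), csrColIndA(*)
         type(c_ptr), value        :: info               ! csrqrInfo_t
         integer(1), device        :: pBuffer(*)         ! device workspace
      end function

   end interface
end module cusolversp_iface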

-Mat

That’s really bad news for me. If I use the cuSolverDN functions, I have to assemble one large linear system, and my GPU doesn’t have enough memory for that.

My linear systems are all sparse. Can I factor the coefficient matrices in parallel using some function, such as an LU factorization? Then I would compute the solutions manually with the steps A = LU, Ly = b, and Ux = y. Of course, I would also need functions that perform those triangular solves in parallel.

Thanks in advance.

Again, it’s not that you can’t use cuSolverSP, it’s just that you’ll need to write your own interface. For better or worse, cuSolverSP isn’t as popular as the others.

Can I factor the coefficient matrices in parallel using some function, such as an LU factorization?

That’s outside my area of expertise, so I won’t be much help here. I’d suggest asking this on the cuSolver forum or consulting the cuSolver documentation.

I noticed the function cusolverDnSetStream. Can I use multiple CUDA streams to solve the linear systems in parallel? Here is my code:
! create one handle per (I,J) system and bind it to its own OpenACC queue
do n=1, NX*(NYB-2)

	istat = cusolverDnCreate(handle(n))
	istat = cusolverDnSetStream(handle(n), acc_get_cuda_stream(n))
	istat = cusolverDnCreateParams(param(n))
	istat = cusolverDnSetAdvOptions(param(n), CUSOLVERDN_GETRF, CUSOLVER_ALG_0)
end do
! query the getrf workspace sizes once (every system has the same size)
! and allocate the device/host work buffers
istat = cusolverDnXgetrf_bufferSize( handle(1), param(1), NZ-1, NZ-1,            &
        cudaDataType(CUDA_R_64F), MatVal_EX_1, NZ-1, cudaDataType(CUDA_R_64F),   &
        workspaceInBytesOnDevice_EX_1, workspaceInBytesOnHost_EX_1 )
ALLOCATE( bufferOnDevice_EX_1(workspaceInBytesOnDevice_EX_1), &
          bufferOnHost_EX_1(workspaceInBytesOnHost_EX_1) )


do I=1,NX
	do J=2,NYB-1

		! assemble the (I,J) system asynchronously on its own queue
		!$acc kernels async( (i-1)*(NYB-2)+j-1 )
		AZ_1 = 0

		!$acc loop independent private(Collabel)
		do k=2,NZB-1

			! build row k-1 of AZ_1 and copy it into the flat buffer
			! passed to cuSOLVER
			AZ_1(k-1, k-1) = parameter1
			!$acc loop seq
			do Collabel=1, NZB-2
				MatVal_EX_1( (NZ-1)*(k-2) + Collabel ) = AZ_1(k-1, Collabel)
			enddo

			RHSVal_EX_1(k-1) = parameter2

		enddo
		!$acc end kernels

		! factor the system and solve it on this queue's stream
		istat = cusolverDnXgetrf( handle((i-1)*(NYB-2)+j-1), param((i-1)*(NYB-2)+j-1),    &
		        ROW_EX_1, COL_EX_1, cudaDataType(CUDA_R_64F), MatVal_EX_1, LDA_EX_1,      &
		        ipiv_EX_1, cudaDataType(CUDA_R_64F), bufferOnDevice_EX_1,                 &
		        workspaceInBytesOnDevice_EX_1, bufferOnHost_EX_1,                         &
		        workspaceInBytesOnHost_EX_1, devinfoX )
		istat = cusolverDnXgetrs( handle((i-1)*(NYB-2)+j-1), param((i-1)*(NYB-2)+j-1),    &
		        CUBLAS_OP_T, ROW_EX_1, 1, cudaDataType(CUDA_R_64F), MatVal_EX_1,          &
		        LDA_EX_1, ipiv_EX_1, cudaDataType(CUDA_R_64F), RHSVal_EX_1, LDB_EX_1,     &
		        devinfoX )

		! copy the solution for this (I,J) system back, on the same queue
		!$acc kernels async( (i-1)*(NYB-2)+j-1 )
		!$acc loop
		do k=2,NZB-1
			EX_1(I,J,k) = RHSVal_EX_1(k-1)
		enddo
		!$acc end kernels

	enddo
enddo

Unfortunately, these calculations produce incorrect results and require a large amount of memory.

these calculations produce incorrect results

That might be due to RHSVal_EX_1 and bufferOnDevice_EX_1 being shared across all of the streams, so the concurrent factorizations and solves overwrite each other's data.
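For example (a rough sketch; nqueues and the extra array dimension are placeholders, and the arrays should carry whatever device/managed attributes your originals have), give each queue its own right-hand side and workspace and index them by queue number:

! one RHS vector and one device workspace per queue, so concurrent
! factorizations/solves never touch the same memory
real(8),    allocatable :: RHSVal_EX_1(:,:)          ! (NZ-1, nqueues)
integer(1), allocatable :: bufferOnDevice_EX_1(:,:)  ! (bytes, nqueues)

allocate( RHSVal_EX_1(NZ-1, nqueues) )
allocate( bufferOnDevice_EX_1(workspaceInBytesOnDevice_EX_1, nqueues) )

! then pass RHSVal_EX_1(:,q) and bufferOnDevice_EX_1(:,q) to the
! cusolverDnXgetrf / cusolverDnXgetrs calls issued with handle(q);
! MatVal_EX_1 and ipiv_EX_1 would need the same per-queue treatment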

require a large amount of memory.

Not unexpected. Consider reducing the number of handles down to the number of kernels that can actually execute concurrently on the device, then process the systems in batches, reusing the handles.
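To illustrate, here is a rough sketch of that batching pattern. NBATCH and nsys are placeholders, not values from your code, and the assembly and cusolverDnXgetrf/cusolverDnXgetrs calls are the same ones you show above, just indexed by queue number:

program batch_sketch
   use openacc
   use cusolverDn
   implicit none
   integer, parameter :: NBATCH = 8       ! roughly what the GPU can overlap
   type(cusolverDnHandle) :: handle(NBATCH)
   integer :: istat, n, q, first, nsys

   ! create a small, fixed pool of handles, one per OpenACC queue/stream
   do n = 1, NBATCH
      istat = cusolverDnCreate(handle(n))
      istat = cusolverDnSetStream(handle(n), acc_get_cuda_stream(n))
   end do

   nsys = 1000                            ! total number of (I,J) systems
   do first = 1, nsys, NBATCH
      ! launch up to NBATCH systems, each on its own queue with its own
      ! per-queue matrix/RHS/workspace buffers
      do n = first, min(first + NBATCH - 1, nsys)
         q = n - first + 1
         ! assemble system n with "!$acc kernels async(q)", then call
         ! cusolverDnXgetrf / cusolverDnXgetrs with handle(q)/param(q)
         ! and the q-th slice of the work buffers
      end do
      ! wait for the whole batch to finish before the handles and buffers
      ! are reused by the next batch
      do q = 1, min(NBATCH, nsys - first + 1)
         !$acc wait(q)
      end do
   end do

   do n = 1, NBATCH
      istat = cusolverDnDestroy(handle(n))
   end do
end program batch_sketch

This way the handle count and the work-buffer memory scale with NBATCH rather than with the total number of systems.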