Interfacing OpenACC with cublasDgetrfBatched/cusparseDgtsvStridedBatch in Fortran

jah87 · April 10, 2014, 5:15pm

I’ve been working on a Fortran code which uses the cuBLAS batched LU and cuSPARSE batched tridiagonal solver as part of a BiCG iterative solver with ADI preconditioner.I’m using a Kepler K20X with compute capability 3.5 and CUDA 5.5. I’m doing this without PGI’s CUDA Fortran, so I’m writing my own interfaces:

FUNCTION cublasDgetrfBatched(handle, n, dA, ldda, dP, dInfo, nbatch) BIND(C, NAME="cublasDgetrfBatched")
    USE, INTRINSIC :: ISO_C_BINDING
    INTEGER(KIND(CUBLAS_STATUS_SUCCESS)) :: cublasDgetrfBatched
    TYPE(C_PTR), VALUE :: handle
    INTEGER(C_INT), VALUE :: n
    TYPE(C_PTR), VALUE :: dA
    INTEGER(C_INT), VALUE :: ldda
    TYPE(C_PTR), VALUE :: dP
    TYPE(C_PTR), VALUE :: dInfo
    INTEGER(C_INT), VALUE :: nbatch
END FUNCTION cublasDgetrfBatched

I allocate pinned memory on the host with cudaHostAlloc, allocate the device memory for the matrices and the device array containing the device pointers to the matrices, asynchronously copy each matrix to the device, perform the operations, and then asynchronously copy the decomposed matrix and pivots back to the host to perform the back-substitution with a single right-hand side:

REAL(8), POINTER, DIMENSION(:,:,:) :: A
INTEGER, DIMENSION(:,:), POINTER :: ipiv
TYPE(C_PTR) :: cPtr_A, cPtr_ipiv
TYPE(C_PTR), ALLOCATABLE, DIMENSION(:), TARGET :: dPtr_A
TYPE(C_PTR) :: dPtr_ipiv, dPtr_A_d, dPtr_info
INTEGER(C_SIZE_T) :: sizeof_A, sizeof_ipiv

...

stat = cudaHostAlloc(cPtr_A, sizeof_A, cudaHostAllocDefault)
CALL C_F_POINTER(cPtr_A, A, (/m,m,nbatch/))
stat = cudaHostAlloc(cPtr_ipiv, sizeof_ipiv, cudaHostAllocDefault)
CALL C_F_POINTER(cPtr_ipiv, ipiv, (/m,nbatch/))

ALLOCATE(dPtr_A(nbatch))
DO ibatch=1,nbatch
  stat = cudaMalloc(dPtr_A(ibatch), m*m*sizeof_double)
END DO
stat = cudaMalloc(dPtr_A_d, nbatch*sizeof_cptr)
stat = cublasSetVector(nbatch, sizeof_cptr, C_LOC(dPtr_A(1)), 1, dPtr_A_d, 1)
stat = cudaMalloc(dPtr_ipiv, m*nbatch*sizeof_cint)
stat = cudaMalloc(dPtr_info, nbatch*sizeof_cint)

...

!$OMP PARALLEL DEFAULT(shared) PRIVATE( stat, ibatch )
!$OMP DO
DO ibatch = 1,nbatch
  stat = cublasSetMatrixAsync(m, m, sizeof_double, C_LOC(A(1,1,ibatch)), m, dPtr_A(ibatch), m, mystream)
END DO
!$OMP END DO
!$OMP END PARALLEL

...

stat = cublasDgetrfBatched(cublas_handle, m, dPtr_A_d, m, dPtr_ipiv, dPtr_info, nbatch)

...

stat = cublasGetMatrixAsync(m, nbatch, sizeof_cint, dPtr_ipiv, m, C_LOC(ipiv(1,1)), m, mystream)

!$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, stat )
!$OMP DO
DO ibatch = 1,nbatch
  stat = cublasGetMatrixAsync(m, m, sizeof_double, dPtr_A(ibatch), m, C_LOC(A(1,1,ibatch)), m, mystream)
END DO
!$OMP END DO
!$OMP END PARALLEL

...

!$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, x, stat )
!$OMP DO
DO ibatch = 1,nbatch
  x = rhs(:,ibatch)
  CALL dgetrs( 'N', m, 1, A(1,1,ibatch), m, ipiv(1,ibatch), x(1), m, info )
  rhs(:,ibatch) = x
END DO
!$OMP END DO
!$OMP END PARALLEL

...

I’d rather not have to do this last step, but the cublasDtrsmBatched routine limits the matrix size to 32, and mine are size 80 (a batched Dtrsv would be better, but this doesn’t exist). The cost of launching multiple individual cublasDtrsv kernels makes performing the back-sub on the device untenable.

There are other operations which I need to perform between calls to cublasDgetrfBatched and cusparseDgtsvStridedBatch. Most of these are currently being performed on the host with OpenMP being used to parallelize the loops at the batched level. Some of the operations, like matrix-vector multiplication for each of the matrices being decomposed for example, are being computed on the device with OpenACC:

!$ACC DATA COPYIN(A) COPYIN(x) COPYOUT(Ax)

...

!$ACC KERNELS
  DO ibatch = 1,nbatch
    DO i = 1,m
      Ax(i,ibatch) = zero
    END DO
    DO j = 1,m
      DO i = 1,m
        Ax(i,ibatch) = Ax(i,ibatch) + A(i,j,ibatch)*x(j,ibatch)
      END DO
    END DO
  END DO
!$ACC END KERNELS

...

!$ACC END DATA

I’d like to place more of the computation on the GPU with OpenACC, but to do so I need to be able to interface the two. Something like the following:

!$ACC DATA COPYIN(A) CREATE(info,A_d) COPYOUT(ipiv)

!$ACC HOST_DATA USE_DEVICE(A)
DO ibatch = 1,nbatch
  A_d(ibatch) = acc_deviceptr(A(1,1,ibatch))
END DO
!$ACC END HOST_DATA

...

!$ACC HOST_DATA USE_DEVICE(ipiv,info)
stat = cublasDgetrfBatched(cublas_handle, m, Adense_d, m, ipiv, info, nbatch)
!$ACC END HOST_DATA

...

!$ACC END DATA

I know the host_data construct with the host_device clauses would be appropriate in most cases, but since I need to actually pass to cuBLAS a device array containing the pointers to the matrices on the device, I’m not sure how to proceed.

Can anyone offer any insight?

Thanks

mfatica · April 11, 2014, 5:04pm

Have you considered using CUDA Fortran, since you are already using the PGI compiler?

This is an example that shows how to call the Cublas batched routines from CUDA Fortran:

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-how-call-batched-cublas-routines-cuda-fortran/

Using the CUF kernels, you can easily auto generate kernels for loop and reductions.

More examples are at http://devblogs.nvidia.com/parallelforall/tag/cuda-fortran/
or you can take a look at the “CUDA Fortran for Scientists and Engineers” book.

jah87 · April 11, 2014, 5:16pm

I’m not using the PGI compiler. In my experience, PGI just isn’t up to snuff with CPU optimizations compared to the Cray or Intel compilers. Compiling a library for only the GPU-intensive sections of my code with PGI may not be out of the question, though. I am able to call the cuBLAS batched routines from Fortran without CUDA Fortran, just not in conjunction with OpenACC.

Topic		Replies	Views
Interfacing OpenACC with cublasDgetrfBatched/cusparseDgtsvStridedBatch in Fortran CUDA Programming and Performance	0	602	April 10, 2014
openacc with cublas batched routine in fortran Legacy PGI Compilers	7	8325	January 27, 2017
CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran Technical Blog	4	444	April 27, 2015
PGF90-S-0155-Could not resolve generic procedure cublasdgemm Legacy PGI Compilers	4	4029	November 10, 2018
Calling CUBLAS' cublasDgetrfBatched proper procedure Legacy PGI Compilers	3	4952	January 30, 2017
OpenACC with cuBLAS and cuSPARSE in Fortran code Legacy PGI Compilers	7	8574	February 22, 2016
Using Cublas in Device Kernels Legacy PGI Compilers	0	5628	August 14, 2013
How to parallelize nested loops with cublas and cusolver nvc, nvc++ and nvfortran	6	194	November 25, 2024
Lapack calls from the gpu (using openacc in nvfortran) nvc, nvc++ and nvfortran	2	622	May 11, 2023
Trouble with cublas and matrix inverse in Fortran Legacy PGI Compilers	1	1980	April 9, 2018

Interfacing OpenACC with cublasDgetrfBatched/cusparseDgtsvStridedBatch in Fortran

Related topics