I’ve been working on a Fortran code which uses the cuBLAS batched LU and cuSPARSE batched tridiagonal solver as part of a BiCG iterative solver with ADI preconditioner.I’m using a Kepler K20X with compute capability 3.5 and CUDA 5.5. I’m doing this without PGI’s CUDA Fortran, so I’m writing my own interfaces:
FUNCTION cublasDgetrfBatched(handle, n, dA, ldda, dP, dInfo, nbatch) BIND(C, NAME="cublasDgetrfBatched") USE, INTRINSIC :: ISO_C_BINDING INTEGER(KIND(CUBLAS_STATUS_SUCCESS)) :: cublasDgetrfBatched TYPE(C_PTR), VALUE :: handle INTEGER(C_INT), VALUE :: n TYPE(C_PTR), VALUE :: dA INTEGER(C_INT), VALUE :: ldda TYPE(C_PTR), VALUE :: dP TYPE(C_PTR), VALUE :: dInfo INTEGER(C_INT), VALUE :: nbatch END FUNCTION cublasDgetrfBatched
I allocate pinned memory on the host with cudaHostAlloc, allocate the device memory for the matrices and the device array containing the device pointers to the matrices, asynchronously copy each matrix to the device, perform the operations, and then asynchronously copy the decomposed matrix and pivots back to the host to perform the back-substitution with a single right-hand side:
REAL(8), POINTER, DIMENSION(:,:,:) :: A INTEGER, DIMENSION(:,:), POINTER :: ipiv TYPE(C_PTR) :: cPtr_A, cPtr_ipiv TYPE(C_PTR), ALLOCATABLE, DIMENSION(:), TARGET :: dPtr_A TYPE(C_PTR) :: dPtr_ipiv, dPtr_A_d, dPtr_info INTEGER(C_SIZE_T) :: sizeof_A, sizeof_ipiv ... stat = cudaHostAlloc(cPtr_A, sizeof_A, cudaHostAllocDefault) CALL C_F_POINTER(cPtr_A, A, (/m,m,nbatch/)) stat = cudaHostAlloc(cPtr_ipiv, sizeof_ipiv, cudaHostAllocDefault) CALL C_F_POINTER(cPtr_ipiv, ipiv, (/m,nbatch/)) ALLOCATE(dPtr_A(nbatch)) DO ibatch=1,nbatch stat = cudaMalloc(dPtr_A(ibatch), m*m*sizeof_double) END DO stat = cudaMalloc(dPtr_A_d, nbatch*sizeof_cptr) stat = cublasSetVector(nbatch, sizeof_cptr, C_LOC(dPtr_A(1)), 1, dPtr_A_d, 1) stat = cudaMalloc(dPtr_ipiv, m*nbatch*sizeof_cint) stat = cudaMalloc(dPtr_info, nbatch*sizeof_cint) ... !$OMP PARALLEL DEFAULT(shared) PRIVATE( stat, ibatch ) !$OMP DO DO ibatch = 1,nbatch stat = cublasSetMatrixAsync(m, m, sizeof_double, C_LOC(A(1,1,ibatch)), m, dPtr_A(ibatch), m, mystream) END DO !$OMP END DO !$OMP END PARALLEL ... stat = cublasDgetrfBatched(cublas_handle, m, dPtr_A_d, m, dPtr_ipiv, dPtr_info, nbatch) ... stat = cublasGetMatrixAsync(m, nbatch, sizeof_cint, dPtr_ipiv, m, C_LOC(ipiv(1,1)), m, mystream) !$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, stat ) !$OMP DO DO ibatch = 1,nbatch stat = cublasGetMatrixAsync(m, m, sizeof_double, dPtr_A(ibatch), m, C_LOC(A(1,1,ibatch)), m, mystream) END DO !$OMP END DO !$OMP END PARALLEL ... !$OMP PARALLEL DEFAULT(shared) PRIVATE( ibatch, x, stat ) !$OMP DO DO ibatch = 1,nbatch x = rhs(:,ibatch) CALL dgetrs( 'N', m, 1, A(1,1,ibatch), m, ipiv(1,ibatch), x(1), m, info ) rhs(:,ibatch) = x END DO !$OMP END DO !$OMP END PARALLEL ...
I’d rather not have to do this last step, but the cublasDtrsmBatched routine limits the matrix size to 32, and mine are size 80 (a batched Dtrsv would be better, but this doesn’t exist). The cost of launching multiple individual cublasDtrsv kernels makes performing the back-sub on the device untenable.
There are other operations which I need to perform between calls to cublasDgetrfBatched and cusparseDgtsvStridedBatch. Most of these are currently being performed on the host with OpenMP being used to parallelize the loops at the batched level. Some of the operations, like matrix-vector multiplication for each of the matrices being decomposed for example, are being computed on the device with OpenACC:
!$ACC DATA COPYIN(A) COPYIN(x) COPYOUT(Ax) ... !$ACC KERNELS DO ibatch = 1,nbatch DO i = 1,m Ax(i,ibatch) = zero END DO DO j = 1,m DO i = 1,m Ax(i,ibatch) = Ax(i,ibatch) + A(i,j,ibatch)*x(j,ibatch) END DO END DO END DO !$ACC END KERNELS ... !$ACC END DATA
I’d like to place more of the computation on the GPU with OpenACC, but to do so I need to be able to interface the two. Something like the following:
!$ACC DATA COPYIN(A) CREATE(info,A_d) COPYOUT(ipiv) !$ACC HOST_DATA USE_DEVICE(A) DO ibatch = 1,nbatch A_d(ibatch) = acc_deviceptr(A(1,1,ibatch)) END DO !$ACC END HOST_DATA ... !$ACC HOST_DATA USE_DEVICE(ipiv,info) stat = cublasDgetrfBatched(cublas_handle, m, Adense_d, m, ipiv, info, nbatch) !$ACC END HOST_DATA ... !$ACC END DATA
I know the host_data construct with the host_device clauses would be appropriate in most cases, but since I need to actually pass to cuBLAS a device array containing the pointers to the matrices on the device, I’m not sure how to proceed.
Can anyone offer any insight?