Hi Mat,
Can I parallelize a DO loop that contains a call to cublasDgemm in OpenACC? For example, see the code below.
PROGRAM GPUMATMUL
  USE CUBLAS
  IMPLICIT NONE
  INTEGER I, J, K, N, NR, NZ, NN
  PARAMETER (NR=20, NZ=100, NN=1280)
  DOUBLE PRECISION WORK(1:NR-1,1:NZ-1,2,0:NN-1)
  DOUBLE PRECISION TWORK(1:NR-1,1:NZ-1,2,0:NN-1)
  DOUBLE PRECISION H(1:NR-1,1:NZ-1,0:NN-1)
  DOUBLE PRECISION BZN(1:NZ-1,1:NZ-1), ID, TMP
  LOGICAL EQUAL

  EQUAL = .TRUE.

! Fill H with distinct values
  DO K=0, NN-1
    DO J=1, NZ-1
      DO I=1, NR-1
        ID = DFLOAT(J + (NZ-1)*(I-1))
        H(I,J,K) = ID
      END DO
    END DO
  END DO
  BZN = 1.0D0

!$ACC DATA COPYIN(H,BZN) COPYOUT(WORK)
!$ACC HOST_DATA USE_DEVICE(H,BZN,WORK)
! One DGEMM per K slice: WORK(:,:,1,K) = H(:,:,K) * BZN
  DO K=0, NN-1
    CALL CUBLASDGEMM('N','N',NR-1,NZ-1,NZ-1,1.D0,H(1,1,K),NR-1,BZN, &
                     NZ-1,0.D0,WORK(1,1,1,K),NR-1)
  END DO
!$ACC END HOST_DATA
!$ACC END DATA
END PROGRAM GPUMATMUL
Is it possible to do all the matrix multiplications at once in one kernel using OpenACC? Or is there another way to do this?
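To clarify what I mean by "all at once": I was wondering whether something like cublasDgemmStridedBatched could replace the K loop entirely. Here is a rough sketch of what I have in mind; the handle-based interface, enum names, and exact argument order are my assumptions from reading the cuBLAS C API, and I have not verified how the CUBLAS module exposes them in Fortran:

```fortran
! Hypothetical sketch: replace the K loop with one strided-batched DGEMM.
! All names below (cublasCreate, CUBLAS_OP_N, cublasDgemmStridedBatched)
! are assumed from the cuBLAS v2 C API, not verified in the Fortran module.
  TYPE(CUBLASHANDLE) :: HANDLE
  INTEGER :: ISTAT

  ISTAT = CUBLASCREATE(HANDLE)
!$ACC HOST_DATA USE_DEVICE(H,BZN,WORK)
  ISTAT = CUBLASDGEMMSTRIDEDBATCHED(HANDLE, CUBLAS_OP_N, CUBLAS_OP_N, &
          NR-1, NZ-1, NZ-1, 1.D0,                                     &
          H,    NR-1, (NR-1)*(NZ-1),    & ! stride between H(:,:,K) slices
          BZN,  NZ-1, 0,                & ! stride 0: every batch reuses BZN
          0.D0,                         &
          WORK, NR-1, 2*(NR-1)*(NZ-1),  & ! stride between WORK(:,:,1,K) slices
          NN)                             ! batch count
!$ACC END HOST_DATA
  ISTAT = CUBLASDESTROY(HANDLE)
```

The strides are just the distances in elements between consecutive K slices of each array, so if the bindings match the C API this should compute the same results as the loop in one launch.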
On a side note, I have a question regarding cublasDgemm. From my reading of the OpenACC documentation, my understanding is that cublasDgemm is called on the host, which then launches the multiplication on the GPU using the device pointers supplied by the use_device clause; the matrix multiplication itself is performed on the device. Is this correct?