Hi,
I used cuBLASDGEMM twice in a subroutine and measured the times that each dgemm routine takes. The first one takes 100 times more than the second one. Below I wrote a similar structure that I used in my code. Actually this code cannot solve dgemm, I could not figure it out. The code has one main code, one subroutine and a Makefile.
The main code:
PROGRAM MAIN
USE OPENACC
IMPLICIT NONE
INTEGER I,J
DOUBLE PRECISION Z
! DIMENSIONS of ARRAYS
INTEGER, PARAMETER N = 1000000
INTEGER, PARAMETER P = 100000
INTEGER, PARAMETER NUMPPV = 100000
DOUBLE PRECISION X(NUMPPV,P)
DOUBLE PRECISION Y(N,P)
DOUBLE PRECISION V(N,P)
DOUBLE PRECISION PPVT(NUMPPV,N)
DOUBLE PRECISION PPV(N,NUMPPV)
DOUBLE PRECISION HLV(NUMPPV)
INTEGER F
DOUBLE PRECISION TIMEONE,TIMETWO,TIMETHREE
TIMEONE = 0.0D0
TIMETWO = 0.0D0
TIMETHREE = 0.0D0
Z = 0.01
! MATRIX GENERATION ---------------------------------------------
V = 0.0D0
DO I = 1,NUMPPV
HLV(I) = Z + Z*I
END DO
!$ACC DATA CREATE(X,Y,PPVT) COPYOUT(X,Y,PPVT)
!$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,NUMPPV
DO J = 1,P
X(I,J) = Z + 0.000001*I*J
END DO
END DO
!$ACC END PARALLEL LOOP
!$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,N
DO J = 1,P
Y(I,J) = Z + 0.00001*I*J
END DO
END DO
!$ACC END PARALLEL LOOP
!$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,NUMPPV
DO J = 1,N
PPV(J,I) = Z + 0.00001*I*J
PPVT(I,J) = Z + 0.00001*I*J
END DO
END DO
!$ACC END PARALLEL LOOP
!$ACC END DATA
! ---------------------------------------------------------------
F = 100
!$ACC DATA PCOPY(Y,V,X,HLV,PPV,PPVT)
DO i = 1,F
CALL FUNC(N,P,NUMPPV,Y,V,X,HLV,PPV,PPVT,TIMEONE,TIMETWO,TIMETHREE)
END DO
!$ACC END DATA
WRITE(*,*) "FIRST cuBLAS CALL: ", TIMEONE
WRITE(*,*) "SECOND cuBLAS CALL: ", TIMETWO
WRITE(*,*) "THIRD cuBLAS CALL: ", TIMETHREE
END PROGRAM MAIN
The subroutine:
SUBROUTINE FUNC(N,P,NUMPPV,Y,V,X,HLV,PPV,PPVT,TIMEONE,TIMETWO,TIMETHREE)
USE OPENACC
USE CUBLAS_V2
USE OMP_LIB
IMPLICIT NONE
! COUNTERS
INTEGER K
! VARIABLES
INTEGER N
INTEGER P
INTEGER NUMPPV
DOUBLE PRECISION X(NUMPPV,P)
DOUBLE PRECISION Y(N,P)
DOUBLE PRECISION V(N,P)
DOUBLE PRECISION PPVT(NUMPPV,N)
DOUBLE PRECISION PPV(N,NUMPPV)
DOUBLE PRECISION HLV(NUMPPV)
INTEGER PRINT_SCREEN
DOUBLE PRECISION TASK_START
DOUBLE PRECISION TASK_END
DOUBLE PRECISION TASK_DURATION
DOUBLE PRECISION TIMEONE,TIMETWO,TIMETHREE
! CUDA VARIABLES
TYPE(cublasHandle) :: handle
TYPE(cublasHandle) :: zandle
INTEGER :: status
TASK_START = OMP_GET_WTIME()
! CUDA BLAS ROUTINE -----------------------------------------------------
status =cublasCreate(handle)
IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed"
!$ACC HOST_DATA USE_DEVICE(Y(N,P), PPVT(:,:), X(NUMPPV,P))
status = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, NUMPPV, P, N, 1.0D0, PPVT(:,:), NUMPPV, Y(:,:), N, 0.0D0, X(:,:), NUMPPV)
!$ACC END HOST_DATA
IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
status = cublasDestroy(handle)
! -----------------------------------------------------------------------
TASK_END = OMP_GET_WTIME() ; TASK_DURATION = TASK_END - TASK_START
TIMEONE = TIMEONE + TASK_DURATION
TASK_START = OMP_GET_WTIME()
!$ACC PARALLEL LOOP
DO K = 1,NUMPPV
X(K,:) = HLV(K) * X(K,:)
END DO
!$ACC END PARALLEL LOOP
TASK_END = OMP_GET_WTIME() ; TASK_DURATION = TASK_END - TASK_START
TIMETWO = TIMETWO + TASK_DURATION
TASK_START = OMP_GET_WTIME()
! CUDA BLAS ROUTINE -----------------------------------------------------
status =cublasCreate(zandle)
IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed"
!$ACC HOST_DATA USE_DEVICE(PPV(:,:), X(NUMPPV,P), V(N,P))
status = cublasDgemm(zandle, CUBLAS_OP_N, CUBLAS_OP_N, N, P, NUMPPV, 1.0D0, PPV(:,:), N, X(:,:), NUMPPV, 1.0D0, V, N)
!$ACC END HOST_DATA
IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
status = cublasDestroy(zandle)
! -----------------------------------------------------------------------
TASK_END = OMP_GET_WTIME() ; TASK_DURATION = TASK_END - TASK_START
TIMETHREE = TIMETHREE + TASK_DURATION
END SUBROUTINE FUNC
The Makefile:
FC=nvfortran
TIMER=/usr/bin/time
OPT=
NOPT=-fast -Minfo=opt $(OPT)
FCFLAGS = -Mpreprocess -acc=gpu -cuda -cudalib=cublas
main: main.o func.0
$(FC) $(FCFLAGS) -o main main.o func.o $(NOPT) -ta:tesla:cc70 -Minfo=accel -acc -omp
main.o: main.f90 func.o
$(FC) -c main.f90
func.o: func.f90
$(FC) -c func.f90
clean:
rm -f *.o *.exe *.s *.mod a.out
I call dgemm several times in my code and it accumulates to a considerable time. It correctly multiplies the matrices yet why the first call takes more time bothers me while the dimensions of the multiplications are the same. Do you know why the first subroutine takes more time than the second?
Thank you for your attention,
Yunus