cuBLASDGEMM Time Efficieny

Hi,

I used cuBLASDGEMM twice in a subroutine and measured the times that each dgemm routine takes. The first one takes 100 times more than the second one. Below I wrote a similar structure that I used in my code. Actually this code cannot solve dgemm, I could not figure it out. The code has one main code, one subroutine and a Makefile.

The main code:

PROGRAM MAIN

USE OPENACC

IMPLICIT NONE

INTEGER I,J
DOUBLE PRECISION Z

! DIMENSIONS of ARRAYS
INTEGER, PARAMETER N = 1000000
INTEGER, PARAMETER P = 100000
INTEGER, PARAMETER NUMPPV = 100000

DOUBLE PRECISION X(NUMPPV,P)
DOUBLE PRECISION Y(N,P)
DOUBLE PRECISION V(N,P)
DOUBLE PRECISION PPVT(NUMPPV,N)
DOUBLE PRECISION PPV(N,NUMPPV)
DOUBLE PRECISION HLV(NUMPPV)

INTEGER F 
DOUBLE PRECISION TIMEONE,TIMETWO,TIMETHREE

TIMEONE = 0.0D0
TIMETWO = 0.0D0
TIMETHREE = 0.0D0

Z = 0.01

! MATRIX GENERATION ---------------------------------------------

V = 0.0D0

DO I = 1,NUMPPV
	HLV(I) = Z + Z*I
END DO

!$ACC DATA CREATE(X,Y,PPVT) COPYOUT(X,Y,PPVT)
!$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,NUMPPV
  DO J = 1,P
    X(I,J) = Z + 0.000001*I*J
  END DO
END DO 
!$ACC END PARALLEL LOOP

!$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,N
  DO J = 1,P
    Y(I,J) = Z + 0.00001*I*J
  END DO
END DO 
!$ACC END PARALLEL LOOP

!$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,NUMPPV
  DO J = 1,N
    PPV(J,I) = Z + 0.00001*I*J
    PPVT(I,J) = Z + 0.00001*I*J
  END DO
END DO 
!$ACC END PARALLEL LOOP
!$ACC END DATA
! ---------------------------------------------------------------

F = 100

!$ACC DATA PCOPY(Y,V,X,HLV,PPV,PPVT)
DO i = 1,F
  
  CALL FUNC(N,P,NUMPPV,Y,V,X,HLV,PPV,PPVT,TIMEONE,TIMETWO,TIMETHREE)	

END DO
!$ACC END DATA


WRITE(*,*) "FIRST cuBLAS CALL: ", TIMEONE
WRITE(*,*) "SECOND cuBLAS CALL: ", TIMETWO
WRITE(*,*) "THIRD cuBLAS CALL: ", TIMETHREE

END PROGRAM MAIN

The subroutine:

    SUBROUTINE FUNC(N,P,NUMPPV,Y,V,X,HLV,PPV,PPVT,TIMEONE,TIMETWO,TIMETHREE)

    USE OPENACC
    USE CUBLAS_V2
    USE OMP_LIB

    IMPLICIT NONE

    ! COUNTERS
    INTEGER K

    ! VARIABLES
    INTEGER N
    INTEGER P
    INTEGER NUMPPV

    DOUBLE PRECISION X(NUMPPV,P)
    DOUBLE PRECISION Y(N,P)
    DOUBLE PRECISION V(N,P)
    DOUBLE PRECISION PPVT(NUMPPV,N)
    DOUBLE PRECISION PPV(N,NUMPPV)
    DOUBLE PRECISION HLV(NUMPPV)

    INTEGER PRINT_SCREEN
    DOUBLE PRECISION TASK_START
    DOUBLE PRECISION TASK_END
    DOUBLE PRECISION TASK_DURATION
    DOUBLE PRECISION TIMEONE,TIMETWO,TIMETHREE

    ! CUDA VARIABLES
    TYPE(cublasHandle) :: handle
    TYPE(cublasHandle) :: zandle

    INTEGER :: status

    TASK_START = OMP_GET_WTIME()
! CUDA BLAS ROUTINE -----------------------------------------------------
    status =cublasCreate(handle)
    IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed"
    
    !$ACC HOST_DATA USE_DEVICE(Y(N,P), PPVT(:,:), X(NUMPPV,P))
status = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, NUMPPV, P, N, 1.0D0, PPVT(:,:), NUMPPV, Y(:,:), N, 0.0D0, X(:,:), NUMPPV)
    !$ACC END HOST_DATA
    
    IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
    status = cublasDestroy(handle)
! -----------------------------------------------------------------------
TASK_END = OMP_GET_WTIME() ; TASK_DURATION = TASK_END - TASK_START
TIMEONE = TIMEONE + TASK_DURATION

TASK_START = OMP_GET_WTIME()
    !$ACC PARALLEL LOOP
    DO K = 1,NUMPPV
      X(K,:) = HLV(K) * X(K,:)
    END DO
    !$ACC END PARALLEL LOOP
TASK_END = OMP_GET_WTIME() ; TASK_DURATION = TASK_END - TASK_START
TIMETWO = TIMETWO + TASK_DURATION

TASK_START = OMP_GET_WTIME()
    ! CUDA BLAS ROUTINE -----------------------------------------------------
    status =cublasCreate(zandle)
    IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed"
    
    !$ACC HOST_DATA USE_DEVICE(PPV(:,:), X(NUMPPV,P), V(N,P))
    status = cublasDgemm(zandle, CUBLAS_OP_N, CUBLAS_OP_N, N, P, NUMPPV, 1.0D0, PPV(:,:), N, X(:,:), NUMPPV, 1.0D0, V, N)
    !$ACC END HOST_DATA
    
    IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
    status = cublasDestroy(zandle)
! -----------------------------------------------------------------------
TASK_END = OMP_GET_WTIME() ; TASK_DURATION = TASK_END - TASK_START
TIMETHREE = TIMETHREE + TASK_DURATION

    END SUBROUTINE FUNC

The Makefile:

FC=nvfortran
TIMER=/usr/bin/time
OPT=
NOPT=-fast -Minfo=opt $(OPT)
FCFLAGS = -Mpreprocess -acc=gpu -cuda -cudalib=cublas

main: main.o func.0
$(FC) $(FCFLAGS) -o main main.o func.o $(NOPT) -ta:tesla:cc70 -Minfo=accel -acc -omp

main.o: main.f90 func.o
$(FC) -c main.f90

func.o: func.f90
$(FC) -c func.f90

clean:
rm -f *.o *.exe *.s *.mod a.out

I call dgemm several times in my code and it accumulates to a considerable time. It correctly multiplies the matrices yet why the first call takes more time bothers me while the dimensions of the multiplications are the same. Do you know why the first subroutine takes more time than the second?

Thank you for your attention,
Yunus

Hi Yunus,

It’s the time to initialize the cuBlas library. You can add a call to cublasInit before your timer to move when the initialization is perform so it doesn’t show up in your timing, but the overall time of the application would be the same.

    ! CUDA VARIABLES
    TYPE(cublasHandle) :: handle
    TYPE(cublasHandle) :: zandle

    INTEGER :: status
    status = cublasInit()

    TASK_START = OMP_GET_WTIME()

Hope this helps,
Mat

Hi Mat,

The code loops for 100 times and it calls 2 times cuBLAS at each of this 100, and it initializes 100 times the cuBLAS library. Is it the way in which it does?
Is there a way to initialize it once and keep it throughout the entire loop?

Thank you,
Yunus

I believe the initialization only needs to occur once (assuming you don’t call cublasShutdown). I only put it there to show that the time no longer appear in your timer output. You should be able to move the call early in the program.

Okay. I’ll try it. Thank you for your attention.