cuBLASDGEMM Time Efficieny

Hi,

I used cuBLASDGEMM twice in a subroutine and measured the times that each dgemm routine takes. The first one takes 100 times more than the second one. Below I wrote a similar structure that I used in my code. Actually this code cannot solve dgemm, I could not figure it out. The code has one main code, one subroutine and a Makefile.

The main code:

``````PROGRAM MAIN

USE OPENACC

IMPLICIT NONE

INTEGER I,J
DOUBLE PRECISION Z

! DIMENSIONS of ARRAYS
INTEGER, PARAMETER N = 1000000
INTEGER, PARAMETER P = 100000
INTEGER, PARAMETER NUMPPV = 100000

DOUBLE PRECISION X(NUMPPV,P)
DOUBLE PRECISION Y(N,P)
DOUBLE PRECISION V(N,P)
DOUBLE PRECISION PPVT(NUMPPV,N)
DOUBLE PRECISION PPV(N,NUMPPV)
DOUBLE PRECISION HLV(NUMPPV)

INTEGER F
DOUBLE PRECISION TIMEONE,TIMETWO,TIMETHREE

TIMEONE = 0.0D0
TIMETWO = 0.0D0
TIMETHREE = 0.0D0

Z = 0.01

! MATRIX GENERATION ---------------------------------------------

V = 0.0D0

DO I = 1,NUMPPV
HLV(I) = Z + Z*I
END DO

!\$ACC DATA CREATE(X,Y,PPVT) COPYOUT(X,Y,PPVT)
!\$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,NUMPPV
DO J = 1,P
X(I,J) = Z + 0.000001*I*J
END DO
END DO
!\$ACC END PARALLEL LOOP

!\$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,N
DO J = 1,P
Y(I,J) = Z + 0.00001*I*J
END DO
END DO
!\$ACC END PARALLEL LOOP

!\$ACC PARALLEL LOOP COLLAPSE(2)
DO I = 1,NUMPPV
DO J = 1,N
PPV(J,I) = Z + 0.00001*I*J
PPVT(I,J) = Z + 0.00001*I*J
END DO
END DO
!\$ACC END PARALLEL LOOP
!\$ACC END DATA
! ---------------------------------------------------------------

F = 100

!\$ACC DATA PCOPY(Y,V,X,HLV,PPV,PPVT)
DO i = 1,F

CALL FUNC(N,P,NUMPPV,Y,V,X,HLV,PPV,PPVT,TIMEONE,TIMETWO,TIMETHREE)

END DO
!\$ACC END DATA

WRITE(*,*) "FIRST cuBLAS CALL: ", TIMEONE
WRITE(*,*) "SECOND cuBLAS CALL: ", TIMETWO
WRITE(*,*) "THIRD cuBLAS CALL: ", TIMETHREE

END PROGRAM MAIN
``````

The subroutine:

``````    SUBROUTINE FUNC(N,P,NUMPPV,Y,V,X,HLV,PPV,PPVT,TIMEONE,TIMETWO,TIMETHREE)

USE OPENACC
USE CUBLAS_V2
USE OMP_LIB

IMPLICIT NONE

! COUNTERS
INTEGER K

! VARIABLES
INTEGER N
INTEGER P
INTEGER NUMPPV

DOUBLE PRECISION X(NUMPPV,P)
DOUBLE PRECISION Y(N,P)
DOUBLE PRECISION V(N,P)
DOUBLE PRECISION PPVT(NUMPPV,N)
DOUBLE PRECISION PPV(N,NUMPPV)
DOUBLE PRECISION HLV(NUMPPV)

INTEGER PRINT_SCREEN
DOUBLE PRECISION TIMEONE,TIMETWO,TIMETHREE

! CUDA VARIABLES
TYPE(cublasHandle) :: handle
TYPE(cublasHandle) :: zandle

INTEGER :: status

! CUDA BLAS ROUTINE -----------------------------------------------------
status =cublasCreate(handle)
IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed"

!\$ACC HOST_DATA USE_DEVICE(Y(N,P), PPVT(:,:), X(NUMPPV,P))
status = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, NUMPPV, P, N, 1.0D0, PPVT(:,:), NUMPPV, Y(:,:), N, 0.0D0, X(:,:), NUMPPV)
!\$ACC END HOST_DATA

IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
status = cublasDestroy(handle)
! -----------------------------------------------------------------------

!\$ACC PARALLEL LOOP
DO K = 1,NUMPPV
X(K,:) = HLV(K) * X(K,:)
END DO
!\$ACC END PARALLEL LOOP

! CUDA BLAS ROUTINE -----------------------------------------------------
status =cublasCreate(zandle)
IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed"

!\$ACC HOST_DATA USE_DEVICE(PPV(:,:), X(NUMPPV,P), V(N,P))
status = cublasDgemm(zandle, CUBLAS_OP_N, CUBLAS_OP_N, N, P, NUMPPV, 1.0D0, PPV(:,:), N, X(:,:), NUMPPV, 1.0D0, V, N)
!\$ACC END HOST_DATA

IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
status = cublasDestroy(zandle)
! -----------------------------------------------------------------------

END SUBROUTINE FUNC
``````

The Makefile:

FC=nvfortran
TIMER=/usr/bin/time
OPT=
NOPT=-fast -Minfo=opt \$(OPT)
FCFLAGS = -Mpreprocess -acc=gpu -cuda -cudalib=cublas

main: main.o func.0
\$(FC) \$(FCFLAGS) -o main main.o func.o \$(NOPT) -ta:tesla:cc70 -Minfo=accel -acc -omp

main.o: main.f90 func.o
\$(FC) -c main.f90

func.o: func.f90
\$(FC) -c func.f90

clean:
rm -f *.o *.exe *.s *.mod a.out

I call dgemm several times in my code and it accumulates to a considerable time. It correctly multiplies the matrices yet why the first call takes more time bothers me while the dimensions of the multiplications are the same. Do you know why the first subroutine takes more time than the second?

Yunus

Hi Yunus,

It’s the time to initialize the cuBlas library. You can add a call to cublasInit before your timer to move when the initialization is perform so it doesn’t show up in your timing, but the overall time of the application would be the same.

``````    ! CUDA VARIABLES
TYPE(cublasHandle) :: handle
TYPE(cublasHandle) :: zandle

INTEGER :: status
status = cublasInit()

``````

Hope this helps,
Mat

Hi Mat,

The code loops for 100 times and it calls 2 times cuBLAS at each of this 100, and it initializes 100 times the cuBLAS library. Is it the way in which it does?
Is there a way to initialize it once and keep it throughout the entire loop?

Thank you,
Yunus

I believe the initialization only needs to occur once (assuming you don’t call cublasShutdown). I only put it there to show that the time no longer appear in your timer output. You should be able to move the call early in the program.

Okay. I’ll try it. Thank you for your attention.