cudaMemcpy slows down cublas_Get_Matrix after cublas_dGEMM operations

I am developing a Fortran 2003 library for deep learning using CUDA. The problem I have noticed is that my code ends up slower than CPU BLAS routines, because the cudaMemcpy issued by the cublas_Get_Matrix routine takes a very long time. I have written a small Fortran program to demonstrate the issue:

MODULE timermodule
   IMPLICIT NONE
   INTEGER, PARAMETER :: dp1 = SELECTED_REAL_KIND(14)
   TYPE, PUBLIC :: timer
      REAL(KIND=dp1) :: saved_time
      CHARACTER (len=100) :: mes=""
   CONTAINS
      PROCEDURE, PUBLIC :: tic => gettime
      PROCEDURE, PUBLIC :: toc => givetime
   END TYPE timer
   PRIVATE :: gettime , givetime
CONTAINS
   SUBROUTINE gettime(start , mes)
      CLASS(timer),INTENT(OUT) :: start
      CHARACTER(len=*) , INTENT(in) , OPTIONAL :: mes
      INTEGER, DIMENSION(8) :: time1
      CALL DATE_AND_TIME(values=time1)
      ! Seconds since midnight (fine for intervals that do not cross midnight)
      start%saved_time = time1(5)*3600 + time1(6)*60 + time1(7) + time1(8)/1000._dp1
      IF (PRESENT(mes)) THEN
         start%mes = mes
      ELSE
         start%mes = "this part"
      END IF
   END SUBROUTINE gettime

   SUBROUTINE givetime(last_time)
      CLASS(timer), INTENT(IN) :: last_time
      INTEGER, DIMENSION(8) :: time1
      REAL(KIND=dp1) :: stoptime, givetime1
      INTEGER :: mnt
      REAL(kind=dp1) :: scnd
      CALL DATE_AND_TIME(values=time1)
      stoptime = time1(5)*3600 + time1(6)*60 + time1(7) + time1(8)/1000._dp1
      givetime1 = stoptime - last_time%saved_time
      mnt  = INT(givetime1/60._dp1)
      scnd = givetime1 - 60*mnt
      IF (mnt < 1) THEN
         WRITE (*,100) TRIM(last_time%mes),givetime1
      ELSEIF (mnt==1) THEN
         WRITE (*,200) TRIM(last_time%mes),scnd
      ELSE
         WRITE (*,300) TRIM(last_time%mes),mnt,scnd
      END IF
100   FORMAT ('  The time spent on ' , A , ' was ' , f6.3 , ' seconds.' )
200   FORMAT ('  The time spent on ' , A , ' was 1 minute and ' , &
           f5.2 , ' seconds.')
300   FORMAT ('  The time spent on ' , A , ' was ' , I2 , &
           ' minutes and ' , f5.2 , ' seconds.')
   END SUBROUTINE givetime
END MODULE timermodule

PROGRAM example_dgemm
   USE iso_c_binding
   USE timermodule
   IMPLICIT NONE
   ! Define 3 double precision matrices A, B, C
   INTEGER, PARAMETER :: dp=SELECTED_REAL_KIND(14)
   REAL(dp), DIMENSION(:,:), ALLOCATABLE :: a,b,c,c2,c3
   INTEGER*8 :: devPtrA, devPtrB, devPtrC
   INTEGER :: j, m=5000, k=6000, n=7000, size_of_real=8, rounds=1
   TYPE(timer) :: t
   PRINT*,'================',m,'x',k,'x',n, "=================="
   ALLOCATE (a(m,k),b(k,n),c(m,n),c2(m,n),c3(m,n))
   CALL RANDOM_NUMBER(a)
   CALL RANDOM_NUMBER(b)
   CALL RANDOM_NUMBER(c)
   c2 = c   ! same starting values for the CPU runs (DGEMM is called with beta = 1)
   c3 = c

   PRINT*,'-------------------GPU DGEMM---------------------'

   CALL cublas_init()
   CALL cublas_Alloc(m*k,size_of_real, devPtrA)
   CALL cublas_Alloc(k*n,size_of_real, devPtrB)
   CALL cublas_Alloc(m*n,size_of_real, devPtrC)
   ! Copy data to GPU
   CALL cublas_Set_Matrix(m,k,size_of_real,a,m,devPtrA,m)
   CALL cublas_Set_Matrix(k,n,size_of_real,b,k,devPtrB,k)
   CALL cublas_Set_Matrix(m,n,size_of_real,c,m,devPtrC,m)
   CALL t%tic("GPU DGEMM")
   DO j=1,rounds
      CALL cublas_dGEMM('n','n', m,n,k,1._dp,devPtrA,m,devPtrB,k,1._dp,devPtrC,m)
      PRINT*, j, 'done.'
   END DO
   ! Copy the result back to the host
   CALL cublas_Get_Matrix(m,n,size_of_real,devPtrC,m,c,m)
   CALL t%toc
   CALL cublas_free(devPtrA)
   CALL cublas_free(devPtrB)
   CALL cublas_free(devPtrC)
   CALL cublas_shutdown()
   PRINT *,SUM(c(m,:))

   CALL t%tic("MATMUL")
   DO j=1,rounds
      c2 = MATMUL(a,b) + c2
   END DO
   CALL t%toc
   PRINT*, SUM(c2(m,:))

   PRINT*,'-------------------CPU DGEMM---------------------'
   CALL t%tic("CPU DGEMM")
   DO j=1,rounds
      CALL dGEMM('n','n', m,n,k,1._dp,a,m,b,k,1._dp,c3,m)
   END DO
   CALL t%toc

   DEALLOCATE (a,b,c,c2,c3)

END PROGRAM example_dgemm

And this is the profiling result:

==19395== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.62%  9.50369s         1  9.50369s  9.50369s  9.50369s  maxwell_dgemm_128x64_nn
                    1.78%  173.26ms         4  43.315ms  1.4080us  66.976ms  [CUDA memcpy HtoD]
                    0.60%  58.455ms         1  58.455ms  58.455ms  58.455ms  [CUDA memcpy DtoH]
      API calls:   96.40%  9.73593s         5  1.94719s  12.564us  9.56227s  cudaMemcpy
                    2.88%  290.77ms         7  41.538ms  3.3400us  289.91ms  cudaMalloc
                    0.71%  71.525ms         8  8.9406ms     307ns  61.490ms  cudaFree
                    0.00%  346.42us       191  1.8130us     111ns  143.69us  cuDeviceGetAttribute
                    0.00%  278.87us         2  139.44us  37.244us  241.63us  cuDeviceGetName
                    0.00%  169.56us         2  84.781us  76.575us  92.988us  cuDeviceTotalMem
                    0.00%  32.633us         1  32.633us  32.633us  32.633us  cudaLaunchKernel
                    0.00%  8.8360us        18     490ns     360ns  1.7670us  cudaEventCreateWithFlags
                    0.00%  4.4810us         1  4.4810us  4.4810us  4.4810us  cuDeviceGetPCIBusId
                    0.00%  4.0610us        11     369ns     222ns     916ns  cudaDeviceGetAttribute
                    0.00%  1.6310us         4     407ns     163ns     978ns  cuDeviceGetCount
                    0.00%  1.3070us         1  1.3070us  1.3070us  1.3070us  cuInit
                    0.00%  1.0780us         1  1.0780us  1.0780us  1.0780us  cudaGetDevice
                    0.00%     722ns         3     240ns     131ns     418ns  cuDeviceGet
                    0.00%     514ns         2     257ns     252ns     262ns  cuDeviceGetUuid
                    0.00%     504ns         1     504ns     504ns     504ns  cuDriverGetVersion
                    0.00%     203ns         1     203ns     203ns     203ns  cudaGetLastError
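
Note that the raw device-to-host copy in the "GPU activities" section looks perfectly reasonable for this link: C is m x n = 5000 x 7000 doubles, and moving it in 58.455 ms works out to roughly 4.8 GB/s. A minimal arithmetic check (nothing GPU-specific, just the numbers from the profile above):

```fortran
PROGRAM bw_check
   IMPLICIT NONE
   INTEGER, PARAMETER :: dp = SELECTED_REAL_KIND(14)
   REAL(dp) :: bytes, gbs
   bytes = 5000._dp * 7000._dp * 8._dp      ! C is m x n double-precision values
   gbs   = bytes / 58.455e-3_dp / 1.e9_dp   ! DtoH time taken from the profile
   PRINT '(F5.2," GB/s")', gbs              ! prints " 4.79 GB/s"
END PROGRAM bw_check
```

So the 9.7 s attributed to cudaMemcpy in the "API calls" table cannot be the transfer itself.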

On my system, which has a Quadro P1000 on a PCIe x16 link, the program prints the following timings:

 ================        5000 x        6000 x        7000 ==================
 -------------------GPU DGEMM---------------------
           1 done.
  The time spent on GPU DGEMM was  9.588 seconds.

  The time spent on MATMUL was 16.606 seconds.

 -------------------CPU DGEMM---------------------
  The time spent on CPU DGEMM was  4.676 seconds.

GCC 8.4 and CUDA 10.2.89 were used to build this program.

An interesting note: if you increase the "rounds" parameter to something higher like 5 or 10, the cublas_dgemm calls return almost instantly, but the final cublas_Get_Matrix then takes roughly as long as if it had to wait for all the rounds of DGEMM to finish.
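
My suspicion is that the kernel launches are asynchronous and the copy simply waits for them, so the kernel time gets billed to cudaMemcpy. One way to test that would be to force completion before stopping the timer. Below is a sketch of how I would rewrite the timed section of the program above; the bind(c) interface to the CUDA runtime's cudaDeviceSynchronize is my own addition (it is not part of the CUBLAS Fortran wrappers, and it assumes the program is linked against libcudart):

```fortran
! Fragment replacing the timed GPU section of example_dgemm above.
! Declarations go in the specification part of the program:
INTERFACE
   INTEGER(c_int) FUNCTION cudaDeviceSynchronize() BIND(c, name="cudaDeviceSynchronize")
      USE iso_c_binding, ONLY: c_int
   END FUNCTION cudaDeviceSynchronize
END INTERFACE
INTEGER(c_int) :: istat

! Time the kernels alone, with an explicit synchronization:
CALL t%tic("GPU DGEMM only")
DO j = 1, rounds
   CALL cublas_dGEMM('n','n', m,n,k,1._dp,devPtrA,m,devPtrB,k,1._dp,devPtrC,m)
END DO
istat = cudaDeviceSynchronize()   ! block until all queued kernels finish
CALL t%toc

! Now the copy should be billed only for the actual transfer:
CALL t%tic("Get_Matrix only")
CALL cublas_Get_Matrix(m,n,size_of_real,devPtrC,m,c,m)
CALL t%toc
```

If the suspicion is right, the first timer should absorb the ~9.5 s and the second should drop to the ~58 ms the profiler reports for the DtoH transfer.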

A side question: do programs like TensorFlow that use CUDA keep all of the calculations on the GPU? I mean, once the layers are set up and transferred to the GPU, do the data never leave it until the final results are ready? If that were the case, there would be little point in using Fortran, since one would effectively have to write a complete kernel anyway. I was hoping to use CUDA only for the matrix calculations and do the rest in Fortran, but given this cudaMemcpy problem it does not seem worth doing. I would be grateful if someone could give me some insight on this.


Please share your thoughts; it would be great to hear others' advice before I completely change my approach and hardware selection.