Comparing DGEMM: Intel MKL vs. CUBLAS

I need to compare the speed of matrix multiplications between a CPU and an NVIDIA graphics card. The CPU is an Intel(R) Xeon(R) CPU E5630 @ 2.53GHz, and the graphics card is a Quadro FX 4800. The intent is to significantly speed up our code base, whose main bottleneck is matrix multiplication (taking up to 99% of the runtime). The machine is the same in both cases; in one case I run dgemm provided by ifort/MKL/BLAS on the host, and in the other I run the CUBLAS dgemm provided by pgfortran/CUDA/CUBLAS on the device.

For the host-based matrix multiplication I employ the basic BLAS routine, using the Intel Fortran compiler ifort 11.1:

call dgemm ('N','N',m,n,k,alpha,a,m,b,k,beta,c,m)

I compile my code with aggressive optimizations (as I understand the Intel user guide, this makes MKL run threaded and in parallel, which is how the library is intended to be used):

ifort -align all -pad -O3 -parallel -openmp -assume cc_omp -threads -march=core2 -mtune=core2 -xT -pc 64 -L/opt/intel/mkl/lib/em64t -lmkl_em64t -lguide

For the device-based matrix multiplication, I call the CUBLAS routine, using CUDA version 3.0; I have the following interface:

module cublas
 interface cuda_gemm
  subroutine cuda_dgemm(cta, ctb, m, n, k,&
   alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasDgemm')
   use iso_c_binding
   character(1,c_char),value :: cta, ctb
   integer(c_int),value :: m,n,k,lda,ldb,ldc
   real(c_double),value :: alpha,beta
   real(c_double), device, dimension(lda,*) :: A
   real(c_double), device, dimension(ldb,*) :: B
   real(c_double), device, dimension(ldc,*) :: C
  end subroutine cuda_dgemm
 end interface
end module cublas

And I call this via:

call cuda_gemm ('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)

And the code is compiled like this:

 pgfortran -Mcuda -L/opt/cuda/cuda/lib64 -lcublas
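
(Not shown above: in CUDA Fortran the device arrays a_d, b_d, and c_d need the device attribute. A minimal sketch of the declarations, assuming double-precision allocatables; the kind is an assumption on my part:)

use cublas   ! the interface module above
real(8), allocatable :: a(:,:), b(:,:), c(:,:)                 ! host arrays
real(8), device, allocatable :: a_d(:,:), b_d(:,:), c_d(:,:)   ! device arrays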

The matrices I am multiplying are simple NxN matrices (number of rows = number of columns), all filled with 1.0 (real, double precision). The test loop looks like the following (it is essentially the same for both cases, except that for the CUBLAS dgemm I of course copy all arrays to the device first):

do n=100,5000,100
 m=n
 k=n
 ! array allocation host
 allocate (a(m,k),b(k,n),c(m,n),stat=alloc)
 ! array allocation device
 allocate (a_d(m,k),b_d(k,n),c_d(m,n),stat=alloc)
 ! fill arrays:
 a=1
 b=1
 c=0
 ! copying arrays from host to device; CUBLAS test only:
 a_d=a
 b_d=b
 c_d=c
 !....
 ! copying arrays back from device to host
 !....
 c=c_d
 ! deallocating all arrays
 deallocate (a,b,c)
 ! only for device
 deallocate (a_d,b_d,c_d)
end do
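
The "!...." sections in the loop are where the actual call and the timing happen; for the CUBLAS case it is roughly the following (a simplified sketch, not my exact code; alpha and beta are assumed to be 1.0 and 0.0, and t0, t1, rate are declared before the loop):

! sketch of the CUBLAS test; the MKL test is analogous, timing only the dgemm call
call system_clock(t0, rate)
a_d = a                                                         ! copy host -> device
b_d = b
c_d = c
call cuda_gemm ('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)     ! multiply on the device
c = c_d                                                         ! copy result device -> host
call system_clock(t1)
write(*,*) n, real(t1-t0,8)/real(rate,8)                        ! wall-clock time in seconds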

I measure the real (wall-clock) time it takes for the MKL/BLAS run to complete, and for the CUBLAS run to complete (including copying the arrays from host to device and back). I am really disappointed with the results: the CUBLAS routine does not provide the speed-up I hoped to achieve. See the figure below (the x axis shows the size of N, and the y axis the wall-clock time in seconds):

For one, I cannot really explain the outliers, i.e. why on some occasions the CUBLAS dgemm run is significantly faster than anything else (considering that I deallocate and reallocate the arrays on every iteration). Any ideas?

I also wonder whether there is a way to quantify the time it takes to copy arrays between host and device, as this seems to be a major bottleneck for large arrays.

Also, is there perhaps some way to optimize my pgfortran/CUDA code and its compilation?

First, you can indeed time the data transfers and probably should. I usually do this with the CUDA API timers:

use cudafor
type(cudaEvent) :: dataxfer_start, dataxfer_end
real :: dataxfer_time, data_time
integer :: istat
...
istat = cudaEventCreate(dataxfer_start)
istat = cudaEventCreate(dataxfer_end)
...
! Copy arrays from host to device
istat = cudaEventRecord(dataxfer_start,0)
<do the copies>
istat = cudaEventRecord(dataxfer_end,0)
istat = cudaThreadSynchronize()   ! make sure the transfers (and the end event) have completed
istat = cudaEventElapsedTime(dataxfer_time,dataxfer_start,dataxfer_end)   ! elapsed time in ms
data_time = dataxfer_time

I then accumulate data_time over all the dataxfer_* sections. (Many of the “super” results I’ve seen with CUDA are speedups quoted excluding data transfer, since for large matrices that time is significant. But not always: for big enough problems, the data transfer cost is well worth the speedup on the GPU.)
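
(In code, that accumulation is just something along these lines, with data_time zeroed before the first transfer:)

data_time = 0.0
...
! after each timed dataxfer_* section:
data_time = data_time + dataxfer_time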

As for performance, I haven’t used CUBLAS much outside of experimenting, but there are a couple of reasons you probably shouldn’t expect miraculous results. First, double-precision math on a non-Fermi board, like your FX 4800, runs at best at 1/8 the speed of single-precision math. I imagine if you used SGEMM rather than DGEMM you’d see much better speedups. Second, I have to guess that ifort+MKL on a Westmere has a mighty fine DGEMM implementation, as it is one of the key BLAS routines. Run on 4 cores (so, perhaps 8 threads)… I can see it giving an FX 4800 a run for its money. (In fact, many CUDA DGEMM (non-Fermi) implementations I’ve seen in papers do joint GPU+CPU execution: load the GPU with its section of the matrices, call the kernel, pass control immediately back to the CPU, and then do work on the CPU.)

Of course, if you need double precision math, then you’d be better off using a Fermi board as they have 4x the DP units compared to a non-Fermi. That’s one reason Fermis were so anticipated by many scientists.

I was just thinking: can I time the data transfers without using the CUDA API?

Right now I am timing parts of my program by measuring the wall-clock time (i.e. via calls to system_clock).

Can I be sure that after assignments such as the following two, the transfer has actually completed before the next statement executes (such as the second call to system_clock)?

a_d=a ! copy from host to device
c=c_d ! copy from device to host
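
That is, I am currently doing something like this (a simplified sketch of the system_clock timing, not my exact code):

integer :: t0, t1, rate
real(8) :: seconds

call system_clock(t0, rate)
a_d = a                              ! copy from host to device
call system_clock(t1)
seconds = real(t1-t0,8)/real(rate,8) ! but is the transfer guaranteed to be finished here?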

Hi mSSM,

I’d be interested in seeing how you’re timing your kernel call. Kernel calls are asynchronous so your host code will continue executing until it reaches more device code, such as a data transfer or a call to cudaThreadSynchronize.

As Matt shows, it’s best to use CUDA Events for timers as well as calls to cudaThreadSynchronize to ensure that all threads have completed before getting the time. I posted a more complete example in another post (Do loop inside kernel) that you might find helpful.
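
For instance, wrapping the CUBLAS call itself in event timers would look roughly like this (a sketch along the lines of Matt's snippet above; the variable names are just illustrative):

use cudafor
type(cudaEvent) :: gemm_start, gemm_end
real :: gemm_time
integer :: istat

istat = cudaEventCreate(gemm_start)
istat = cudaEventCreate(gemm_end)

istat = cudaEventRecord(gemm_start,0)
call cuda_gemm ('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)
istat = cudaEventRecord(gemm_end,0)
istat = cudaThreadSynchronize()        ! wait for the call to finish before reading the timer
istat = cudaEventElapsedTime(gemm_time,gemm_start,gemm_end)   ! elapsed time in ms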

  • Mat