I need to compare the speed of matrix multiplications between a CPU and an nVidia graphics card.
The CPU is an Intel(R) Xeon(R) CPU E5630 @ 2.53GHz, and the graphics card is a Quadro FX 4800. The intend was to significantly speed up our code base, where the main bottle neck are matrix multiplications (taking up to 99% of the runtime). The machine is the same in both cases, just that in one case I run dgemm provided by ifort/mkl/blas on the host, and in the other I run cublas_dgemm provided by pgfortran/cuda/cublas on the device.
For the host-based matrix multiplication I employ the basic BLAS routine, using the Intel Fortran compiler ifort 11.1:
call dgemm ('N','N',m,n,k,alpha,a,m,b,k,beta,c,m)
I compile my code with lots of optimizations (essentially, Intel MKL is being run threaded and in parallel, as I understood the Intel user guide, as this is the way the library is defined):
ifort -align all -pad -O3 -parallel -openmp -assume cc_omp -threads -march=core2 -mtune=core2 -xT -pc 64 -L/opt/intel/mkl/lib/em64t -lmkl_em64t -lguide
For the device-based matrix multiplication, I call the CUBLAS routine, using CUDA version 3.0; I have the following interface:
module cublas interface cuda_gemm subroutine cuda_dgemm(cta, ctb, m, n, k,& alpha, A, lda, B, ldb, beta, c, ldc) bind(C,name='cublasDgemm') use iso_c_binding character(1,c_char),value :: cta, ctb integer(c_int),value :: m,n,k,lda,ldb,ldc real(c_double),value :: alpha,beta real(c_double), device, dimension(lda,*) :: A real(c_double), device, dimension(ldb,*) :: B real(c_double), device, dimension(ldc,*) :: C end subroutine cuda_dgemm end interface end module cublas
And I call this via:
call cuda_gemm ('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)
And the code is compiled like this:
pgfortran -Mcuda -L/opt/cuda/cuda/lib64 -lcublas
The matrices I am multiplying are simple NxN matrices (# rows = # columns), all filled with 1.0 (real, double precision); it looks like the following (basically, both things look the same, just that for CUBLAS dgemm I copy all the arrays to the device first, of course):
do n=100,5000,100 m=n k=n ! array allocation host allocate (a(m,k),b(k,n),c(m,n),stat=alloc) ! array allocation device allocate (a_d(m,k),b_d(k,n),c_d(m,n),stat=alloc) ! fill arrays: a=1 b=1 c=0 ! copying arrays from host to device; CUBLAS test only: a_d=a b_d=b c_d=c !.... ! copying arrays back from device to host !.... c=c_d ! deallocating all arrays deallocate (a,b,c) ! only for device deallocate (a_d,b_d,c_d) end do
I measure the real time it takes for a MKL/BLAS run to be completed, and for the CUBLAS run to be completed (including copying arrays from host to device, and back). I am really disappointed with the results. The CUBLAS routine does not provide a speed up, which I hoped to achieve. See the figure below (x axis shows the the size of N, and y the real time it takes in seconds):
For one, I cannot really explain the outliers, i.e. why on some occasions the CUBLAS dgemm run is significantly faster than anything else (considering that I deallocate and reallocate the arrays each single time). Any ideas?
I also wonder if there is a way to quantify the time it takes to copy arrays between host and device (as this seems to be a major bottleneck for large arrays)?
Also, is there maybe some way to optimze my pgfortran/CUDA code and the code compilation, respectively?