I need to compare the speed of matrix multiplications between a CPU and an NVIDIA graphics card.

The CPU is an Intel(R) Xeon(R) CPU E5630 @ 2.53GHz, and the graphics card is a Quadro FX 4800. The intent is to significantly speed up our code base, whose main bottleneck is matrix multiplication (taking up to 99% of the runtime). The machine is the same in both cases; the only difference is that in one case I run dgemm provided by ifort/mkl/blas on the host, and in the other I run cublas_dgemm provided by pgfortran/cuda/cublas on the device.

For the host-based matrix multiplication I employ the basic BLAS routine, using the Intel Fortran compiler ifort 11.1:

```
call dgemm ('N','N',m,n,k,alpha,a,m,b,k,beta,c,m)
```

I compile my code with heavy optimization (in particular, Intel MKL runs threaded and in parallel, which, as I understood the Intel user guide, is how the library is intended to be used):

```
ifort -align all -pad -O3 -parallel -openmp -assume cc_omp -threads -march=core2 -mtune=core2 -xT -pc 64 -L/opt/intel/mkl/lib/em64t -lmkl_em64t -lguide
```

For the device-based matrix multiplication, I call the CUBLAS routine, using CUDA version 3.0; I have the following interface:

```
module cublas
  interface cuda_gemm
    subroutine cuda_dgemm(cta, ctb, m, n, k, &
        alpha, A, lda, B, ldb, beta, C, ldc) bind(C,name='cublasDgemm')
      use iso_c_binding
      character(1,c_char), value :: cta, ctb
      integer(c_int), value :: m, n, k, lda, ldb, ldc
      real(c_double), value :: alpha, beta
      real(c_double), device, dimension(lda,*) :: A
      real(c_double), device, dimension(ldb,*) :: B
      real(c_double), device, dimension(ldc,*) :: C
    end subroutine cuda_dgemm
  end interface
end module cublas
```

And I call this via:

```
call cuda_gemm ('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)
```

And the code is compiled like this:

```
pgfortran -Mcuda -L/opt/cuda/cuda/lib64 -lcublas
```

The matrices I am multiplying are simple NxN matrices (# rows = # columns), all filled with 1.0 (real, double precision). The driver loop looks like the following; the host and device versions are essentially identical, except that for the CUBLAS dgemm I of course copy all arrays to the device first:

```
do n=100,5000,100
  m=n
  k=n
  ! array allocation on the host
  allocate (a(m,k),b(k,n),c(m,n),stat=alloc)
  ! array allocation on the device
  allocate (a_d(m,k),b_d(k,n),c_d(m,n),stat=alloc)
  ! fill the arrays:
  a=1
  b=1
  c=0
  ! copy arrays from host to device; CUBLAS test only:
  a_d=a
  b_d=b
  c_d=c
  !....
  ! copy the result back from device to host; CUBLAS test only:
  !....
  c=c_d
  ! deallocate host arrays
  deallocate (a,b,c)
  ! deallocate device arrays; CUBLAS test only
  deallocate (a_d,b_d,c_d)
end do
```

I measure the wall-clock time for a complete MKL/BLAS run and for a complete CUBLAS run (including copying the arrays from host to device and back). I am really disappointed with the results: the CUBLAS routine does not provide the speed-up I hoped for. See the figure below (the x axis shows the size of N, the y axis the wall-clock time in seconds):
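The timing itself is nothing sophisticated; schematically it looks like this (a minimal sketch using the standard `system_clock` intrinsic; the commented line stands in for the dgemm/cuda_gemm section being measured):

```
integer :: count_start, count_end, count_rate
real(8) :: seconds

call system_clock(count_rate=count_rate)
call system_clock(count_start)
! ... dgemm / cuda_gemm call, including the device copies ...
call system_clock(count_end)
! convert clock ticks to seconds
seconds = real(count_end - count_start, 8) / real(count_rate, 8)
```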

For one, I cannot really explain the outliers, i.e. why on some occasions the CUBLAS dgemm run is significantly faster than anything else (considering that I deallocate and reallocate the arrays on every iteration). Any ideas?

I also wonder whether there is a way to quantify the time it takes to copy the arrays between host and device, as this seems to be a major bottleneck for large arrays.
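From what I have read in the CUDA documentation, CUDA events look like the candidate mechanism for this; a sketch of what I have in mind (assuming pgfortran's cudafor module exposes the event API as below; I have not verified this against CUDA 3.0):

```
use cudafor
type(cudaEvent) :: ev_start, ev_stop
real :: ms
integer :: istat

istat = cudaEventCreate(ev_start)
istat = cudaEventCreate(ev_stop)
istat = cudaEventRecord(ev_start, 0)
! the host-to-device copies being timed
a_d = a
b_d = b
istat = cudaEventRecord(ev_stop, 0)
istat = cudaEventSynchronize(ev_stop)
! elapsed time between the two events, in milliseconds
istat = cudaEventElapsedTime(ms, ev_start, ev_stop)
istat = cudaEventDestroy(ev_start)
istat = cudaEventDestroy(ev_stop)
```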

Also, is there maybe some way to optimize my pgfortran/CUDA code, and the way it is compiled?
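For instance, would declaring the host arrays with CUDA Fortran's pinned attribute speed up the transfers? Something like this (just a sketch; I have not tried whether my pgfortran/CUDA 3.0 setup supports it):

```
! page-locked host memory should let the copies to/from
! the device skip an extra staging buffer
real(8), allocatable, dimension(:,:), pinned :: a, b, c
allocate (a(m,k), b(k,n), c(m,n), stat=alloc)
```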