cublas sgemm/dgemm performance issue on Tesla T10 and GTX 570

Dear all,

I suspect that I am doing something wrong, and that is why I get poor performance results. Please point out where and why.

All I wanted was to write a simple piece of code to test cublasDgemm() on my two available architectures: Tesla T10 and GTX 570.

Officially, sgemm and dgemm on the Tesla T10 reach around 330 GFlops and 70 GFlops, respectively.
web: www.siliconmechanics.com/files/C2050Benchmarks.pdf [page 5]

What I get: 190 GFlops and 41 GFlops

--start--[dim=2048,  Double A,B,X size=100663 KB,  Float A,B,X size=50331 KB ]--
 --float_cublas--[iter=100]
   Init transfering: A,B time=5.80294ms done.
 --end_for--[ time=4516.17ms    iFlop=858993459Kflop done. perf=190204MFLOPS]--
 --double_cublas--[iter=100]
   Init transfering: A,B time=11.6668ms done.
 --end_for--[ time=20718.7ms    iFlop=858993459Kflop done. perf=41459.8MFLOPS]--

Let me enumerate my questions/remarks:

  1. Matrix*matrix multiplication should not be bandwidth-limited, since O(N*N*N) operations have to be computed while only O(N*N) elements have to be copied.
  2. Double-precision performance should be around the theoretical 78 GFlops, which is the case for the weblink above, but not for me.
  3. I think the problem is in my code, maybe in my timing or FLOP measurement... therefore I give you my code below.
  4. Maybe I compile it wrong, or my CUDA version is the issue... so I also supply those.
  5. I get suspiciously close to half of the expected performance; is N*N*N the correct FLOP count for this operation?

g++ -W -Wall -O2 -fno-strict-aliasing -c -o testcublas.o testcublas.cpp
g++ -o testcublas testcublas.o -L/home/csnemes/NVIDIA_GPU_Computing_SDK/C/lib/ -L/home/csnemes/NVIDIA_GPU_Computing_SDK/shared/lib/ -lcudart -lcublas -lcutil_x86_64 -lshrutil_x86_64

~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16
23:32:08 PDT 2011
GCC version: gcc version 4.3.4 (Debian 4.3.4-10)

Device 0: "Tesla T10 Processor"
CUDA Driver Version: 4.0

code: http://digitus.itk.ppke.hu/~nemcs/testcublas.cpp

Sorry,

I am answering my own question: a multiply-add (madd/FMA) counts as 2 FLOPs, not 1.

So the correct code is

iFlop += (N*N*N) * 2;