I suspect that I am doing something wrong, and that this is why I get poor performance results. Please point out where and why.
All I wanted was to write some simple code to test cblas_dgemm() on my two available architectures: Tesla T10 and GTX 570.
Officially, sgemm and dgemm on the Tesla T10 should reach around 330 GFlops and 70 GFlops, respectively.
Source: www.siliconmechanics.com/files/C2050Benchmarks.pdf [page 5]
What I get: 190 GFlops and 41 GFlops.
--start--[dim=2048, Double A,B,X size=100663 KB, Float A,B,X size=50331 KB ]--
--float_cublas--[iter=100]
Init transfering: A,B time=5.80294ms done.
--end_for--[ time=4516.17ms iFlop=858993459Kflop done. perf=190204MFLOPS]--
--double_cublas--[iter=100]
Init transfering: A,B time=11.6668ms done.
--end_for--[ time=20718.7ms iFlop=858993459Kflop done. perf=41459.8MFLOPS]--
Let me enumerate my questions/remarks:
- A matrix*matrix operation should not be bandwidth limited, since O(N*N*N) operations have to be computed while only O(N*N) elements have to be copied.
- Double-precision performance should be close to the theoretical 78 GFlops, which is the case in the linked benchmark but not for me.
- I think the problem is in my code, maybe in my time or flops measurement, so I give my code below.
- Maybe I compile it wrong, or my CUDA version is the problem, so I also supply the build commands and version information.
- Suspiciously, I get around half of the expected performance. Is N*N*N the right flop count?
g++ -W -Wall -O2 -fno-strict-aliasing -c -o testcublas.o testcublas.cpp
g++ -o testcublas testcublas.o -L/home/csnemes/NVIDIA_GPU_Computing_SDK/C/lib/ -L/home/csnemes/NVIDIA_GPU_Computing_SDK/shared/lib/ -lcudart -lcublas -lcutil_x86_64 -lshrutil_x86_64
~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16 23:32:08 PDT 2011
GCC version: gcc version 4.3.4 (Debian 4.3.4-10)
Device 0: "Tesla T10 Processor"
CUDA Driver Version: 4.0