cublas sgemm/dgemm performance issue on Tesla T10 and GTX 570

Dear all,

I suspect that I am doing something wrong, and that is why I get poor performance results. Please point out where and why.

All I wanted was to write a simple piece of code to test cublasDgemm() on my two available architectures: Tesla T10 and GTX 570.

Officially, sgemm and dgemm on the Tesla T10 reach around 330 GFlops and 70 GFlops, respectively.
web: www.siliconmechanics.com/files/C2050Benchmarks.pdf [page 5]

What I get: 190 GFlops and 41 GFlops

--start--[dim=2048,  Double A,B,X size=100663 KB,  Float A,B,X size=50331 KB ]--
 --float_cublas--[iter=100]
   Init transfering: A,B time=5.80294ms done.
 --end_for--[ time=4516.17ms    iFlop=858993459Kflop done. perf=190204MFLOPS]--
 --double_cublas--[iter=100]
   Init transfering: A,B time=11.6668ms done.
 --end_for--[ time=20718.7ms    iFlop=858993459Kflop done. perf=41459.8MFLOPS]--

Let me enumerate my questions/remarks:

  1. Matrix*matrix multiplication should not be bandwidth-limited, since O(N*N*N) operations have to be computed while only O(N*N) elements have to be copied.
  2. Double-precision performance should be around the theoretical 78 GFlops, which is the case for the weblink above, but not for me.
  3. I think the problem is in my code, maybe in my timing or FLOP measurement... therefore I give you my code below.
  4. Maybe I compile it wrong, or my CUDA version is the issue... so I also supply those.
  5. I get suspiciously close to half of the expected performance; is N*N*N the correct FLOP count for this operation?

g++ -W -Wall -O2 -fno-strict-aliasing -c -o testcublas.o testcublas.cpp
g++ -o testcublas testcublas.o -L/home/csnemes/NVIDIA_GPU_Computing_SDK/C/lib/ -L/home/csnemes/NVIDIA_GPU_Computing_SDK/shared/lib/ -lcudart -lcublas -lcutil_x86_64 -lshrutil_x86_64

~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 270.41.19 Mon May 16
23:32:08 PDT 2011
GCC version: gcc version 4.3.4 (Debian 4.3.4-10)

Device 0: "Tesla T10 Processor"
CUDA Driver Version: 4.0

code: http://digitus.itk.ppke.hu/~nemcs/testcublas.cpp

Sorry,

I am answering my own question: a multiply-add (madd/FMA) counts as 2 FLOPs, not 1.

So the correct code is

iFlop += (N*N*N) * 2;