I saw this 40 GFLOPS figure in some slides, which say a 16x16 tiled implementation can achieve around 40 GFLOPS, compared to roughly 17 GFLOPS for a naive untiled implementation.
My question is: why does even the tiled version achieve only about 1/10 of peak? Has anybody measured the performance of the MMM example in the CUDA SDK?
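For reference, the kind of 16x16 shared-memory tiling those slides describe looks roughly like this. This is my own minimal sketch (the kernel name, TILE, and the assumption that n is a multiple of 16 are mine), not the SDK code itself:

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Minimal 16x16 shared-memory tiled MMM: C = A * B, square n x n
// row-major matrices, n assumed to be a multiple of TILE for simplicity.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March the pair of tiles across the k dimension.
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread stages one element of A and one of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Every element staged here is reused TILE times before being evicted.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * n + col] = acc;
}
```

The 16x reuse of each staged element is where the comp/comm ratio of 16 mentioned below comes from.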
Just in case you aren’t aware, the matrix multiplication routine supplied with CUBLAS gets more than 100 GFLOPS. I believe you can download the source code somewhere, if you want to see a real high-performance routine.
It is most likely more obtuse than the example in the SDK.
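Calling it is straightforward even if the internals are not. A rough sketch against the legacy CUBLAS API of that era (error checking omitted, and the exact calls are from memory, so treat the details as an assumption):

```cuda
#include <cublas.h>   // legacy CUBLAS API

// Rough sketch: C = A * B for n x n column-major matrices via CUBLAS.
void gemm_cublas(const float *A, const float *B, float *C, int n)
{
    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);

    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    // C = 1.0 * A * B + 0.0 * C, no transposes.
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}
```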
I think memory access should not be the problem in the tiled case: with 16x16 tiles the comp/comm ratio is 16, so 80 GB/s of bandwidth should be able to sustain 320 GFLOPS in the ideal case.
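Spelling that estimate out (my assumptions: 4-byte floats, and the 16x tile reuse counted as 16 flops per float fetched from global memory):

```cuda
#include <cstdio>

int main()
{
    // Back-of-envelope bandwidth bound for the 16x16 tiled kernel.
    const double bytes_per_sec   = 80e9;  // quoted 80 GB/s bandwidth
    const double flops_per_float = 16.0;  // comp/comm ratio from the tile
    double bound = bytes_per_sec / 4.0 * flops_per_float;
    printf("bandwidth-limited bound: %.0f GFLOPS\n", bound / 1e9);  // 320
    return 0;
}
```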
Also, MMM is not as simple as it may look on any architecture if you want to achieve high performance. As an exercise, have a try at optimizing it for the CPU yourself and then check out the performance of Intel’s MKL MMM, which will beat your code (probably pretty badly). On the same Core 2 Duo CPU, MKL’s sgemm was about 25% faster than my (non-SSE) implementation of Strassen’s algorithm, which is O(n^2.81), for n = 2048.
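If anyone wants a starting point for that CPU exercise, a plain cache-blocked triple loop is the usual first step. A sketch (the block size and function name are my own picks, and this is nowhere near MKL, just a big step over the naive loop):

```cuda
#include <cstring>

const int BS = 64;  // block size is a tuning guess, not a measured optimum

// Cache-blocked CPU sgemm sketch: C = A * B, n x n row-major,
// n assumed to be a multiple of BS for simplicity.
void sgemm_blocked(const float *A, const float *B, float *C, int n)
{
    std::memset(C, 0, sizeof(float) * n * n);
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                // Multiply one BS x BS block pair; the blocks stay in cache.
                for (int i = ii; i < ii + BS; ++i)
                    for (int k = kk; k < kk + BS; ++k) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + BS; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```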