Why can tiled MMM only achieve around 40 GFLOPS?

MMM = matrix-matrix multiplication.

I saw this 40 GFLOPS figure in a slide deck, which says a 16x16 tiled implementation can achieve around 40 GFLOPS, compared to 17 GFLOPS for a naive untiled implementation.
My question is: why can even the tiled version only achieve about 1/10 of the peak? Has anybody measured the performance of the MMM example in the CUDA SDK?

Thanks

Well, memory access can be a reason. Two things limit your performance:

peak GFLOPS
peak GB/s

If you hit either of them, you cannot go faster, and optimizing your code further is of no use.
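The two limits above can be combined into a single bound: attainable throughput is the smaller of the compute peak and the memory bandwidth multiplied by the flops performed per byte moved. Here is a minimal Python sketch of that idea. The function name `attainable_gflops` is made up for illustration, and the 350 GFLOPS peak is an assumed round number, not a measured or official figure. The 80 GB/s is the bandwidth quoted elsewhere in this thread.

```python
def attainable_gflops(peak_gflops, peak_gb_per_s, flops_per_byte):
    """Upper bound on throughput: limited by compute or by memory traffic."""
    # GFLOPS the memory system can feed at this arithmetic intensity
    memory_bound = peak_gb_per_s * flops_per_byte
    return min(peak_gflops, memory_bound)


# Hypothetical device: 350 GFLOPS peak (assumed), 80 GB/s (figure from this thread).
# A kernel doing 0.25 flops per byte is limited by bandwidth:
print(attainable_gflops(350.0, 80.0, 0.25))
# A kernel doing 8 flops per byte is limited by the compute peak:
print(attainable_gflops(350.0, 80.0, 8.0))
```

Whichever term is smaller tells you which resource to look at first. Other limits (instruction overhead, latency, occupancy) can keep real code below both.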

Just in case you aren’t aware, the matrix multiplication routine supplied in CUBLAS gets more than 100 GFLOPS. I believe you can download the source code somewhere, if you want to see a real high-performance routine.

It is most likely harder to follow than the example in the SDK.

I think memory access should not be a problem in the tiled case: the compute-to-memory-access ratio is 16 flops per float loaded, so 80 GB/s of bandwidth (20 G floats/s) should be able to sustain 320 GFLOPS in the ideal case.
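The arithmetic behind that estimate can be written out as a short Python sketch. It assumes single-precision (4-byte) elements, that each thread loads one element of A and one of B per tile step, and that writes of the result and any caching are ignored. The function names are hypothetical.

```python
def tiled_flops_per_byte(tile, bytes_per_element=4):
    """Flops per byte of global-memory traffic for a tile x tile blocked MMM.

    Per tile step each thread loads one element of A and one of B from
    global memory, then performs `tile` multiply-adds (2 flops each)
    using operands held on chip.
    """
    flops = 2 * tile
    bytes_loaded = 2 * bytes_per_element
    return flops / bytes_loaded


def bandwidth_bound_gflops(tile, gb_per_s):
    """Throughput ceiling imposed by memory bandwidth alone."""
    return gb_per_s * tiled_flops_per_byte(tile)


# tile = 1 models the untiled kernel: every operand comes from global memory.
print(bandwidth_bound_gflops(1, 80.0))
# 16x16 tiles, as in the implementation discussed here.
print(bandwidth_bound_gflops(16, 80.0))
```

Under these assumptions the untiled bound is 20 GFLOPS, close to the 17 GFLOPS quoted for the naive version, while the 16x16 bound is 320 GFLOPS. So bandwidth alone would not explain a 40 GFLOPS result.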

Thanks, that's helpful…

So it seems it’s hard to get good performance out of G80 with the CUDA high-level API, and results stay far from peak even for an algorithm as regular as MMM…

Look for the CUDA code for SGEMM posted by volkov; it achieves more than 200 GFLOPS on G80.

Also, MMM is not as simple as it may look on any architecture if you want to achieve high performance. As an exercise, try optimizing it for the CPU yourself and then check out the performance of Intel’s MKL MMM, which will beat your code (probably pretty badly). On the same C2D CPU, MKL’s sgemm was about 25% faster than my (non-SSE) implementation of Strassen’s algorithm, which is roughly O(n^2.81), for n = 2048.
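To put that comparison in perspective, here is a rough Python sketch of how many scalar multiplications a fully recursive Strassen needs relative to the classical n^3 algorithm. It counts multiplications only. It ignores Strassen's extra additions and memory traffic, and it ignores the fact that practical implementations stop recursing at some block size. The function names are made up for illustration.

```python
import math


def strassen_exponent():
    """Exponent of Strassen's algorithm: 7 sub-multiplications per halving of n."""
    return math.log2(7)


def multiplication_ratio(n):
    """Strassen multiplications divided by classical n**3, for n a power of two."""
    levels = n.bit_length() - 1
    if n < 1 or 2 ** levels != n:
        raise ValueError("n must be a power of two")
    return 7 ** levels / n ** 3


print(strassen_exponent())
print(multiplication_ratio(2048))
```

For n = 2048 the ratio is (7/8)^11, so Strassen needs well under a third of the multiplications. A tuned classical sgemm still coming out ahead shows how much the implementation matters.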

Paulius

See my recent topic ‘strange FLOP counts’ on this forum.