I saw this 40 GFLOPS figure in some slides, which say a 16x16 tiled implementation can achieve around 40 GFLOPS, compared to roughly 17 GFLOPS for a naive untiled implementation.
My question is: why does even the tiled version achieve only about 1/10 of peak? Has anybody measured the performance of the MMM example in the CUDA SDK?
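For reference, the kind of 16x16 shared-memory tiling those slides describe looks roughly like this. This is my own minimal sketch (the kernel name, TILE, and the assumption that n is a multiple of 16 are mine), not the SDK code itself:

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Minimal 16x16 shared-memory tiled MMM: C = A * B, square n x n
// row-major matrices, n assumed to be a multiple of TILE for simplicity.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March the pair of tiles across the k dimension.
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread stages one element of A and one of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Every element staged here is reused TILE times before being evicted.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * n + col] = acc;
}
```

The 16x reuse of each staged element is where the comp/comm ratio of 16 mentioned below comes from.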
Just in case you aren’t aware, the matrix multiplication routine supplied with CUBLAS gets more than 100 GFLOPS. I believe you can download the source code somewhere, if you want to see a real high-performance routine.
It is most likely more obtuse than the example in the SDK.
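Calling it is straightforward even if the internals are not. A rough sketch against the legacy CUBLAS API of that era (error checking omitted, and the exact calls are from memory, so treat the details as an assumption):

```cuda
#include <cublas.h>   // legacy CUBLAS API

// Rough sketch: C = A * B for n x n column-major matrices via CUBLAS.
void gemm_cublas(const float *A, const float *B, float *C, int n)
{
    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);

    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    // C = 1.0 * A * B + 0.0 * C, no transposes.
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}
```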
I think memory access should not be the problem in the tiled case: with 16x16 tiles the comp/comm ratio is 16, so 80 GB/s of bandwidth should be able to sustain 320 GFLOPS in the ideal case.
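Spelling that estimate out (my assumptions: 4-byte floats, and the 16x tile reuse counted as 16 flops per float fetched from global memory):

```cuda
#include <cstdio>

int main()
{
    // Back-of-envelope bandwidth bound for the 16x16 tiled kernel.
    const double bytes_per_sec   = 80e9;  // quoted 80 GB/s bandwidth
    const double flops_per_float = 16.0;  // comp/comm ratio from the tile
    double bound = bytes_per_sec / 4.0 * flops_per_float;
    printf("bandwidth-limited bound: %.0f GFLOPS\n", bound / 1e9);  // 320
    return 0;
}
```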
Also, MMM is not as simple as it may look on any architecture if you want to achieve high performance. As an exercise, have a try at optimizing it for the CPU yourself and then check out the performance of Intel’s MKL MMM, which will beat your code (probably pretty badly). On the same Core 2 Duo CPU, MKL’s sgemm was about 25% faster than my (non-SSE) implementation of Strassen’s algorithm, which is O(n^2.81), for n = 2048.
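If anyone wants a starting point for that CPU exercise, a plain cache-blocked triple loop is the usual first step. A sketch (the block size and function name are my own picks, and this is nowhere near MKL, just a big step over the naive loop):

```cuda
#include <cstring>

const int BS = 64;  // block size is a tuning guess, not a measured optimum

// Cache-blocked CPU sgemm sketch: C = A * B, n x n row-major,
// n assumed to be a multiple of BS for simplicity.
void sgemm_blocked(const float *A, const float *B, float *C, int n)
{
    std::memset(C, 0, sizeof(float) * n * n);
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                // Multiply one BS x BS block pair; the blocks stay in cache.
                for (int i = ii; i < ii + BS; ++i)
                    for (int k = kk; k < kk + BS; ++k) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + BS; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```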