MMM = matrix-matrix multiplication.

I saw this 40 GFLOPS figure in a presentation, which said a 16x16 tiled implementation can achieve around 40 GFLOPS, compared to 17 GFLOPS for a naive untiled implementation.
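
To be clear about what I mean by "tiled" versus "naive", here is a rough CPU-side sketch in C of the two loop structures. It is not the CUDA SDK kernel: the function names `mmm_naive` and `mmm_tiled` are made up for illustration, and it assumes square row-major matrices whose size is a multiple of 16. On the GPU each 16x16 tile would be staged in shared memory by a thread block; this only shows the blocking idea.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16  /* tile edge, matching the 16x16 figure quoted above */

/* Naive C = A * B for n x n row-major matrices (hypothetical helper). */
void mmm_naive(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* Tiled C = A * B: work proceeds one TILE x TILE block at a time so that
 * each block of A and B is reused TILE times while it is still "close".
 * Assumption: n is a multiple of TILE. */
void mmm_tiled(const float *A, const float *B, float *C, size_t n) {
    assert(n % TILE == 0);
    for (size_t i = 0; i < n * n; ++i)
        C[i] = 0.0f;
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                /* multiply one tile of A by one tile of B */
                for (size_t i = ii; i < ii + TILE; ++i)
                    for (size_t j = jj; j < jj + TILE; ++j) {
                        float acc = C[i * n + j];
                        for (size_t k = kk; k < kk + TILE; ++k)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = acc;
                    }
}
```

Both versions do 2*n^3 floating-point operations, so a GFLOPS number is just 2*n^3 divided by the measured time in seconds and by 1e9.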

My question is: why does even the tiled version reach only about 1/10 of the peak? Has anybody measured the performance of the MMM example in the CUDA SDK?

Thanks