Matrix multiplication can't achieve peak performance

Dear All:

1. I am reading the book Programming Massively Parallel Processors by Kirk and Hwu.

I am wondering why only 1/3 of peak performance is achieved for the matrix multiplication problem, even by the top scientists in the area.

At the end of the book, they achieve 120 GFlops, but the peak performance of the device is 360 GFlops.

2. Does this mean 1/3 is the best we can do for the matrix multiplication problem?
3. For real applications overall, what is the best percentage of peak performance we can expect? I know this depends on the problem; I just want a rule of thumb.

Any answer will be highly appreciated.


It largely depends on how much effort you are willing to invest: [1] [2].

Having ptxas in between the compiler output (or your hand-optimized PTX code) and the instructions that actually run on the machine can be quite painful sometimes.

  1. I’d recommend sticking with CUBLAS if you need matrix multiplication. NVIDIA improves its performance with every new release.

  2. It’s relatively easy to obtain 80%+ of peak memory throughput. And in most real-world applications, this is the limiting factor.

I have done some experiments on matrix multiplication with my students, and I obtained the following performance on a GTX 480 with CUDA 4.1:

  1. naive implementation without shared memory: 110 GFlops
  2. implementation with shared memory (as in the programming guide): 230 GFlops
  3. CUBLAS: 840 GFlops

Obviously the matrices should be big enough (> 1024x1024).
An important point is how you count the flops: multiplying two n x n matrices with the standard algorithm (as implemented in versions 1 and 2) requires n^3 multiplications and n^3 additions, so 2n^3 flop in total.
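
The flop count above can be turned into a rate with a short sketch like the following. The function name `gemm_gflops` and the 10 ms timing are hypothetical, chosen only for illustration; they are not measurements from this thread.

```python
def gemm_gflops(n, seconds):
    """GFLOP/s for multiplying two n x n matrices with the standard algorithm."""
    flop = 2 * n**3  # n^3 multiplications + n^3 additions
    return flop / seconds / 1e9

# Hypothetical timing: a 1024 x 1024 multiply measured at 10 ms.
rate = gemm_gflops(1024, 0.010)
print(f"{rate:.1f} GFLOP/s")
```

If a benchmark counts only the multiplications (n^3), the reported rate is halved, so it is worth checking which convention a published number uses.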


The peak performance for the GTX 480 on the wiki is 1344 GFlops, but I think that is obtained as 3 flops × number of cores × clock in GHz.

When I think about the performance of matrix multiplication, should we compare against 3 flops × number of cores × clock or against 2 flops × number of cores × clock?

If we count 3 flops per core per cycle, then 840 GFlops is about 62% of peak performance.

If we count 2 flops per core per cycle, then 840 GFlops is nearly 95% of peak performance.

(I saw one paper, "Improving Performance of Matrix Multiplication and FFT on GPU", which claimed that the GTX 280 peak performance is 2 flops × 240 × 1.295 GHz = 622 GFlops instead of the 933 GFlops claimed by NVIDIA.)

Can somebody explain more clearly why we should use 2 flops × number of cores × clock for peak performance instead of 3 flops?


Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × number of cores × frequency.

Devices of compute capability 1.x could under very special conditions perform an extra multiplication per cycle in the special function units. However, as matrix multiplication has a 1:1 ratio of additions and multiplications, this cannot be exploited for matrix multiplication. So later compute capabilities dropped that feature.
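
As a rough sketch of the formula above, using only numbers quoted earlier in this thread (240 cores at 1.295 GHz for the GTX 280, and 840 GFlops measured against the 1344 GFlops peak for the GTX 480). The helper name `peak_gflops` is made up for this example.

```python
def peak_gflops(cores, shader_clock_ghz, flops_per_cycle=2):
    """Theoretical single-precision peak in GFLOP/s."""
    return flops_per_cycle * cores * shader_clock_ghz

# GTX 280 (compute capability 1.x): multiply-add only vs. multiply-add plus extra multiply
gtx280_mad_only = peak_gflops(240, 1.295)
gtx280_mad_plus_mul = peak_gflops(240, 1.295, flops_per_cycle=3)

# Fraction of the quoted GTX 480 peak reached by the CUBLAS measurement above
cublas_fraction = 840 / 1344

print(f"GTX 280, 2 flops/cycle: {gtx280_mad_only:.1f} GFLOP/s")
print(f"GTX 280, 3 flops/cycle: {gtx280_mad_plus_mul:.1f} GFLOP/s")
print(f"CUBLAS on GTX 480: {cublas_fraction:.1%} of peak")
```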

Then for the GTX 580, which frequency should I use? If I use the core frequency of 0.77 GHz,

then I get 2 × 512 × 0.774, which is only about 790 GFlops, but they seem to claim that the GTX 580 has 1580 GFlops.

Where am I wrong? Can you please write down how to calculate the GFlops for the GTX 580 using device-specific numbers?

Yes, it’s the shader frequency, not the core frequency.
(another case where Nvidia’s renaming of the ALUs/FPUs to “cores” creates unnecessary confusion)

Oh, I made a mistake.

2 × 512 × 1.54 will give me the correct result.
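
For anyone following along, here is a small sketch of the two calculations side by side, using the GTX 580 figures quoted in this thread (512 cores, 0.774 GHz core clock, 1.54 GHz shader clock). The variable names are illustrative only.

```python
CORES = 512          # GTX 580 core count, as used above
FLOPS_PER_CYCLE = 2  # one multiply-add per core per cycle

clocks_ghz = {"core clock": 0.774, "shader clock": 1.54}

# Peak GFLOP/s for each choice of clock
results = {label: FLOPS_PER_CYCLE * CORES * ghz for label, ghz in clocks_ghz.items()}

for label, gflops in results.items():
    print(f"{label}: {gflops:.0f} GFLOP/s")
```

Only the shader clock reproduces the advertised figure of roughly 1580 GFlops.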

It is very easy to improve these 230 Gflop/s to 480 Gflop/s - see slide 51 and on.

If you want to get CUBLAS performance, check sgemm_fermi*.cu in magma_1.1.0.tar.gz here: