1. I am reading the book Programming Massively Parallel Processors by Kirk and Hwu.

I am wondering why the top scientists in the area achieve only 1/3 of peak performance for the matrix multiplication problem.

At the end of the book they achieve 120 GFlops, but the peak performance of the device is 360 GFlops.

2. Does this mean 1/3 is the best we can do for this matrix multiplication problem?
3. For real applications overall, what is the best percentage of peak performance we can expect? I know this depends on the problem; I just want a rule of thumb.

It largely depends on how much effort you are willing to invest: [1][2].

Having ptxas in between the compiler output (or your hand-optimized PTX code) and the instructions that actually run on the machine can be quite painful sometimes.

Hi,
I have done some experiments on matrix multiplication with my students, and I obtained the following performance on a GTX 480 with CUDA 4.1:

Naive implementation without shared memory: 110 GFlops

Implementation with shared memory (as in the programming guide): 230 GFlops

CUBLAS: 840 GFlops

Obviously the matrices should be big enough (> 1024x1024).
An important point is how you count the flops: a matrix multiplication with the standard algorithm (as implemented in versions 1 and 2) requires n^3 multiplications and n^3 additions, so 2n^3 flops.
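To make the flop counting concrete, here is a minimal Python sketch. The helper names (count_ops, gflops) are made up for illustration and are not from the posts above. It counts operations the same way as the post, with the accumulator starting at zero so that each inner iteration is one multiplication and one addition.

```python
def count_ops(n):
    """Count multiplications and additions in the standard triple-loop algorithm."""
    mults = adds = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                mults += 1  # A[i][k] * B[k][j]
                adds += 1   # accumulate into C[i][j], which starts at 0
    return mults, adds


def gflops(n, seconds):
    """Achieved GFlops for an n x n multiplication, using the 2*n^3 flop count."""
    return 2.0 * n**3 / seconds / 1e9


mults, adds = count_ops(8)
print(mults, adds, mults + adds == 2 * 8**3)
```

For example, with a hypothetical kernel time of 0.01 s for n = 1024, gflops(1024, 0.01) comes out at roughly 215 GFlops.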

The peak performance for the GTX 480 on the wiki is 1344 GFlops, but I think it is obtained as 3 flops * number of cores * clock in GHz.

When we assess the performance of matrix multiplication, should we compare against 3 flops * number of cores * clock or against 2 flops * number of cores * clock?

If you assume 3 flops/core, then 840 GFlops is 62 percent of peak performance.

If we use 2 flops/core, then 840 GFlops is about 94 percent of peak performance.

(I saw one paper, "Improving Performance of Matrix Multiplication and FFT on GPU", which claimed that the GTX 280 peak performance is 2 flops * 240 * 1.295 GHz = 622 GFlops instead of the 933 GFlops claimed by NVIDIA.)
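The two GTX 280 figures can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the 240 ALUs and 1.295 GHz quoted from the paper; the function name is made up for illustration. With 1.295 GHz the 3-flop variant gives about 932.4 GFlops, which is close to the 933 GFlops figure.

```python
def gtx280_peak_gflops(flops_per_cycle, alus=240, shader_ghz=1.295):
    """Peak GFlops = flops per cycle * number of ALUs * shader clock in GHz."""
    return flops_per_cycle * alus * shader_ghz


two_flop_peak = gtx280_peak_gflops(2)    # counts one multiply-add per cycle
three_flop_peak = gtx280_peak_gflops(3)  # also counts the extra multiplication

print(f"2 flops/cycle: {two_flop_peak:.1f} GFlops")
print(f"3 flops/cycle: {three_flop_peak:.1f} GFlops")
```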

Can somebody explain more clearly why 2 flops * number of cores * clock is used for peak performance instead of 3 flops?

Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × number of cores × frequency.
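As a quick sanity check of that formula, here is a small Python sketch applied to the GTX 480 measurements quoted earlier in the thread. The specs used (480 cores, 1.401 GHz shader clock) are my assumption and are not stated in the thread; the function names are made up for illustration. Under those assumed specs, 2 FLOP/cycle already gives about 1345 GFlops, which is consistent with the wiki figure.

```python
def peak_gflops(cores, shader_ghz, flops_per_cycle=2):
    """Theoretical peak: FLOP/cycle x number of cores x shader clock (GHz)."""
    return flops_per_cycle * cores * shader_ghz


def fraction_of_peak(measured_gflops, peak):
    """Fraction of the theoretical peak that a measured rate achieves."""
    return measured_gflops / peak


# Assumed GTX 480 specs (not from the thread): 480 cores, 1.401 GHz shader clock.
gtx480_peak = peak_gflops(cores=480, shader_ghz=1.401)

# Measured rates as reported earlier in the thread.
for label, measured in [("naive", 110), ("shared memory", 230), ("CUBLAS", 840)]:
    share = 100 * fraction_of_peak(measured, gtx480_peak)
    print(f"{label}: {share:.1f}% of {gtx480_peak:.0f} GFlops")
```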

Devices of compute capability 1.x could under very special conditions perform an extra multiplication per cycle in the special function units. However, as matrix multiplication has a 1:1 ratio of additions and multiplications, this cannot be exploited for matrix multiplication. So later compute capabilities dropped that feature.
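A toy model in Python may help show why the extra multiplication does not help here. It is a deliberate simplification, not a description of the real hardware scheduler: it assumes each core can issue one multiply-add per cycle, plus one extra multiplication per cycle when the extra unit is counted, and that every addition can be paired with a multiplication. The function name is made up for illustration.

```python
def flops_per_cycle(mults, adds, extra_mul_unit):
    """Best-case flops per cycle per core in the toy model described above."""
    muls_per_cycle = 2 if extra_mul_unit else 1
    # Each cycle retires at most one addition and at most muls_per_cycle multiplications.
    cycles = max(adds, -(-mults // muls_per_cycle))  # integer ceiling division
    return (mults + adds) / cycles


n = 64
# Matrix multiplication: n^3 multiplications and n^3 additions (1:1 ratio).
print(flops_per_cycle(n**3, n**3, extra_mul_unit=True))
# A hypothetical workload with two multiplications per addition.
print(flops_per_cycle(2 * n**3, n**3, extra_mul_unit=True))
```

In this model the 1:1 workload is limited by the additions, so it reaches 2 flops per cycle with or without the extra unit; only the 2:1 workload reaches 3.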

Yes, it’s the shader frequency, not the core frequency.
(another case where Nvidia’s renaming of the ALUs/FPUs to “cores” creates unnecessary confusion)