1. I am reading the book Programming Massively Parallel Processors by Kirk and Hwu.
I am wondering why only about 1/3 of peak performance is achieved for the matrix multiplication problem by the top scientists in the area.
At the end of the book, they achieve 120 GFlops, but the peak performance of the device is 360 GFlops.
2. Does this mean 1/3 of peak is the best we can do for this matrix multiplication problem?
3. For real applications overall, what is the best percentage of peak performance we can expect? I know this depends on the problem; I just want some rule-of-thumb ideas.
I have done some experiments with matrix multiplication with my students, and I obtained the following performance on a GTX480 with CUDA 4.1:
naive implementation without shared memory: 110 GFlops
implementation with shared memory (as in the programming guide; a sketch follows below): 230 GFlops
cuBLAS: 840 GFlops
Obviously the matrices should be big enough (> 1024x1024).
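
For reference, here is a minimal sketch of version 2, essentially the tiled shared-memory kernel from the CUDA C Programming Guide. The kernel name, the 16x16 tile size, and the assumption that n is a multiple of the tile size are my choices for brevity, not the exact course code:

```cuda
#define TILE 16

// Tiled matrix multiply C = A * B for n x n row-major matrices.
// Assumes n is a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread stages one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Partial dot product over this tile, entirely from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Each block stages a 16x16 tile of A and of B in shared memory, so every element loaded from global memory is reused 16 times; that reuse is where the roughly 2x speedup over the naive version comes from.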
An important point is how you count the flops: a matrix multiplication with the standard algorithm (as implemented in versions 1 and 2) requires
n^3 multiplications and n^3 additions, so 2n^3 FLOP in total.
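
To make the counting concrete, here is a self-contained timing sketch. The naive kernel (version 1 above) and the harness around it are my own minimal reconstruction, not the exact code we used:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive matrix multiply: one thread per output element, everything
// read straight from global memory.
__global__ void matmul_naive(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];  // one mul + one add per k
        C[row * n + col] = acc;
    }
}

int main()
{
    const int n = 2048;  // big enough to keep the GPU busy
    size_t bytes = (size_t)n * n * sizeof(float);

    // Device buffers left uninitialized: fine for timing, not for
    // checking results.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    matmul_naive<<<grid, block>>>(dA, dB, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // n^3 multiplications + n^3 additions = 2n^3 FLOP; dividing by
    // (ms * 1e6) converts milliseconds and FLOP/s to GFLOP/s in one step.
    double gflops = 2.0 * (double)n * n * n / (ms * 1.0e6);
    printf("%dx%d matmul: %.2f ms, %.1f GFlops\n", n, n, ms, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```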
Peak performance for GPUs of compute capability >= 2.0 is 2 FLOP/cycle × number of cores × frequency.
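For example, if I recall the shader clock correctly, the GTX480 has 480 cores at about 1.4 GHz, so its single-precision peak is roughly 2 × 480 × 1.4 ≈ 1345 GFlops; the cuBLAS result above is therefore a bit over 60% of peak.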
Devices of compute capability 1.x could under very special conditions perform an extra multiplication per cycle in the special function units. However, as matrix multiplication has a 1:1 ratio of additions and multiplications, this cannot be exploited for matrix multiplication. So later compute capabilities dropped that feature.