Is it possible for an algorithm to achieve better performance than GEMM peak performance?

I have implemented a Gaussian elimination algorithm, and I am seeing peak performance that is roughly 5% higher than GEMM peak performance and about 10% below theoretical peak performance.

Theoretical peak: 8.2 TFLOPS
GEMM peak: 7 TFLOPS
My algorithm: 7.3 TFLOPS
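For context, a quick sanity check of those ratios (using only the three figures above):

```python
# Sanity check of the reported ratios (figures taken from the post above).
theoretical_peak = 8.2  # TFLOPS
gemm_peak = 7.0         # TFLOPS
my_algorithm = 7.3      # TFLOPS

# Fraction of theoretical peak achieved by my algorithm
frac_of_peak = my_algorithm / theoretical_peak      # ~89%
# Advantage over GEMM peak
vs_gemm = (my_algorithm - gemm_peak) / gemm_peak    # ~+4.3%

print(f"{frac_of_peak:.1%} of theoretical peak, {vs_gemm:+.1%} vs GEMM")
```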

Could this happen, or is it strange?

As you know, inside Gaussian elimination GEMM accounts for more than 90% of the execution time, and it is invoked not just once but many times, across many multiplications.

“Beating the vendor GEMM” is a game that is at least three decades old. Keep in mind that “GEMM” can comprise dozens of implementations under the hood, based on matrix sizes, matrix aspect ratios, matrix transpose modes, operand precision, processor architecture, etc., etc.

Usually it is possible for a determined researcher/programmer to beat the performance of some of the numerous flavors of GEMM involved.

The two other sub-claims (90% efficiency relative to theoretical peak, and GE achieving higher throughput than GEMM) raise my level of skepticism. That is (based on my own experience) an extraordinarily high percentage of theoretical peak, though it is not clear which GPU is being used. Eight TFLOPS suggests some middle-of-the-road GPU.

Because GE typically uses GEMM, and GEMM is typically the highest-performing part, with the remaining parts of GE slowing down the overall computation, it seems odd that GE would achieve higher computational throughput than GEMM itself.
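That intuition can be made concrete: overall throughput is total FLOPs divided by total time, i.e. a time-weighted average of the per-phase throughputs, which can never exceed the fastest phase. A minimal sketch, with made-up phase numbers purely for illustration:

```python
# Overall throughput = total FLOPs / total time, a time-weighted blend
# of per-phase throughputs. Phase numbers below are illustrative only.
phases = [
    # (work in TFLOP, throughput in TFLOPS)
    (9.0, 7.0),   # GEMM phase running at GEMM peak
    (1.0, 1.0),   # panel factorization / pivoting, much slower
]

total_flops = sum(work for work, _ in phases)
total_time = sum(work / rate for work, rate in phases)  # TFLOP / TFLOPS = seconds
overall = total_flops / total_time

# The blend cannot exceed the fastest phase:
assert overall <= max(rate for _, rate in phases)
print(f"overall throughput: {overall:.3f} TFLOPS")
```

With these numbers the blend lands well below the 7 TFLOPS GEMM rate, which is why a GE throughput above the GEMM figure demands a closer look at how both numbers were measured.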

So I would say: Extraordinary claims require extraordinary evidence. Before you send off your manuscript for publication, triple-check all timing measurements and FLOPS computations, and make sure the code is functioning correctly (if results are allowed to be "wrong", we can make code arbitrarily fast). Is the number of floating-point operations estimated based on operator counts in the code, determined via a formula (like those in the LAPACK manual), or reported by the CUDA profiler?
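That last question matters because the effective TFLOPS figure depends directly on which operation count you divide by. A hedged sketch, using the commonly cited leading-order LU-factorization count of 2n^3/3 (the exact polynomial is in the LAPACK documentation); the matrix size and elapsed time below are hypothetical example values, not measurements:

```python
# How the "effective TFLOPS" figure depends on the assumed FLOP count.
# The 2n^3/3 leading term for LU factorization is the standard convention;
# n and elapsed_s are made-up example values, not measurements.
n = 32768            # matrix dimension (hypothetical)
elapsed_s = 3.2      # wall-clock seconds (hypothetical)

flops_lu = (2.0 / 3.0) * n**3        # leading-order LU operation count
tflops = flops_lu / elapsed_s / 1e12

print(f"effective rate: {tflops:.2f} TFLOPS")
```

If your code performs extra multiplications beyond the formula count (as the question suggests), dividing the formula count by the measured time understates the hardware rate, while counting every executed operation can overstate the "useful" rate; the two conventions can easily differ by the few percent separating 7.0 from 7.3 TFLOPS.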