About GPU peak performance

I am using a T4; its specification says its peak Tensor Core FP16 performance is about 65 TFLOPS.
Now I wrote a demo that multiplies two matrices (matrix A is 4096×4096, matrix B is also 4096×4096) using libcublas. From this I calculate the FP16 FLOPS as follows:
flops = (4096 * 4096 * 4096 * 2)/(cost time)
I run 100 iterations and take the mean cost time.
The test result shows that the T4's real FP16 throughput is about 24 TFLOPS, much less than its peak performance.
I noticed that the T4's base clock is 585 MHz and its boost clock is 1590 MHz. It seems that peak performance is quoted at the boost clock, while real workloads run at the base clock. Is that correct?
If so, what is the meaning of peak performance, and how can I get peak performance?
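As a sketch, the FLOPS arithmetic described above works out like this (the mean time used here is a hypothetical illustration, not a real measurement from the test):

```python
# FLOPS for an N x N x N GEMM: each of the N*N output elements needs
# N multiply-add pairs, i.e. 2*N FLOPs, giving 2*N^3 FLOPs in total.
N = 4096
total_flops = 2 * N**3  # 137,438,953,472 FLOPs per GEMM

# Hypothetical mean time over 100 iterations (example value only;
# chosen so the result lands near the ~24T reported in the question).
mean_time_s = 5.7e-3

achieved = total_flops / mean_time_s
print(f"{achieved / 1e12:.1f} TFLOPS")  # ~24.1 TFLOPS
```

Any real timing should of course come from the benchmark itself, e.g. CUDA events around the cuBLAS call.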

The peak performance of any GPU is generally not achievable. There may be several reasons for this.

In the case of the T4, the performance of ~24 TFLOPS arises from power capping of the GPU. Power capping is a clock-reduction strategy that keeps the GPU within its stated power limit, which for the T4 is about 70 W. This is expected behavior for approximately continuous matrix-multiply operations on a T4. If you want to see something better, run a smaller op just once or twice and measure that. However, you will not get to 65 TFLOPS that way.

Thanks for the reply. If so, what is the use of the peak performance announced in the specification of a GPU product? If I want to evaluate a computing task's performance, I cannot do it based on peak performance, right?

You cannot evaluate throughput based on peak performance, that is correct. Peak performance is generally not achievable.

Then how do I calculate stable performance? Is it sensible to estimate it as (base clock / boost clock) * peak_performance?
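For what it's worth, plugging the clocks quoted earlier in the thread into that proposed ratio gives:

```python
# Proposed estimate from the question: scale peak performance by the
# ratio of base clock to boost clock. Figures are from earlier in the
# thread (T4: 585 MHz base, 1590 MHz boost, ~65 TFLOPS peak FP16).
base_clock_mhz = 585
boost_clock_mhz = 1590
peak_tflops = 65

estimate = (base_clock_mhz / boost_clock_mhz) * peak_tflops
print(f"{estimate:.1f} TFLOPS")  # ~23.9 TFLOPS
```

The result happens to land near the measured ~24T, but as the reply below notes, the actual clock under load depends on power and thermal conditions, so this ratio is not a reliable predictor in general.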

I don’t know what stable performance is.

The best suggestion I have is to benchmark.

In general this is theoretical peak performance, simply based on various machine parameters, for example: [number of execution units] * [number of operations that can be completed per execution unit per clock cycle] * [maximum operating frequency].
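As a concrete instance of that multiplication, the T4's ~65 TFLOPS FP16 figure can be reproduced from its published specifications (the per-core throughput figure here is an assumption based on the Turing Tensor Core design, not something stated in this thread):

```python
# Theoretical peak FP16 Tensor Core throughput of a T4,
# from published specifications (assumed figures, not measured):
tensor_cores = 320           # number of execution units
fma_per_core_per_clock = 64  # FP16 multiply-add pairs per Tensor Core per clock
flops_per_fma = 2            # one multiply plus one add
boost_clock_hz = 1590e6      # maximum operating frequency

peak_flops = tensor_cores * fma_per_core_per_clock * flops_per_fma * boost_clock_hz
print(f"{peak_flops / 1e12:.1f} TFLOPS")  # ~65.1 TFLOPS
```

Note that the maximum operating frequency (boost clock) goes into this product, which is exactly why a power-capped GPU running below boost cannot reach the quoted number.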

In general, theoretical peak performance is not achievable, for any processor or platform, due to any number of limiting factors. These might include execution units that cannot be supplied with data at the required rate in a practical use case (e.g., due to register file read bandwidth), or failure to achieve the maximum operating frequency due to insufficient power supply and/or cooling.

As a rule of thumb, in many practical scenarios the practically achievable peak performance is on the order of 75% to 85% of theoretical peak performance, and that applies to computational throughput, memory throughput, and interconnect throughput. Only benchmarking in the context of a specific use case can reveal the applicable practically achievable peak performance for that use case.
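Applying that rule of thumb to the T4's quoted FP16 peak gives a rough expected range (a sketch only; as discussed above, the power-capped T4 in the original test fell well below this range, which is why benchmarking the actual use case matters):

```python
# Rule-of-thumb range: 75% to 85% of theoretical peak.
theoretical_peak_tflops = 65.0

low = 0.75 * theoretical_peak_tflops   # 48.75 TFLOPS
high = 0.85 * theoretical_peak_tflops  # 55.25 TFLOPS
print(f"expected range: {low:.2f} to {high:.2f} TFLOPS")
```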

Why does technical literature often quote theoretical peak performance? Based on my industry experience this is because marketing folks (including those in technical marketing) glom onto the highest number they come across. Plus theoretical peak performance can often be determined by simply multiplying a few factors.