A faster implementation gets lower compute and bandwidth throughput, why?

I’m implementing a GEMM kernel. After increasing the tiling size, I compared the two implementations.

The faster one gets lower compute and bandwidth throughput, but why is that?

I would expect the faster one to have higher compute throughput. Is that a bug?

Also, how can the speed of light section guide our optimization?
It seems its values don’t always get higher as the kernel is optimized.

The speed of light section shows a collection of high-level metrics, including throughput values for the compute and memory sub-systems of the GPU. In the respective breakdown tables in this section for compute and memory, you can see which of the underlying units is causing the measured utilization.

It is correct that these throughput metrics don’t necessarily increase when you optimize a kernel. They don’t tell you how well optimized your kernel is, but rather how close it comes to the limits of the HW resources (the “speed of light”). A kernel may, for example, be utilizing a compute pipeline close to its maximum and still be algorithmically inefficient, in which case you would likely not call it an optimized kernel. Since the tool has no insight into what the kernel is actually intended to do, it can only tell you how much each HW resource is used.
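To make this concrete, here is a rough sketch of how a larger tile can make a GEMM faster while lowering its measured memory throughput. It is a toy model, not output from the tool: the matrix size, tile sizes, run times, and peak bandwidth are all made-up assumptions, and the function names are hypothetical. The point is only that throughput is work done per unit time relative to the peak, so a kernel that does less work can finish sooner and still show a lower percentage.

```python
# Toy model: all sizes, timings, and the peak bandwidth are hypothetical.

def tiled_gemm_read_bytes(n, tile, bytes_per_elem=4):
    """Approximate global-memory bytes read by an n x n tiled GEMM.

    Each tile x tile output block reads a (tile x n) strip of A and an
    (n x tile) strip of B, so larger tiles reuse data and read less overall.
    Assumes tile divides n evenly.
    """
    num_blocks = (n // tile) ** 2
    elems_per_block = 2 * tile * n
    return num_blocks * elems_per_block * bytes_per_elem


def throughput_pct(work, seconds, peak_per_second):
    """Achieved rate as a percentage of the peak rate."""
    return 100.0 * (work / seconds) / peak_per_second


N = 4096                  # hypothetical matrix dimension
PEAK_BYTES_PER_S = 900e9  # hypothetical peak memory bandwidth

small_bytes = tiled_gemm_read_bytes(N, tile=16)
large_bytes = tiled_gemm_read_bytes(N, tile=64)

small_time_s = 0.050      # hypothetical run time with 16 x 16 tiles
large_time_s = 0.020      # hypothetical run time with 64 x 64 tiles

small_pct = throughput_pct(small_bytes, small_time_s, PEAK_BYTES_PER_S)
large_pct = throughput_pct(large_bytes, large_time_s, PEAK_BYTES_PER_S)

print(f"tile=16: {small_bytes / 1e9:.1f} GB in {small_time_s * 1e3:.0f} ms, "
      f"{small_pct:.1f}% of peak bandwidth")
print(f"tile=64: {large_bytes / 1e9:.1f} GB in {large_time_s * 1e3:.0f} ms, "
      f"{large_pct:.1f}% of peak bandwidth")
```

With these assumed numbers the larger tile reads a quarter of the data and runs 2.5x faster, yet its memory throughput drops from roughly 76% to roughly 48% of peak. The same reasoning would apply to a compute pipeline if the faster version issues fewer instructions for the same result.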

The expectation is to follow the guidance given in each section, starting from the top. By collecting the “full” set of sections, you will get a very comprehensive analysis of your kernel’s performance. Follow the links in the rule guidance to understand what your kernel’s limiters and bottlenecks are and what suggestions are made to resolve these.

That makes it much clearer. Thank you!