A faster implementation gets lower compute and bandwidth throughput, why?

cuda_new_bird · August 6, 2023, 6:32am

I’m implementing a GEMM kernel. After increasing the tiling size, I compared the two implementations.

The faster one gets lower compute and bandwidth throughput, but why is that?

I would expect the faster one to get higher compute throughput. Is that a bug?

Also, how can the Speed of Light section guide our optimization?
It seems its values don’t always get higher as the kernel is optimized.

felix_dt · August 7, 2023, 8:54am

The speed of light section shows a collection of high-level metrics, including throughput values for the compute and memory sub-systems of the GPU. In the respective breakdown tables in this section for compute and memory, you can see which of the underlying units is causing the measured utilization.

It is correct that these throughput metrics don’t necessarily increase or decrease when you optimize a kernel. They don’t indicate how well optimized your kernel is, but rather how close it pushes the HW resources to their limits (the “speed of light”). A kernel may be utilizing e.g. a compute pipeline close to its maximum and still be algorithmically inefficient, in which case you would likely not call it an optimized kernel. Since the tool has no insight into what the kernel is actually intended to do, it can only tell you how much each HW resource is used.
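One way to see how a faster kernel can report lower memory throughput is a rough traffic model for a tiled GEMM. The sketch below is only an illustration under stated assumptions: the function `tiled_gemm_model`, the matrix sizes, and both runtimes are hypothetical, and the model counts global-memory bytes requested by a simple square-tile scheme rather than anything Nsight Compute actually measures. With a larger tile, each output tile reuses its loaded data more, so the kernel moves fewer bytes in total. If the runtime shrinks by a smaller factor than the traffic does, bytes per second go down even though the kernel finishes sooner.

```python
def tiled_gemm_model(m, n, k, tile, runtime_s, bytes_per_elem=4):
    """Rough model of C[m,n] = A[m,k] @ B[k,n] with square output tiles.

    Hypothetical helper for illustration only. Assumes each tile x tile
    output tile streams a (tile x k) slab of A and a (k x tile) slab of B
    from global memory, and that C is written exactly once.
    """
    if m % tile or n % tile:
        raise ValueError("m and n must be multiples of tile in this sketch")
    num_tiles = (m // tile) * (n // tile)
    loaded_elems = num_tiles * 2 * tile * k   # A slab + B slab per output tile
    stored_elems = m * n                      # one store per element of C
    global_bytes = (loaded_elems + stored_elems) * bytes_per_elem
    useful_flops = 2 * m * n * k              # one multiply + one add per term
    return {
        "runtime_s": runtime_s,
        "global_bytes": global_bytes,
        "bandwidth_gb_s": global_bytes / runtime_s / 1e9,
        "useful_gflop_s": useful_flops / runtime_s / 1e9,
    }


# Assumed (made-up) runtimes: the larger tile is faster, but not 2x faster.
small = tiled_gemm_model(4096, 4096, 4096, tile=32, runtime_s=0.040)
large = tiled_gemm_model(4096, 4096, 4096, tile=64, runtime_s=0.030)

for name, r in (("tile=32", small), ("tile=64", large)):
    print(f"{name}: {r['runtime_s'] * 1e3:.0f} ms, "
          f"{r['bandwidth_gb_s']:.0f} GB/s, "
          f"{r['useful_gflop_s']:.0f} useful GFLOP/s")
```

In this model the `tile=64` variant requests roughly half the bytes of the `tile=32` variant, so its bandwidth figure drops while its useful FLOP rate rises. A lower memory throughput value is therefore compatible with a faster, better kernel: it did less work on that resource rather than using it less well.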

The expectation is to follow the guidance given in each section, starting from the top. By collecting the “full” set of sections, you will get a very comprehensive analysis of your kernel’s performance. Follow the links in the rule guidance to understand what your kernel’s limiters and bottlenecks are and what suggestions are made to resolve these.

cuda_new_bird · August 8, 2023, 8:11am

That makes it much clearer. Thank you!
