How to quantify kernel launch overhead using NCU?

Hello
I want to quantify the additional overhead of kernel launches beyond the execution time. The straightforward approach of Elapsed Cycles - SM Active Cycles doesn’t seem accurate, as SM Active Cycles appears to be weighted based on the number of SMs utilized.

  1. My goal is to quantify the launch overhead of kernels, not just the execution time.
  2. From reviewing the forums (https://forums.developer.nvidia.com/t/how-can-i-measure-kernel-launch-overhead-using-ncu/250619/6), I understand that in the GPU timeline, the portion captured by NCU includes launch overhead, while Nsight Systems captures the SM execution part. However, NCU is not designed to directly provide launch overhead information.
  3. SM Active Cycles seems to be calculated as a weighted sum of active cycles across all SMs, based on the number of SMs actually used. If the precise weighting algorithm or formula could be provided, would it be possible to approximate the launch overhead as Elapsed Cycles - weighted SM Active Cycles?

Does anyone have any ideas? We are looking forward to a reply.

Hi, @18511878861

Sorry for the late response.
I have sent your question to our dev team internally to see which tools can meet your requirement.

Meanwhile, can you clarify what kind of overhead you are interested in?

Observing sm__cycles_active{.avg, .sum} is problematic if not all SMs are active for the same number of cycles; in particular, it doesn't handle launching a grid of 1 thread. A more accurate method using these metrics would be

estimated_launch_overhead_in_gpc_cycles = sm__cycles_elapsed.max - sm__cycles_active.max
estimated_launch_overhead_in_ns = gpu__time_duration.sum * estimated_launch_overhead_in_gpc_cycles / sm__cycles_elapsed.max

This estimate includes the following overhead:

  • Cycles from the performance monitor start trigger to the first cycle an SM is active.
  • Cycles from the last warp completing to the stop trigger.

I have not tested this estimate. I would recommend testing on a grid of 1 thread with no parameters and empty kernel code.
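
For reference, a minimal sketch of such a test could look like the following (my own illustration; the null_kernel name and binary are placeholders, not from any existing code):

// Empty kernel with no parameters, launched as a grid of 1 block with 1 thread.
// Profile with something along the lines of:
//   ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max,gpu__time_duration.sum ./null_kernel
// and plug the reported values into the two formulas above.
#include <cuda_runtime.h>

__global__ void null_kernel() {}      // no parameters, empty body

int main()
{
    null_kernel<<<1, 1>>>();          // grid of 1 thread
    cudaDeviceSynchronize();          // wait for the launch to complete before exiting
    return 0;
}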

Hello Greg
Thank you for your help. It seems there are some problems with the results. I tried this method and found negative values in elapsed.max - active.max. The test program uses cuBLAS. The test command is as follows:

ncu --set full --metrics sm__cycles_active.max,sm__cycles_elapsed.max -o m16384_n256_k4_16384 ./CublassTest
(elapsed = sm__cycles_elapsed [cycle], active = sm__cycles_active [cycle])

M      N    K      elapsed.max  active.max  elapsed.max-active.max  active.avg  GridSize  elapsed.max-active.avg
16384  256      4        57411       56039                    1372    53481.19       512                 3929.81
16384  256      8        53557       52814                     743    47006.83       256                 6550.17
16384  256     16        54442       52944                    1498    48871.23       256                 5570.77
16384  256     32        56177       54898                    1279    49562.06       256                 6614.94
16384  256     64        62074       59025                    3049    53571.65       256                 8502.35
16384  256    128        71079       67872                    3207     61744.1       256                  9334.9
16384  256    256        94546       91361                    3185    83413.38       256                11132.62
16384  256    512       144229      141022                    3207    126686.4       256                 17542.6
16384  256   1024       247222      244968                    2254   221383.31       256                25838.69
16384  256   2048       442415      437754                    4661   424985.29       512                17429.71
16384  256   4096       829170      825422                    3748   801429.29       512                27740.71
16384  256   8192      1764811     1768175                   -3364  1591038.42       256               173772.58
16384  256  16384      3484585     3477970                    6615  3128722.02       256               355862.98

m16384_n256_k4_16384.zip (1.1 MB)

I also tried calculating elapsed.max - active.avg; the last column in the table is computed as:
elapsed.max - 48 / min(48, GridSize) * active.avg
(the SM count is 48)
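
To spell the weighting out, here is a small C/CUDA host-code sketch of that calculation (the function name is only for illustration), checked against the first row of the table:

#include <stdio.h>

// elapsed.max - (num_SMs / min(num_SMs, GridSize)) * active.avg
static double weighted_overhead(double elapsed_max, double active_avg,
                                double grid_size, double num_sms)
{
    // Scale the per-SM average up to the SMs that actually received work,
    // then subtract from the elapsed cycles of the longest-running SM.
    double active_sms = (grid_size < num_sms) ? grid_size : num_sms;
    return elapsed_max - (num_sms / active_sms) * active_avg;
}

int main(void)
{
    // First table row: M=16384 N=256 K=4, elapsed.max=57411, active.avg=53481.19, GridSize=512, 48 SMs
    printf("%.2f\n", weighted_overhead(57411.0, 53481.19, 512.0, 48.0));  // prints 3929.81
    return 0;
}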

I also tested different grid sizes with an empty kernel function:


Also, if there are such irregular points in the calculated overhead, how can we explain them?

ISSUE 1 - Negative Number

The command you specified collected a full report (--set full). The collection of sm__cycles_elapsed, sm__cycles_active, and gpu__time_duration happened in different passes.

The information you are trying to obtain is a form of micro-benchmarking. This requires isolation of the system. As a first step, make sure the three metrics are collected in the same pass by disabling all sections and collecting only the three metrics:

--metrics sm__cycles_elapsed.max,sm__cycles_active.max,gpu__time_duration.sum

For improved stability, disable any other applications using the GPU. This may not be possible in your setup.

ISSUE 2 - Spike

The type of data you are trying to collect requires isolation of work and generally removing anomalies.

  1. Isolate the GPU from external factors such as other accelerated processes, display, etc.
  2. Avoid any high CPU spikes that may cause power to shift between the CPU and GPU. Your report appears to have been collected on a mobile GPU.
  3. Ensure metrics are collected in the same pass.

I ran an experiment that launched a null kernel with various configurations of --cache-control, --clock-control, and other GPU processes running, and there were variances ("spikes") when I did not control the environment.

Given the extremely small duration of the spike (<800 cycles, i.e. <1 microsecond), I suspect the spike is due to one or more of the following:

  • failure to collect metrics in the same pass
  • cache miss
  • uTLB or MMU miss
  • contention with another GPU activity from a different process or engine (e.g. display)

The difference is very close to L2 miss latency.

Hello Greg
Thanks for the reply. Following your suggestions:

  • isolate the system … yes
  • keep the CPU and GPU idle … yes (no display)
  • collect only the three metrics … yes
  • collect them in one pass … yes

ncu --launch-skip 100  --metrics sm__cycles_elapsed.max,sm__cycles_active.max,gpu__time_duration.sum,gpc__cycles_elapsed.avg.per_second  --cache-control none --clock-control base  -o m16384_n256_k4_16384_warm100_none_base ./CublassTest 

I used the cuBLAS matmul program to compare different cache-control configurations:
M=16384 N=256 K=4…16384

//warmup 100
for(i=2;i<=14;i++){
gemm(m=16384,n=256,k=pow(2,i))
}
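
For context, a rough sketch of what a harness like this CublassTest loop might look like (my reconstruction; single-precision GEMM, column-major layout, and buffers sized for the largest K are all assumptions). The other two configurations below follow the same pattern with different loop bounds.

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int m = 16384, n = 256, k_max = 16384;

    // Buffers sized for the largest K so the same allocations serve every call.
    // The data is left uninitialized; only the timing matters for this test.
    float *A, *B, *C;
    cudaMalloc((void **)&A, sizeof(float) * m * k_max);
    cudaMalloc((void **)&B, sizeof(float) * k_max * n);
    cudaMalloc((void **)&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // warmup 100 (warming up with the largest K is an assumption)
    for (int i = 0; i < 100; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k_max,
                    &alpha, A, m, B, k_max, &beta, C, m);

    // Measured calls: K = 2^2 .. 2^14, matching the first configuration above.
    for (int i = 2; i <= 14; ++i) {
        const int k = 1 << i;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A, m, B, k, &beta, C, m);
    }

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}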

M=N=8…32768 K=512

//warmup 100
for(i=3;i<=15;i++){
gemm(m=n=pow(2,i),k=512)
}

[image: overhead results]

M=N=32768 K=512

//warmup 100
for(i=0;i<13;i++){
gemm(m=n=32768,k=512)
}

[image: overhead results]

  • No negative numbers this time.
  • In the second test, except for the last point, the other overheads are all around 2000 cycles.

Is the calculation method "elapsed - active" correct?
Is the last point of the second test affected by the cache? How do I confirm that?
Another interesting thing is that the elapsed data collected with ncu --set full is smaller than when only the elapsed, active, and duration metrics are collected. What is the reason for this?

Hello Greg
Is it possible for us to further clarify what the overhead actually consists of? If you need me to run any tests, please let me know. My testing environment is:
SYS: Ubuntu
GPU: 3080 laptop
CUDA: 11.4
NCU: 2021.2.0.0
Nsys: December 12, 2021

Please help me analyze the results and explain the overhead correctly. I think the issues that need special attention are:
1. Test command:
Current method:

ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base

2. Overhead calculation method:
Current method:

sm__cycles_elapsed.max - sm__cycles_active.max

3. What does the overhead contain, and can unexpected values be explained?
The current test results are around 2000 cycles, except for M=N=32768 K=512. How can we explain this?

Looking forward to a reply