How to quantify kernel launch overhead using NCU?

Hello
I want to quantify the additional overhead of kernel launches beyond the execution time. The straightforward approach of Elapsed Cycles - SM Active Cycles doesn’t seem accurate, as SM Active Cycles appears to be weighted based on the number of SMs utilized.

  1. My goal is to quantify the launch overhead of kernels, not just the execution time.
  2. From reviewing the forums (https://forums.developer.nvidia.com/t/how-can-i-measure-kernel-launch-overhead-using-ncu/250619/6), I understand that in the GPU timeline, the portion captured by NCU includes launch overhead, while Nsight Systems captures the SM execution part. However, NCU is not designed to directly provide launch overhead information.
  3. SM Active Cycles seems to be calculated as a weighted sum of active cycles across all SMs, based on the number of SMs actually used. If the precise weighting algorithm or formula could be provided, would it be possible to approximate the launch overhead as Elapsed Cycles - weighted SM Active Cycles?

Does anyone have any ideas? We are looking forward to a reply.

Hi, @18511878861

Sorry for the late response.
I have sent your question to our dev team internally to see which tools can meet your requirement.

Meanwhile, can you clarify what kind of overhead you are interested in?

Observing sm__cycles_active{.avg, .sum} is problematic if not all SMs are active for the same number of cycles; in particular, it doesn't handle launching a grid of 1 thread. A more accurate method using these metrics would be

estimated_launch_overhead_in_gpc_cycles = sm__cycles_elapsed.max - sm__cycles_active.max
estimated_launch_overhead_in_ns = gpu__time_duration.sum * estimated_launch_overhead_in_gpc_cycles / sm__cycles_elapsed.max

This estimate includes the following overhead:

  • Cycles from the performance monitor start trigger to the first cycle an SM is active.
  • Cycles from the last warp completing to the stop trigger.

I have not tested this estimate. I would recommend testing on a grid of 1 thread with no parameters and empty kernel code.
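
For reference, a minimal sketch of such a test could look like the following (my own illustration; the null_kernel name and binary are placeholders, not from any existing code):

// Empty kernel with no parameters, launched as a grid of 1 block with 1 thread.
// Profile with something along the lines of:
//   ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max,gpu__time_duration.sum ./null_kernel
// and plug the reported values into the two formulas above.
#include <cuda_runtime.h>

__global__ void null_kernel() {}      // no parameters, empty body

int main()
{
    null_kernel<<<1, 1>>>();          // grid of 1 thread
    cudaDeviceSynchronize();          // wait for the launch to complete before exiting
    return 0;
}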

Hello Greg
Thank you for your help. It seems there are some problems with the results. I tried this method and found negative values in elapsed.max - active.max. The test program uses cuBLAS. The test command is as follows:

ncu --set full --metrics sm__cycles_active.max,sm__cycles_elapsed.max -o m16384_n256_k4_16384 ./CublassTest
(elapsed = sm__cycles_elapsed [cycle], active = sm__cycles_active [cycle])

M      N    K      elapsed.max  active.max  elapsed.max-active.max  active.avg  GridSize  elapsed.max-active.avg
16384  256      4        57411       56039                    1372    53481.19       512                 3929.81
16384  256      8        53557       52814                     743    47006.83       256                 6550.17
16384  256     16        54442       52944                    1498    48871.23       256                 5570.77
16384  256     32        56177       54898                    1279    49562.06       256                 6614.94
16384  256     64        62074       59025                    3049    53571.65       256                 8502.35
16384  256    128        71079       67872                    3207     61744.1       256                  9334.9
16384  256    256        94546       91361                    3185    83413.38       256                11132.62
16384  256    512       144229      141022                    3207    126686.4       256                 17542.6
16384  256   1024       247222      244968                    2254   221383.31       256                25838.69
16384  256   2048       442415      437754                    4661   424985.29       512                17429.71
16384  256   4096       829170      825422                    3748   801429.29       512                27740.71
16384  256   8192      1764811     1768175                   -3364  1591038.42       256               173772.58
16384  256  16384      3484585     3477970                    6615  3128722.02       256               355862.98

m16384_n256_k4_16384.zip (1.1 MB)

I also tried calculating elapsed.max - active.avg; the last column in the table is computed as:
elapsed.max - 48 / min(48, GridSize) * active.avg
(the SM count is 48)
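
To spell the weighting out, here is a small C/CUDA host-code sketch of that calculation (the function name is only for illustration), checked against the first row of the table:

#include <stdio.h>

// elapsed.max - (num_SMs / min(num_SMs, GridSize)) * active.avg
static double weighted_overhead(double elapsed_max, double active_avg,
                                double grid_size, double num_sms)
{
    // Scale the per-SM average up to the SMs that actually received work,
    // then subtract from the elapsed cycles of the longest-running SM.
    double active_sms = (grid_size < num_sms) ? grid_size : num_sms;
    return elapsed_max - (num_sms / active_sms) * active_avg;
}

int main(void)
{
    // First table row: M=16384 N=256 K=4, elapsed.max=57411, active.avg=53481.19, GridSize=512, 48 SMs
    printf("%.2f\n", weighted_overhead(57411.0, 53481.19, 512.0, 48.0));  // prints 3929.81
    return 0;
}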

I also tested different grid sizes with an empty kernel function:


Also, if there are such irregular points in the calculated overhead, how can we explain them?

ISSUE 1 - Negative Number

The command you specified collected a full report (--set full). The collection of sm__cycles_elapsed, sm__cycles_active, and gpu__time_duration happened in different passes.

The information you are trying to obtain is a form of micro-benchmarking. This requires isolation of the system. As a first step, make sure the three metrics are collected in the same pass by disabling all sections and collecting only the three metrics:

--metrics sm__cycles_elapsed.max,sm__cycles_active.max,gpu__time_duration.sum

For improved stability, disable any other applications using the GPU. This may not be possible in your setup.

ISSUE 2 - Spike

The type of data you are trying to collect requires isolation of work and generally removing anomalies.

  1. Isolate the GPU from external factors such as other accelerated processes, display, etc.
  2. Avoid any high CPU spikes that may cause power to shift between the CPU and GPU. Your report appears to have been collected on a mobile GPU.
  3. Ensure metrics are collected in the same pass.

I ran an experiment that launched a null kernel with various configurations of --cache-control, --clock-control, and other GPU processes running, and there were variances ("spikes") when I did not control the environment.

Given the extremely small duration of the spike (<800 cycles, i.e. <1 microsecond), I suspect the spike is due to one or more of the following:

  • failure to collect metrics in the same pass
  • cache miss
  • uTLB or MMU miss
  • contention with another GPU activity from a different process or engine (e.g. display)

The difference is very close to L2 miss latency.

Hello Greg
Thanks for the reply. Following your suggestions:

  • isolate the system … yes
  • keep the CPU and GPU idle … yes (no display)
  • collect only the three metrics … yes
  • collect them in one pass … yes

ncu --launch-skip 100  --metrics sm__cycles_elapsed.max,sm__cycles_active.max,gpu__time_duration.sum,gpc__cycles_elapsed.avg.per_second  --cache-control none --clock-control base  -o m16384_n256_k4_16384_warm100_none_base ./CublassTest 

I used the cuBLAS matmul program to compare different cache-control configurations:
M=16384 N=256 K=4…16384

//warmup 100
for(i=2;i<=14;i++){
gemm(m=16384,n=256,k=pow(2,i))
}
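
For context, a rough sketch of what a harness like this CublassTest loop might look like (my reconstruction; single-precision GEMM, column-major layout, and buffers sized for the largest K are all assumptions). The other two configurations below follow the same pattern with different loop bounds.

#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int m = 16384, n = 256, k_max = 16384;

    // Buffers sized for the largest K so the same allocations serve every call.
    // The data is left uninitialized; only the timing matters for this test.
    float *A, *B, *C;
    cudaMalloc((void **)&A, sizeof(float) * m * k_max);
    cudaMalloc((void **)&B, sizeof(float) * k_max * n);
    cudaMalloc((void **)&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // warmup 100 (warming up with the largest K is an assumption)
    for (int i = 0; i < 100; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k_max,
                    &alpha, A, m, B, k_max, &beta, C, m);

    // Measured calls: K = 2^2 .. 2^14, matching the first configuration above.
    for (int i = 2; i <= 14; ++i) {
        const int k = 1 << i;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, A, m, B, k, &beta, C, m);
    }

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}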

M=N=8…32768 K=512

//warmup 100
for(i=3;i<=15;i++){
gemm(m=n=pow(2,i),k=512)
}

[image: overhead results]

M=N=32768 K=512

//warmup 100
for(i=0;i<13;i++){
gemm(m=n=32768,k=512)
}

[image: overhead results]

  • No negative numbers this time.
  • In the second test, except for the last point, the other overheads are all around 2000 cycles.

Is the calculation method "elapsed - active" correct?
Is the last point of the second test affected by the cache? How do I confirm that?
Another interesting thing is that the elapsed data collected with ncu --set full is smaller than when only the elapsed, active, and duration metrics are collected. What is the reason for this?

Hello Greg
Is it possible for us to further clarify what the overhead actually consists of? If you need me to run any tests, please let me know. My testing environment is:
SYS: Ubuntu
GPU: 3080 laptop
CUDA: 11.4
NCU: 2021.2.0.0
Nsys: December 12, 2021

Please help me analyze the results and explain the overhead correctly. I think the issues that need special attention are:
1. Test command:
Current method:

ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base

2. Overhead calculation method:
Current method:

sm__cycles_elapsed.max - sm__cycles_active.max

3. What does the overhead contain, and can unexpected values be explained?
The current test results are around 2000 cycles, except for M=N=32768 K=512. How can we explain this?

Looking forward to a reply