Hello
I want to quantify the additional overhead of kernel launches beyond the execution time. The straightforward approach of Elapsed Cycles - SM Active Cycles doesn’t seem accurate, as SM Active Cycles appears to be weighted based on the number of SMs utilized.
My goal is to quantify the launch overhead of kernels, not just the execution time.
SM Active Cycles seems to be calculated as a weighted sum of active cycles across all SMs, based on the number of SMs actually used. If the precise weighting algorithm or formula could be provided, would it be possible to approximate the launch overhead as Elapsed Cycles - weighted SM Active Cycles?
Observing sm__cycles_active{.avg, .sum} is problematic if not all SMs are active for the same number of cycles, and it does not handle the case of launching a grid with a single thread. A more accurate method using these two metrics would be sm__cycles_elapsed.max - sm__cycles_active.max.
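For illustration (the numbers here are made up): a 1-thread grid that keeps a single SM busy for 1,000 cycles on a GPU with 48 SMs gives sm__cycles_active.avg ≈ 1000 / 48 ≈ 21 cycles, since the average is taken over all SM instances, most of them idle, so elapsed minus active.avg would overstate the overhead by hundreds of cycles; conversely, sm__cycles_active.sum overcounts whenever many SMs are busy at once. sm__cycles_active.max reports the busiest SM, so sm__cycles_elapsed.max - sm__cycles_active.max better isolates the cycles during which no SM was doing work.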
Hello Greg
Thank you for your help. It seems there are some problems with the results: I tried this method and got negative values for elapsed.max - active.max. The test program uses cuBLAS. The test command is as follows:
ncu --set full --metrics sm__cycles_active.max,sm__cycles_elapsed.max -o m16384_n256_k4_16384 ./CublassTest
The command you specified collected a full report (--set full), so sm__cycles_elapsed, sm__cycles_active, and gpu__time_duration were collected in different passes.
The information you are trying to obtain is a form of micro-benchmarking, which requires isolating the system. As a first step, make sure the three metrics are collected in the same pass by disabling all sections and collecting only those three metrics.
For improved stability, disable any other applications using the GPU. This may not be possible in your setup.
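For example, something along these lines collects only the three counters, so they should land in a single pass (binary name taken from your earlier command):
ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max,gpu__time_duration.sum ./CublassTest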
ISSUE 2 - Spike
The type of data you are trying to collect requires isolation of work and generally removing anomalies.
Isolate the GPU from external factors such as other accelerated processes, display, etc.
Avoid high CPU spikes that may cause power to be shifted between the CPU and GPU; your report appears to have been collected on a mobile GPU, where the CPU and GPU share a power budget.
Ensure metrics are collected in the same pass.
I ran an experiment that launched a null kernel with various configurations of --cache-control, --clock-control, and other GPU processes and there were variances (“spikes”) when I did not control the environment.
Given the extremely small duration of the spike (< 800 cycles, i.e. under 1 microsecond), I suspect the spike is due to one or more of the following:
failure to collect metrics in the same pass
cache miss
uTLB or MMU miss
contention with another GPU activity from a different process or engine (e.g. display)
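A minimal null-kernel test of this kind could look like the following sketch (the kernel name, file name, and binary name are placeholders, not the exact test described above):

#include <cuda_runtime.h>

// Empty kernel: any cycles measured beyond its near-zero execution time
// come from launch and scheduling overhead.
__global__ void nullKernel() {}

int main() {
    nullKernel<<<1, 1>>>();   // 1 block, 1 thread
    cudaDeviceSynchronize();  // make sure the launch completes before exit
    return 0;
}

profiled with a single-pass command such as:

ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max,gpu__time_duration.sum --cache-control all --clock-control base ./null_kernel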
Hello Greg
Thanks for the reply. Following your suggestions:
isolation of the system … yes
keep CPU and GPU idle … yes (no display)
only collecting the three metrics … yes
collecting in one pass … yes
In the second test, except for the last data point, the overheads are all around 2,000 cycles.
Is the calculation method “elapsed-active” correct?
Is the last point of the second test affected by the cache? How do I confirm that?
Another interesting thing is that the elapsed value collected with ncu --set full is smaller than when only the elapsed, active, and duration metrics are collected. What is the reason for this?
Hello Greg
Is it possible for us to further clarify what this overhead actually consists of? If you need me to run any tests, please let me know. My testing environment is:
OS: Ubuntu
GPU: RTX 3080 Laptop
CUDA: 11.4
ncu: 2021.2.0.0
nsys: December 12, 2021
Please help me analyze the results and explain the overhead correctly. I think the issues that need special attention are:
1. Test command:
Current method:
ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base
2. Overhead calculation method:
Current method:
sm__cycles_elapsed.max - sm__cycles_active.max
3. What does the overhead contain, and can unexpected values be explained?
The current test results are around 2,000 cycles, except for M=N=32768, K=512. How can we explain this?