Sum of kernel time is different in ncu and nsys

Hi, all.
I use ncu and nsys to collect timing information for a TensorFlow program (ResNet-50). From ncu I take gpu__time_duration.sum; from nsys I take Total Time (ns) in the _gpukernsum file. I then add up the values for all kernels, but there is a big gap between the two results: for ncu it’s 1801961344 ns, for nsys it’s 537017196 ns. Could you please tell me the reason if you know? Thank you so much.
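
For reference, the post-processing I mean is roughly the following (a pandas sketch, not my exact script; the column names are assumptions based on the two CSV outputs):

    import pandas as pd

    # ncu: one row per kernel launch; assumes a "gpu__time_duration.sum" column in ns.
    ncu = pd.read_csv("ncu_f50.csv")
    ncu_total_ns = pd.to_numeric(
        ncu["gpu__time_duration.sum"].astype(str).str.replace(",", ""), errors="coerce"
    ).sum()

    # nsys: the *_gpukernsum.csv file already holds one "Total Time (ns)" value per kernel name.
    nsys = pd.read_csv("nsys_f50_gpukernsum.csv")
    nsys_total_ns = pd.to_numeric(
        nsys["Total Time (ns)"].astype(str).str.replace(",", ""), errors="coerce"
    ).sum()

    print(f"ncu total : {ncu_total_ns:,.0f} ns")
    print(f"nsys total: {nsys_total_ns:,.0f} ns")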

NCU

  • Serializes grid launches and runs only 1 grid at a time.
  • If too many metrics are specified, NCU will replay the kernel, clearing the GPU L2 cache, data cache, and instruction cache between passes to make the replays more deterministic. To avoid this, limit the metrics collected so no replay is needed and set --cache-control none.
  • Sets the GPU clocks to base clock frequencies. On some GPUs the base frequency is much lower than the boost frequency. Use the command line option --clock-control none to avoid fixing the clock rates.
  • Measures the kernel duration from when the GPU front end sees the launch request to when the trailing memory barrier completes.

NSYS

  • Does not serialize grid launches.
  • Does not set the GPU clocks to base frequency.
  • Defines the start timestamp as when the first kernel thread executes on the GPU (which adds a small per-warp overhead) and the end timestamp slightly earlier than where NCU measures it (generally < 200 ns difference).

The two tools measure different things. NCU is used to optimize a single grid launch. NSYS is used to resolve CPU/GPU interaction issues and GPU concurrency issues.

Hi @Greg,

If I’d like to measure the execution time of a single kernel, will the NCU result be more accurate if we set --cache-control none --clock-control none and use nvidia-smi to fix the GPU clocks? According to what you said, NSYS is used more to detect CPU/GPU interaction issues and GPU concurrency issues.

For the best single-kernel duration (full GPU) I would use NCU with --cache-control none --clock-control none and metrics collection limited to only gpu__time_duration.sum. nvidia-smi is only required if you want to manually manipulate the GPU clock frequencies.
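
For example, something along these lines (the output file name is just illustrative; the flags are the ones discussed above):

    ncu --csv --metrics gpu__time_duration.sum --clock-control none --cache-control none --target-processes all python resnet.py > ncu_times.csv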

Yeah, I use --cache-control none --clock-control none and metrics collection limited to only ‘gpu__time_duration.sum’. But the time for NCU (the sum of all kernels’ ‘gpu__time_duration.sum’ values) is still larger (roughly 2 times larger) than the time for Nsys (Total Time (ns) in the _gpukernsum file, summed over all kernels). Is there any other reason, or did I make a mistake?

  1. What is your average grid duration?
  2. Does the application submit grids on multiple streams allowing concurrency?
  3. Does the application use CUDA graphs? NCU can serialize graph nodes but NSYS does not.

In order to answer your question I would need to see an NCU report and an NSYS report (timeline). A guess would be serialization or clock rate differences.

Hi, @Greg

  1. I didn’t get what you said. The average grid duration you mentioned is the average execution time of each run of this kernel, right?

  2. I think yes; there are five streams in total (see the ncu file or the nsys rep file). A quick double-check against the sqlite export is sketched after this list.

  3. No. I also attached the code (see the link, resnet.py file).

  4. commands I use:
    metrics=gpu__time_duration.sum
    nohup ncu --csv --metrics $metrics --page=raw -f --clock-control none --cache-control none --target-processes all python resnet.py > ncu_f50.csv &

    /opt/nvidia/nsight-systems/2022.1.1/bin/nsys profile -t osrt,cuda,nvtx --show-output=true --export=sqlite -o ./nsys_f50 python resnet.py

  5. Other related info:
    NCU Version 2021.2.2.0;
    cuda_11.4;
    GPU: A100 40g;
    tensorflow 1.15.5;

  6. Reports are all in the link: forumques - Google Drive.
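
For reference, the stream usage can be double-checked against the sqlite export with something like the script below (the table and column names are assumptions about the Nsight Systems export schema, and the file name follows from the -o option above):

    import sqlite3

    # Count kernel launches per CUDA stream in the exported trace.
    con = sqlite3.connect("nsys_f50.sqlite")
    rows = con.execute(
        "SELECT streamId, COUNT(*) FROM CUPTI_ACTIVITY_KIND_KERNEL GROUP BY streamId"
    ).fetchall()
    for stream_id, n in rows:
        print(f"stream {stream_id}: {n} kernel launches")
    con.close()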

Thank you so much!

I apologize for the delay. I was out of the office.

I created a pivot table from ncu_f50.csv and compared it to nsys_f50_gpukernsum.csv. Comparing averages, I do not see more than a ~3% difference for most kernels, but there is one kernel with an 81% time difference, which is concerning. As you stated, there is a big difference in the total sum. NSYS launched about 4000 more kernels than NCU; my concern is that the time difference is in the opposite direction.
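
In pandas terms, the comparison looks roughly like this (a sketch only; the column names, and the assumption that kernel names match exactly between the two files, may need adjusting):

    import pandas as pd

    # Per-launch NCU data: group by kernel name and aggregate the duration metric.
    ncu = pd.read_csv("ncu_f50.csv")
    ncu["ns"] = pd.to_numeric(
        ncu["gpu__time_duration.sum"].astype(str).str.replace(",", ""), errors="coerce"
    )
    ncu_stats = ncu.groupby("Kernel Name")["ns"].agg(["mean", "min", "max", "sum", "count"])

    # NSYS per-kernel summary is already aggregated per kernel name.
    nsys = pd.read_csv("nsys_f50_gpukernsum.csv")
    nsys_stats = nsys.set_index("Name")[["Total Time (ns)", "Instances"]].copy()
    nsys_stats["mean"] = nsys_stats["Total Time (ns)"] / nsys_stats["Instances"]

    # Outer join on kernel name so kernels present in only one report still show up.
    cmp = ncu_stats.join(nsys_stats, how="outer", lsuffix="_ncu", rsuffix="_nsys")
    print(cmp.sort_values("sum", ascending=False).head(20))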

The two big time differences come from these two kernels. As you can see, the avg, max, and min are not significantly different, but the number of times each kernel executed is, resulting in a very large difference.

There are cases in the other direction where NSYS runs a kernel 3200 times more than NCU.

void cudnn::cnn::conv2d_grouped_direct_kernel<(bool)0, (bool)1, (bool)0, (bool)0, (int)0, (int)0, int, float, float, float, float, float, float>(cudnn::cnn::GroupedDirectFpropParams, const T9 *, const T11 *, T10 *, T12, T12, const T10 *, const T13 *, cudnnActivationStruct)

        avg         min         max         sum             count
NCU     453,056     62,592      1,396,096   326,200,352     720
NSYS    457,456     65,952      1,395,770     9,149,115      20

void implicit_convolve_sgemm<float, float, (int)128, (int)6, (int)7, (int)3, (int)3, (int)5, (int)1, (bool)0, (bool)0, (bool)1>(int, int, int, const T1 *, int, T2 *, const T1 *, kernel_conv_params, unsigned long long, int, float, float, int, const T2 *, const T2 *, bool, int, int)

        avg         min         max         sum             count
NCU     258,830     25,088      1,156,448   384,880,544     1,487
NSYS    328,705     25,056      1,155,707    12,162,098        37

Hi, Greg.
Thank you. This is very helpful.
So the difference in kernel counts between NSYS and NCU comes from the tools themselves, right? Which one is better if I’d like to get GPU time information? Thank you so much.

I ran only simple CUDA samples and the kernel counts matched between NCU and NSYS. Without a reproducer, the NCU and NSYS teams will have a hard time determining the source of the difference.

Got it. I really appreciate your help.
