Sum of kernel time is different in ncu and nsys

Hi, all.
I use ncu and nsys to collect timing information for a TensorFlow program (ResNet-50). From ncu I take gpu__time_duration.sum; from nsys I take Total Time (ns) in the _gpukernsum file. I then add up the time values for all the kernels. There is a big gap between the two results: 1801961344 ns for ncu versus 537017196 ns for nsys. Could you please tell me the reason if you know? Thank you so much.



NCU:

  • Serializes grid launches and runs only 1 grid at a time.
  • If too many metrics are specified, NCU replays the kernel, clearing the GPU L2 cache, data cache, and instruction cache so that replays are more deterministic between launches. To avoid this, reduce the metrics collected so that no replay is needed and set --cache-control none.
  • Sets the GPU clocks to base clock frequencies. On some GPUs the base frequency is much lower than the boost frequency. Use the command line option --clock-control none to avoid fixing the clock rates.
  • Defines the start timestamp as the time the GPU front end sees the launch request and the end timestamp as the time the trailing memory barrier completes.

NSYS:

  • Does not serialize grid launches.
  • Does not set the GPU clocks to base frequency.
  • Defines the start timestamp as the time the first kernel thread executes on the GPU (this adds a small per-warp overhead) and the end timestamp as slightly before the point NCU uses for its end timestamp (generally < 200 ns difference).

The two tools measure different things. NCU is used to optimize a single grid launch. NSYS is used to resolve CPU/GPU interaction issues and GPU concurrency issues.
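To make the clock-control point concrete, here is a minimal sketch, assuming a purely compute-bound kernel whose duration scales inversely with the SM clock. The cycle count and both clock values are made-up round numbers for illustration, not A100 specifications, and real kernels (especially memory-bound ones) will not scale this cleanly.

```python
def duration_ns(cycles: int, clock_mhz: float) -> float:
    """Time to execute `cycles` GPU cycles at `clock_mhz` (compute-bound model)."""
    # cycles / MHz gives microseconds; multiply by 1000 to get nanoseconds.
    return cycles / clock_mhz * 1_000.0

# Hypothetical values for illustration only (not taken from any GPU datasheet).
CYCLES = 600_000
BASE_MHZ = 1_000.0
BOOST_MHZ = 1_500.0

at_base = duration_ns(CYCLES, BASE_MHZ)
at_boost = duration_ns(CYCLES, BOOST_MHZ)
slowdown = at_base / at_boost

print(f"base: {at_base:.0f} ns, boost: {at_boost:.0f} ns, ratio: {slowdown:.2f}x")
```

Under this simplified model, the same kernel measured at base clocks takes longer by the ratio of boost to base frequency, which is one way a sum of NCU durations can exceed a sum of NSYS durations.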


Hi @Greg,

If I’d like to measure the execution time of a single kernel, will the NCU result be more accurate when we set --cache-control none --clock-control none and use nvidia-smi to fix the GPU clocks? According to what you said, NSYS is used more to detect CPU/GPU interaction issues and GPU concurrency issues.

For the best single kernel duration (full GPU) I would use NCU with --cache-control none --clock-control none and metrics collection limited to only gpu__time_duration.sum. nvidia-smi is only required if you want to manually manipulate the GPU clock frequencies.
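As a rough sketch of that recommendation, the snippet below assembles an ncu command line that collects only gpu__time_duration.sum with cache and clock control disabled. The flags are the ones already mentioned in this thread; the helper name build_ncu_command and the script name train.py are hypothetical, since the thread does not say which script was profiled.

```python
import shlex

def build_ncu_command(app_argv, metrics=("gpu__time_duration.sum",)):
    """Return an ncu argv list that collects only the given metrics."""
    return [
        "ncu",
        "--csv", "--page", "raw",
        "--metrics", ",".join(metrics),   # keep the metric list short to avoid replay
        "--cache-control", "none",        # do not flush caches around the kernel
        "--clock-control", "none",        # do not lock clocks to base frequency
        "--target-processes", "all",
        *app_argv,
    ]

# "train.py" is a hypothetical script name; the thread does not give one.
cmd = build_ncu_command(["python", "train.py"])
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run with stdout redirected to a file to capture the CSV output.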

Yes, I use --cache-control none --clock-control none and limit metrics collection to only gpu__time_duration.sum. But the time for NCU (the sum of gpu__time_duration.sum over all kernels) is still roughly 2 times larger than the time for NSYS (Total Time (ns) in the _gpukernsum file, summed over all kernels). Is there any other reason, or did I make a mistake?

  1. What is your average grid duration?
  2. Does the application submit grids on multiple streams allowing concurrency?
  3. Does the application use CUDA graphs? NCU can serialize graph nodes but NSYS does not.

In order to answer your question I would need to see an NCU report and an NSYS report (timeline). A guess would be serialization or clock rate differences.
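For the first two questions, a minimal sketch of how one might pull the average grid duration and the number of distinct streams out of an NCU CSV export. The column names "Kernel Name" and "Stream" and the presence of a units row are assumptions about the raw-page CSV layout, and the sample data is made up; adjust both to match the actual file.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample; column names are assumed to match the ncu raw CSV export.
SAMPLE = """Kernel Name,Stream,gpu__time_duration.sum
,,nsecond
kernel_a,7,1000
kernel_a,7,3000
kernel_b,13,500
"""

def summarize(csv_text, metric="gpu__time_duration.sum"):
    """Return per-kernel duration statistics and the set of stream IDs seen."""
    durations = defaultdict(list)
    streams = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            value = float(row[metric].replace(",", ""))
        except ValueError:
            continue  # skip rows without a numeric duration (e.g. a units row)
        durations[row["Kernel Name"]].append(value)
        streams.add(row["Stream"])
    per_kernel = {
        name: {"count": len(v), "sum": sum(v), "avg": sum(v) / len(v),
               "min": min(v), "max": max(v)}
        for name, v in durations.items()
    }
    return per_kernel, streams

per_kernel, streams = summarize(SAMPLE)
for name, stats in per_kernel.items():
    print(name, stats)
print("distinct streams:", len(streams))
```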

Hi, @Greg

  1. I didn’t quite follow. Is the average grid duration you mentioned the average execution time per run of a given kernel?

  2. I think so, yes. There are five streams in total (see the ncu file or the nsys.rep file).

  3. No. I also attached the code. (see the link, file)

  4. commands I use:
    nohup ncu --csv --metrics $metrics --page=raw -f --clock-control none --cache-control none --target-processes all python > ncu_f50.csv &

/opt/nvidia/nsight-systems/2022.1.1/bin/nsys profile -t osrt,cuda,nvtx --show-output=true --export=sqlite -o ./nsys_f50 python

  5. Other related information:
    NCU version 2021.2.2.0;
    GPU: A100 40 GB;
    TensorFlow 1.15.5;

  6. All reports are in the link: forumques - Google Drive.

Thank you so much!

I apologize for the delay. I was out of the office.

I created a pivot table from ncu_f50.csv and compared it to nsys_f50_gpukernsum.csv. I do not see more than a ~3% difference when comparing averages, except for one kernel that has an 81% time difference, which is concerning. As you stated, there is a big difference in the total sum. NSYS launched 4000 more kernels than NCU. My concern is that the time difference is in the opposite direction.

The two biggest time differences come from the two kernels below. As you can see, the avg, min, and max are not significantly different, but the number of times each kernel executed is different, resulting in a very large difference in the sum.

There are cases in the other direction where NSYS runs a kernel 3200 times more than NCU.

void cudnn::cnn::conv2d_grouped_direct_kernel<(bool)0, (bool)1, (bool)0, (bool)0, (int)0, (int)0, int, float, float, float, float, float, float>(cudnn::cnn::GroupedDirectFpropParams, const T9 *, const T11 *, T10 *, T12, T12, const T10 *, const T13 *, cudnnActivationStruct)

        avg         min         max         sum             count
NCU     453,056     62,592      1,396,096   326,200,352     720
NSYS    457,456     65,952      1,395,770     9,149,115      20

void implicit_convolve_sgemm<float, float, (int)128, (int)6, (int)7, (int)3, (int)3, (int)5, (int)1, (bool)0, (bool)0, (bool)1>(int, int, int, const T1 *, int, T2 *, const T1 *, kernel_conv_params, unsigned long long, int, float, float, int, const T2 *, const T2 *, bool, int, int)

        avg         min         max         sum             count
NCU     258,830     25,088      1,156,448   384,880,544     1,487
NSYS    328,705     25,056      1,155,707    12,162,098        37
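A quick arithmetic check on the two tables above, as a sketch: the values are copied from the tables (durations in ns), and the short dictionary keys are just labels chosen here. It compares the NCU/NSYS ratio of the averages, the counts, and the sums for each kernel.

```python
# Values copied from the two tables above (durations in ns).
rows = {
    "conv2d_grouped_direct_kernel": {
        "NCU": {"avg": 453_056, "sum": 326_200_352, "count": 720},
        "NSYS": {"avg": 457_456, "sum": 9_149_115, "count": 20},
    },
    "implicit_convolve_sgemm": {
        "NCU": {"avg": 258_830, "sum": 384_880_544, "count": 1_487},
        "NSYS": {"avg": 328_705, "sum": 12_162_098, "count": 37},
    },
}

def ratios(entry):
    """NCU-to-NSYS ratio for the average, the sum, and the launch count."""
    ncu, nsys = entry["NCU"], entry["NSYS"]
    return {key: ncu[key] / nsys[key] for key in ("avg", "sum", "count")}

for name, entry in rows.items():
    r = ratios(entry)
    print(f"{name}: avg x{r['avg']:.2f}, count x{r['count']:.1f}, sum x{r['sum']:.1f}")
```

For both kernels the ratio of the sums comes out far above 1 while the ratio of the averages stays at or below 1, so the gap in the sums tracks the launch counts rather than the per-launch duration.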

Thank you. This is very helpful.
So the reason the kernel counts differ between NSYS and NCU is a problem with the tools, right? Which one is better if I’d like to get GPU time information? Thank you so much.

I ran only simple CUDA samples, and the kernel counts matched between NCU and NSYS. Without a reproducible example, the NCU and NSYS teams will have a hard time determining the source of the difference.

Got it. I really appreciate your help.
