Is it acceptable to measure kernel performance using Nsight Compute?

Hello!

I’m in a situation where it’s difficult to measure with cudaEvent calls or other in-application timing, so I’m planning to measure the kernel duration with Nsight Compute and then divide the data moved by that duration to get a throughput metric. I’m wondering whether this would be a reliable method.

Thank you for reading my question

Yes, Nsight Compute has a reliable timing method; it is more accurate and reproducible than CUDA events. NCU’s default settings are designed to produce deterministic results by locking the GPU clock rate and putting the memory system into a more deterministic state. You may want to use the following options to control these behaviors:

  --cache-control arg (=all)            Control the behavior of the GPU caches during profiling. Allowed values:
                                          all
                                          none
  --clock-control arg (=base)           Control the behavior of the GPU clocks during profiling. Allowed values:
                                          base
                                          (Lock GPU clocks to base)
                                          none
                                          (Don't lock clocks)
                                          reset
                                          (Reset GPU clocks and exit)
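As a concrete illustration, a profiling run with these defaults spelled out explicitly might look like the following. The application and kernel names are placeholders, not from this thread:

```shell
# Sketch of an NCU run with the default deterministic settings made explicit.
# "./my_app" and "my_kernel" are hypothetical placeholders.
ncu --clock-control=base --cache-control=all \
    --kernel-name my_kernel \
    --metrics gpu__time_duration.sum \
    ./my_app
```

Switching to --clock-control=none --cache-control=none profiles closer to the application's normal running state, at the cost of run-to-run variation.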

Disabling cache control or not locking the clocks can cause multi-pass metrics to give less accurate results, which can even fall outside their valid range due to differences between passes. The default settings try to remove those differences between passes; however, this means the measured values will not match what the application sees in normal execution.

EXAMPLES OF DIFFERENCES

  1. L2 Priming - If the application does a host-to-device memcpy before launching a grid, then with --cache-control=all the L2 cache is invalidated before the kernel is profiled, losing any priming done by the memcpy or by a previous kernel.
  2. Clock Rate - --clock-control=base can change the compute-to-bandwidth ratio on 100-class GPUs: HBM-based GPUs cannot change the memory clock, but locking to the base clock does reduce the SM clock.
  3. Concurrent Execution - NCU serializes grids/ranges (depending on settings), giving all GPU resources to the target grid/range. In normal execution the application may run work concurrently on the compute or copy engines, which affects the measured duration.

Nsight Systems fixes neither the clocks nor the caches between kernels/ranges, so timing results from NCU and NSYS can differ even when NCU is run with --cache-control=none --clock-control=none.
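To build the throughput metric described in the question, you can divide the bytes the kernel moves by the NCU-reported duration. A minimal sketch, assuming a duration in nanoseconds taken from gpu__time_duration and a byte count known from the application (both values below are made-up sample numbers, not real profiler output):

```shell
# Hypothetical sample values, not real profiler output:
duration_ns=1048576     # kernel duration reported by NCU, in nanoseconds
bytes=268435456         # bytes read + written by the kernel, known from the app
# Bytes per nanosecond equals GB/s (1e9 bytes per second per 1e9 ns).
gbps=$(awk -v b="$bytes" -v t="$duration_ns" 'BEGIN { printf "%.2f", b / t }')
echo "$gbps GB/s"       # prints "256.00 GB/s" for these sample numbers
```

With clocks locked to base, the same calculation should be stable across runs, but as noted above it may not match the throughput seen in normal execution.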


Thank you for your answer!