Which tool can accurately obtain kernel performance, ncu or nsys?

I am trying to obtain accurate performance data for a GPU Matmul operator. The test sample is adapted from the cuBLASLt/LtNvfp4Matmul sample in NVIDIA/CUDALibrarySamples on GitHub, and I have used two methods for testing (B200, CUDA 12.8):

  1. Using ncu --clock-control none --set full ../LtNvfp4Matmul/build/sample_cublasLt_LtNvfp4Matmul 16 8192 5120 0 1 0, the duration obtained is 11.15 µs.
  2. Using nsys profile --trace=cuda,nvtx --stats=true ../LtNvfp4Matmul/build/sample_cublasLt_LtNvfp4Matmul 8192 8192 5120 0 1 0, the kernel execution time obtained is 7.0 µs.

There is a significant discrepancy between the times measured by the two tools. Which one should I trust?
Is the kernel execution time measured with nsys accurate? Are warm-up operations or other steps necessary?
If I need to obtain real and accurate kernel performance data in a production environment, which tool should I use for testing, and how should I conduct the test?

This is a memory-bound case. If the kernel time is 7.0 µs, then the equivalent bandwidth is 3–4 TB/s, which is only half of the B200’s HBM bandwidth. Could it be that this 7.0 µs is also inaccurate, and the actual kernel execution time should be shorter?
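
As a rough cross-check of the bandwidth figure above, the following sketch estimates the bytes moved and divides by each measured time. It assumes that the sample's first three arguments are M, N, K (16, 8192, 5120, as in the ncu command), that each NVFP4 operand stores 4-bit elements plus one 1-byte scale factor per 16-element block, and that the output uses 2 bytes per element. The argument order and the output type are assumptions, not values confirmed by the sample.

```python
# Rough memory-traffic estimate for an NVFP4 matmul (assumptions labeled below).
M, N, K = 16, 8192, 5120        # assumed meaning of the sample's first three arguments

FP4_BYTES = 0.5                 # 4-bit element
SCALE_BLOCK = 16                # NVFP4: one scale factor per 16 elements
SCALE_BYTES = 1                 # 1-byte (FP8) scale factor
OUT_BYTES = 2                   # assumed 16-bit output type


def fp4_operand_bytes(rows: int, cols: int) -> int:
    """Bytes for a 4-bit operand plus its block scale factors."""
    elems = rows * cols
    return int(elems * FP4_BYTES) + (elems // SCALE_BLOCK) * SCALE_BYTES


# Each operand is read once and the output is written once (no reuse modeled).
total_bytes = fp4_operand_bytes(M, K) + fp4_operand_bytes(K, N) + M * N * OUT_BYTES


def bandwidth_tb_s(duration_us: float) -> float:
    """Effective bandwidth in TB/s if total_bytes move in duration_us."""
    return total_bytes / (duration_us * 1e-6) / 1e12


for tool, duration_us in (("nsys", 7.0), ("ncu", 11.15)):
    print(f"{tool}: {duration_us} us -> {bandwidth_tb_s(duration_us):.2f} TB/s")
```

Under these assumptions the 7.0 µs figure corresponds to roughly 3.4 TB/s, which is consistent with the 3–4 TB/s estimate.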

There are two likely reasons why the time differs:

  1. NCU defaults to --cache-control=all, which clears all of the GPU caches before the kernel is profiled. If a previous workload (copy or kernel) had primed the caches, that data is lost and must be fetched again from device memory or system memory. Consider running with application replay, --replay-mode=application (or app-range), and --cache-control=none.
  2. NCU and NSYS capture the start and end timestamps differently, which can result in a 0.5–2 µs difference.
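
To see how the two effects might combine, the sketch below subtracts the 0.5 to 2 µs timestamp difference from the measured gap. It uses only the numbers quoted in this thread and assumes both runs measured the same problem size. Attributing the remainder to cache clearing is an assumption, not a measurement.

```python
# Split the measured ncu/nsys gap into a timestamp part and a remainder.
NCU_US = 11.15                      # duration reported by ncu (from the question)
NSYS_US = 7.0                       # kernel time reported by nsys (from the question)
TIMESTAMP_DIFF_US = (0.5, 2.0)      # start-timestamp difference range (from the answer)

gap_us = round(NCU_US - NSYS_US, 2)

# Remainder left after removing the timestamp difference; assumed here to come
# from other effects such as cache clearing under --cache-control=all.
remainder_us = tuple(round(gap_us - d, 2) for d in reversed(TIMESTAMP_DIFF_US))

print(f"gap: {gap_us} us")
print(f"not explained by timestamps: {remainder_us[0]} to {remainder_us[1]} us")
```

Because the gap is larger than the upper end of the timestamp range, the timestamp difference alone would not account for it under these assumptions.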

NCU Timestamps

  • The Start Timestamp is captured by inserting a command into the command buffer that writes out a timestamp before the grid launch command, but after any additional work such as copying parameters into device memory is done. The duration therefore includes the part of the compute pipe from the hardware scheduler to the compute work distributor. This can add 0.5–2.0 µs.
  • The Stop Timestamp is captured after the grid completes its MEMBAR, which guarantees that all memory writes have been ACK’d at either system memory or GPU L2.

NSYS Timestamps

  • The Start Timestamp is captured using different techniques:
    • If Hardware Event Trace is available on the GPU (Blackwell+) and OS (not supported on some mobile and vGPU environments at this time), then it is used. The Start Timestamp is taken when the Compute Work Distributor launches the first CTA. This can still be 100s of cycles before the first instruction is issued.
    • Else, CUPTI adds instructions before the user kernel that are executed by every warp, and Warp 0 writes out a timestamp to memory. This adds per-warp overhead and is not used by NCU because of its impact on the performance counters.
  • The End Timestamp is captured using different techniques:
    • If Hardware Event Trace is available, then this is after the grid completes the MEMBAR.
    • Else, CUPTI will have hardware generate a timestamp after the MEMBAR completes.
    • These two should be very close.
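
The measurement windows described above can be sketched as a toy timeline. The event times below are made-up values chosen only for illustration, and the class and field names are hypothetical; the Hardware Event Trace path is assumed for NSYS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelTimeline:
    """Hypothetical event times for one kernel launch, in nanoseconds."""
    timestamp_cmd_ns: int   # timestamp command in the command buffer (NCU start)
    first_cta_ns: int       # Compute Work Distributor launches first CTA (NSYS start)
    membar_done_ns: int     # grid's MEMBAR completes (end for both tools)

    def ncu_duration_ns(self) -> int:
        # NCU window: timestamp command to MEMBAR completion.
        return self.membar_done_ns - self.timestamp_cmd_ns

    def nsys_duration_ns(self) -> int:
        # NSYS window: first CTA launch to MEMBAR completion.
        return self.membar_done_ns - self.first_cta_ns


# Made-up values, not measurements.
timeline = KernelTimeline(timestamp_cmd_ns=0, first_cta_ns=1200, membar_done_ns=8200)

print("ncu :", timeline.ncu_duration_ns(), "ns")
print("nsys:", timeline.nsys_duration_ns(), "ns")
print("diff:", timeline.ncu_duration_ns() - timeline.nsys_duration_ns(), "ns")
```

In this model both tools share the same end point, so the whole difference comes from where the start timestamp is taken.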