Inconsistent kernel time between nsight and cudaEvent

yuexialeilanglong · June 12, 2023, 6:37am

As stated in this blog, there is a preferred way of measuring kernel time. With Nsight Compute, however, the measured time is quite different. (I'm new to it, so I'm not sure that is the right name. I profiled with ncu -o profile ./main and inspected the result in the Nsight Compute UI.)
Here's what happened in detail:
I launch the kernel using cuLaunchKernel() and measure the time exactly as in the blog above. The time is about 9 ms. But when I changed the code before this call, code that runs entirely on the CPU, the measured time started to vary: sometimes 9 ms, sometimes 12 ms. The weirdest part is that within one process the kernel runs three times (with other functions such as cuMemAlloc invoked in between), and the three measurements are the same. But if I run the program multiple times, the time is 9 ms in some runs and 12 ms in others. It seems to be random.
I was upset, and tried Nsight. This time the kernel time stays at 12 ms. Remember there was a version of the code where it was stable at 9 ms? Even that version takes 12 ms according to Nsight. (These numbers do not come from the same run, though, since profiling disturbs the "native" timing.)
My questions are:

What caused this phenomenon? My guess is an invalid memory access, but that still does not make sense.
Is my way of using either cudaEvent or Nsight wrong?
What's the best method for measuring time? Is it Nsight, the approach in the blog above, or something else? When they conflict, which one should I believe?

jmarusarz · June 22, 2023, 8:27pm

Measurements taken with the APIs described in that post can include things other than kernel time, for example if the GPU context-switches between start and stop, or if lazy initialization occurs. Effects like these can make the measurements inconsistent. The most consistent way to measure just the kernel time is Nsight Systems. Nsight Compute can also measure pure kernel time, but it can affect the result because it flushes caches and locks the clocks. If you are only interested in kernel duration, you can optionally disable cache flushing and clock locking in Nsight Compute.
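
As a rough illustration of the event-based approach, here is a minimal sketch that keeps one-time costs such as lazy initialization out of the measurement by running a few untimed warm-up launches and then timing several repetitions. It uses the runtime API rather than cuLaunchKernel(), and the scale kernel, the buffer size, and the warm-up and repetition counts are all assumptions made up for the example, not taken from the original program.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel under test.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 1.0001f;
}

// Print the failing call and exit main() on any CUDA error.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return 1;                                             \
        }                                                         \
    } while (0)

int main() {
    const int n = 1 << 24;   // assumed problem size
    const int warmup = 3;    // untimed launches
    const int reps = 20;     // timed launches
    const int block = 256;
    const int grid = (n + block - 1) / block;

    float *d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    CHECK(cudaMemset(d, 0, n * sizeof(float)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up: absorbs context creation, module loading and clock ramp-up.
    for (int i = 0; i < warmup; ++i) scale<<<grid, block>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Time each launch separately so outliers stay visible.
    std::vector<float> ms(reps);
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaEventRecord(start));
        scale<<<grid, block>>>(d, n);
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        CHECK(cudaEventElapsedTime(&ms[i], start, stop));
    }
    CHECK(cudaGetLastError());

    std::sort(ms.begin(), ms.end());
    printf("min %.3f ms, median %.3f ms, max %.3f ms\n",
           ms.front(), ms[reps / 2], ms.back());

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d));
    return 0;
}
```

Reporting the minimum, median and maximum rather than a single number makes run-to-run variation easier to see. For the Nsight Compute side, the options I would expect to correspond to the cache flushing and clock locking mentioned above are --cache-control none and --clock-control none, as in ncu --cache-control none --clock-control none -o profile ./main, but please check them against the documentation for your ncu version.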

359521840 · June 12, 2024, 10:28am

Hi, I just wrote a GEMM profiler and want to use grid search to find the best tile. I also find the kernel time inconsistent. Can I get better kernel time accuracy with cudaEvent?
