Inconsistent kernel time between nsight and cudaEvent

As stated in this blog, there is a preferred way of measuring kernel time. When I use Nsight Compute, the reported time is quite different. (I’m new to it, so I’m not sure whether that is the right name. I ran ncu -o profile ./main for profiling and used the Nsight Compute UI to check the result.)
Here’s what happened in detail:
I launch the kernel using cuLaunchKernel() and measure the time in exactly the same way as in the blog above. The time is about 9 ms. But when I change the code that runs before this call, which executes entirely on the CPU, the measured time changes quite often: sometimes it is 9 ms and sometimes 12 ms. The weirdest part is that within a single process the kernel runs three times (with some other functions such as cuMemAlloc invoked between the launches), and the three measurements are the same. But if I run the program multiple times, the time is 9 ms in some runs and 12 ms in others. It seems to be random.
I was upset and tried Nsight Compute. There the kernel time stays at 12 ms. Remember the version of the code where the time was stable at 9 ms? Nsight Compute reports 12 ms even for that version. (These numbers did not come from the same run, though, since profiling disturbs the “native” timing.)
My questions are:

  1. What causes this behavior? My guess is an invalid memory access, but that still does not make sense.
  2. Is my way of using either cudaEvent or nsight wrong?
  3. What’s the best method to measure kernel time? Is it Nsight Compute, the method from the blog above, or something else? When they disagree, which one should I believe?

Measurements taken with the APIs described in that post can include things other than kernel time, for example if a GPU context switch happens between start and stop, or if lazy initialization occurs. These effects can make the measurements inconsistent. The most consistent way to measure just the kernel time is Nsight Systems. Nsight Compute can also measure pure kernel time, but it can affect the result because it flushes caches and locks the clocks. You can optionally disable cache flushing and clock locking in Nsight Compute if you are only interested in kernel duration.
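
As a rough illustration of reducing those effects when timing with events, here is a minimal sketch. It assumes the runtime API rather than cuLaunchKernel(), and dummy_kernel and time_kernel_ms are hypothetical names standing in for your own kernel and timing code. It runs a few untimed warm-up launches so that lazy initialization is finished before the timed region, then averages over several launches. The profiler command lines in the comments use options that, to my knowledge, exist in current ncu and nsys releases; check them against your installed version.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void dummy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the average time of one launch in milliseconds, or -1.0f on error.
// `warmup` launches run first and are not timed, so lazy initialization
// (context creation, module loading) stays out of the measured interval.
float time_kernel_ms(int n, int warmup, int iters) {
    assert(iters > 0);
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Untimed warm-up launches.
    for (int i = 0; i < warmup; ++i) dummy_kernel<<<grid, block>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Timed region: only kernel launches sit between the two events.
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) dummy_kernel<<<grid, block>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has completed

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    if (err == cudaSuccess) err = cudaGetLastError();  // catch launch failures

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return (err == cudaSuccess) ? ms / iters : -1.0f;
}

// For comparison with the profilers (option names as I understand them):
//   nsys profile --stats=true ./main
//   ncu --cache-control none --clock-control none -o profile ./main
```

Calling time_kernel_ms(1 << 20, 3, 10), for example, would time ten launches over about a million floats after three warm-up launches. Averaging hides run-to-run jitter, so if you want to see the 9 ms versus 12 ms split itself, record one event pair per launch instead.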