As stated in this blog, there is a preferred way of measuring kernel execution time. But when I use Nsight Compute, the reported time is quite different. (I'm new to the tool, so I'm not sure I'm naming it correctly; I ran ncu -o profile ./main to profile and opened the result in the Nsight Compute UI.)
Here's what happened, in detail:
I launch the kernel with cuLaunchKernel() and measure the time in exactly the way the blog above describes. The measured time is about 9 ms. But when I change the code that runs before this call, code that executes entirely on the CPU, the measured time changes: sometimes it is 9 ms, sometimes 12 ms. The weirdest part is that within a single process the kernel runs three times (with some other calls such as cuMemAlloc() in between), and the three measurements agree with each other. Yet if I run the program multiple times, the time is 9 ms in some runs and 12 ms in others, seemingly at random.
Puzzled, I tried Nsight Compute. There the kernel time is consistently 12 ms. Remember the version of the code that was stable at 9 ms? Even that version is reported as 12 ms by Nsight Compute. (These records do not come from the same run, since profiling destroys the "native" time measurement.)
My questions are:
- What causes this behavior? My first guess was an invalid memory access, but that still does not really explain it.
- Am I using cudaEvent or Nsight Compute incorrectly?
- What is the best way to measure kernel time: Nsight Compute, the approach from the blog above, or something else? When they disagree, which one should I trust?