I repeatedly measured the time cost of a batch of short kernels with CUDA events.
The output is very unstable, as follows:
Time cost:8.61389 ms.
Time cost:6.08563 ms.
Time cost:6.55667 ms.
Time cost:6.07232 ms.
Time cost:7.49773 ms.
Time cost:6.45325 ms.
Time cost:6.08666 ms.
Time cost:6.06003 ms.
Time cost:6.09587 ms.
Time cost:8.32205 ms.
Time cost:6.47475 ms.
Time cost:6.08666 ms.
Time cost:6.05491 ms.
Time cost:7.37382 ms.
But when I use the Nsight Systems UI to profile the same program, the output is much more stable:
Time cost:6.12899 ms.
Time cost:6.11926 ms.
Time cost:6.11315 ms.
Time cost:6.10931 ms.
Time cost:6.11888 ms.
Time cost:6.11216 ms.
Time cost:6.11264 ms.
Time cost:6.11046 ms.
Time cost:6.12355 ms.
Time cost:6.10992 ms.
Time cost:6.1209 ms.
What does the Nsight Systems UI do before running the profiled program that makes the performance so stable?
Or what should I do before measuring kernel performance with CUDA events?
I tried running nvidia-smi -lgc XX -lmc XX, but it didn't help.
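For reference, this is roughly how I take the measurements (a minimal sketch, not my real code: dummy_kernel, the problem size, and the launch counts are placeholders):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real batch of short kernels.
__global__ void dummy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int rep = 0; rep < 14; ++rep) {
        cudaEventRecord(start);
        // The "batch of short kernels": many small launches back to back.
        for (int k = 0; k < 100; ++k) {
            dummy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Time cost:%g ms.\n", ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```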
You don't say which card you are using, but locking graphics/memory clocks via nvidia-smi is (I believe) only available on Tesla and Quadro cards.
However, GTX/RTX cards can have their clocks locked by Nsight Compute. Possibly the same happens with Nsight Systems; I have little experience there, but it might explain what you are seeing.
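One way to test the clock theory is to log the SM and memory clocks from inside the program while the kernels are being timed, both standalone and under the Nsight Systems UI, and compare. A minimal sketch using NVML (which ships with the driver; this is just a suggestion, link with -lnvidia-ml):

```cpp
#include <cstdio>
#include <nvml.h>

// Print the current SM and memory clocks of GPU 0.
// Call this around the timed region in both scenarios and compare.
void print_current_clocks() {
    nvmlDevice_t dev;
    unsigned int sm_mhz = 0, mem_mhz = 0;

    if (nvmlInit() != NVML_SUCCESS) return;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_mhz);
        printf("SM clock: %u MHz, memory clock: %u MHz\n", sm_mhz, mem_mhz);
    }
    nvmlShutdown();
}
```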
I am testing on an RTX 3080. The OS is Ubuntu 18.04 LTS, and the CUDA toolkit is 11.7.
I checked the memory and graphics clock speeds reported by nvidia-smi -q -d clock, and both are locked to the specified speeds. So I don't think clock speed alone explains what I saw.
Besides, I also ran the Nsight Systems CLI to profile my program, and the measured performance still varied a lot.
Only the Nsight Systems UI shows stable performance.
So the Nsight Systems UI must be doing something extra before profiling.