Unstable performance measured by cuda event

SparkHu · December 6, 2022, 8:28am

I repeatedly measured the time cost of a batch of short kernels using CUDA events.
The output is very unstable, as shown below:

Time cost:8.61389 ms.
Time cost:6.08563 ms.
Time cost:6.55667 ms.
Time cost:6.07232 ms.
Time cost:7.49773 ms.
Time cost:6.45325 ms.
Time cost:6.08666 ms.
Time cost:6.06003 ms.
Time cost:6.09587 ms.
Time cost:8.32205 ms.
Time cost:6.47475 ms.
Time cost:6.08666 ms.
Time cost:6.05491 ms.
Time cost:7.37382 ms.

But when I profile the same program with the Nsight Systems UI, the output is much more stable:

Time cost:6.12899 ms.
Time cost:6.11926 ms.
Time cost:6.11315 ms.
Time cost:6.10931 ms.
Time cost:6.11888 ms.
Time cost:6.11216 ms.
Time cost:6.11264 ms.
Time cost:6.11046 ms.
Time cost:6.12355 ms.
Time cost:6.10992 ms.
Time cost:6.1209 ms.
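
To put rough numbers on the difference, here is a minimal sketch that summarizes the two sets of timings above. It is not part of the measured program; the list names and the `summarize` helper are made up for illustration.

```python
from statistics import median, pstdev

# Timings in ms, copied from the two runs posted above.
standalone = [
    8.61389, 6.08563, 6.55667, 6.07232, 7.49773, 6.45325, 6.08666,
    6.06003, 6.09587, 8.32205, 6.47475, 6.08666, 6.05491, 7.37382,
]
under_nsys_ui = [
    6.12899, 6.11926, 6.11315, 6.10931, 6.11888, 6.11216,
    6.11264, 6.11046, 6.12355, 6.10992, 6.1209,
]


def summarize(samples):
    """Return (min, median, max, population std dev) for a list of timings."""
    return min(samples), median(samples), max(samples), pstdev(samples)


for name, samples in (("standalone", standalone),
                      ("Nsight Systems UI", under_nsys_ui)):
    lo, med, hi, sd = summarize(samples)
    print(f"{name}: min={lo:.5f} median={med:.5f} "
          f"max={hi:.5f} stddev={sd:.5f} ms")
```

The fastest standalone sample (6.05491 ms) is slightly below the fastest sample under the UI (6.10931 ms), so the standalone runs are not uniformly slower; the instability comes from the outliers between roughly 6.4 ms and 8.6 ms.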

What does the Nsight Systems UI do before running the profiled program that makes the performance so stable?
Or what should I do before measuring kernel performance with CUDA events?

I tried running nvidia-smi -lgc XX -lmc XX, but it didn’t help.

rs277 · December 6, 2022, 9:02am

You do not say what card you are using, but locking graphics/memory clocks via nvidia-smi is, I believe, only available on Tesla and Quadro cards.

However, GTX/RTX cards can have their clocks locked by Nsight Compute. Possibly the same occurs with Nsight Systems. I have little experience there, but this might explain what you are seeing.

SparkHu · December 6, 2022, 9:17am

I am testing on an RTX 3080. The OS is Ubuntu 18.04 LTS and the CUDA Toolkit version is 11.7.
I checked the memory and graphics clock speeds reported by nvidia-smi -q -d clock, and both are locked to the specified speed. So I don’t think clock speed alone caused what I saw.

Besides, I also ran the Nsight Systems CLI to profile my program, and the measured performance also varied a lot.
Only the Nsight Systems UI shows stable performance.

So the Nsight Systems UI must be doing something before it starts profiling.

The question is: what is it?

njuffa · December 6, 2022, 11:31am

This seems like an excellent question for the Nsight Systems forum: https://forums.developer.nvidia.com/c/development-tools/nsight-systems, or the Nsight Compute forum: https://forums.developer.nvidia.com/c/development-tools/nsight-compute/
