Question 1: According to the documentation, nvmlDeviceGetProcessUtilization reports per-process utilization (GPU, memory, encoder, and decoder utilization, each in percent). However, our unit tests show some confusing results.
Suppose that within a sampling period of 1 second (the period that applies to the 2080 SUPER GPU in our test), process 0 runs a kernel that lasts 0.2 seconds and process 1 runs a kernel that lasts 0.3 seconds. When either process runs alone as the only resident process on the GPU, we observe 20% GPU utilization for process 0 and 30% for process 1. Both the per-process API nvmlDeviceGetProcessUtilization and the per-device API nvmlDeviceGetUtilizationRates report these values correctly.
However, when processes 0 and 1 run concurrently on the GPU within one sampling period, an oddity occurs: nvmlDeviceGetProcessUtilization reports 50% GPU utilization for one of the two processes and 0% for the other. This happens whether or not the two kernels overlap within the sampling period (when they do overlap, GPU context switches occur and dilate the execution time of both kernels).
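To make the numbers concrete, here is a small sketch of the arithmetic we expected. The function name is ours, purely for illustration, and it assumes the two kernels do not overlap (so there is no context-switch dilation).

```python
def expected_utilization(kernel_seconds, period_seconds=1.0):
    """Percent of the sampling period each process keeps the GPU busy.

    Assumes the kernels do not overlap, so no context-switch dilation.
    """
    return {
        pid: round(100.0 * seconds / period_seconds)
        for pid, seconds in kernel_seconds.items()
    }


# Process 0 runs a 0.2 s kernel, process 1 runs a 0.3 s kernel.
per_process = expected_utilization({0: 0.2, 1: 0.3})
device_total = sum(per_process.values())

print(per_process)   # {0: 20, 1: 30}
print(device_total)  # 50
```

The 50% that we see attributed to a single process equals the device-wide total of both processes, which makes us suspect that the whole period's utilization is being assigned to one process rather than split between them.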
We also used nvidia-smi pmon as a reference and observed a similar phenomenon (below): one process is assigned a value of 50%, and the other is reported as not available (-).
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    0      76715     C    17     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    33     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    50     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    50     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    50     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    50     0     -     -   dummy-proc-0
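In case it helps anyone reproduce the comparison, this is roughly how the pmon output can be turned into per-process records. It is a minimal sketch: the parser and its field names are our own, and it assumes the eight-column layout shown above, with "-" meaning "not available". The sample string is a shortened copy of the output above.

```python
PMON_SAMPLE = """\
# gpu pid type sm mem enc dec command
# Idx # C/G % % % % name
0 - - - - - - -
0 76715 C 17 0 - - dummy-proc-0
0 76716 C - - - - dummy-proc-1
0 76715 C 50 0 - - dummy-proc-0
0 76716 C - - - - dummy-proc-1
"""


def to_int(value):
    """Convert a pmon field to int, mapping '-' (not available) to None."""
    return None if value == "-" else int(value)


def parse_pmon(text):
    """Parse pmon lines into dicts, skipping headers and idle rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, '#' header lines, and rows without a pid.
        if len(fields) < 8 or fields[0].startswith("#") or fields[1] == "-":
            continue
        gpu, pid, ptype, sm, mem, enc, dec, command = fields[:8]
        rows.append({
            "gpu": int(gpu),
            "pid": int(pid),
            "type": ptype,
            "sm": to_int(sm),
            "mem": to_int(mem),
            "enc": to_int(enc),
            "dec": to_int(dec),
            "command": command,
        })
    return rows


rows = parse_pmon(PMON_SAMPLE)
for row in rows:
    print(row["pid"], row["command"], row["sm"])
```

With the sample above, every row for pid 76716 has sm set to None, while pid 76715 carries the entire reported utilization.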
We have reproduced this on 2080 SUPER and 3080 Ti GPUs, using CUDA runtime/driver API versions 11.8 and 12.2.
So the question is: can nvmlDeviceGetProcessUtilization be used to query per-process utilization at all? What is the correct way to monitor the per-process utilization in percent?
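For context, our query looks roughly like the sketch below, written here with the Python bindings (the pynvml module from the nvidia-ml-py package) for brevity. The helper names, the device index, and the timestamp of 0 are illustrative choices, not something prescribed by the documentation, and the code needs an NVIDIA driver to actually return data.

```python
def summarize(samples):
    """Map pid -> (smUtil, memUtil, encUtil, decUtil) from NVML samples."""
    return {
        s.pid: (s.smUtil, s.memUtil, s.encUtil, s.decUtil)
        for s in samples
    }


def query_process_utilization(device_index=0, since_timestamp=0):
    """Return per-process utilization samples newer than since_timestamp."""
    # Imported here so the helper above can be used without a GPU.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        try:
            samples = pynvml.nvmlDeviceGetProcessUtilization(
                handle, since_timestamp
            )
        except pynvml.NVMLError_NotFound:
            # NVML reports "not found" when no samples are available.
            samples = []
        return summarize(samples)
    finally:
        pynvml.nvmlShutdown()
```

Calling query_process_utilization() once per sampling period is where we see one pid with smUtil of 50 and the other with 0.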
Question 2: In the documentation, nvmlDeviceGetProcessUtilization is listed in section 2.22, vGPU APIs. Does this mean that, for some reason, the function only applies to virtual GPUs (which we are not using at all)?
As far as I know, the term SM utilization only applies to Hopper GPUs with GPM support, and its meaning is “percentage of SMs that were busy” according to this, as opposed to “percent of time over the past sample period during which one or more kernels was executing on the GPU” as in GPU utilization. So we would like some clarification from the NVML developers.