Questions on per-process GPU utilization

Question 1: According to the documentation, nvmlDeviceGetProcessUtilization reports per-process utilization (GPU/memory/encoder/decoder utilization values, in units of percent). However, our unit test shows some confusing results.

Suppose that within a sampling period of 1 second (the sampling period of the 2080 SUPER GPU in our test), process 0 runs a kernel that lasts 0.2 seconds and process 1 runs a kernel that lasts 0.3 seconds. When either process runs exclusively as the single resident process on the GPU, we observe 20% GPU utilization for process 0 and 30% for process 1, reported correctly both by the per-process API nvmlDeviceGetProcessUtilization and by the per-device API nvmlDeviceGetUtilizationRates.
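For context, here is a minimal sketch of the kind of polling loop we use in the unit test (error handling is mostly omitted; the device index 0, the 1-second sleep, and the fixed-size sample buffer are assumptions specific to our setup):

```c
// Minimal sketch of our polling loop (assumptions: device index 0,
// a 1-second sampling interval, a fixed-size sample buffer).
#include <nvml.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned long long lastSeen = 0;  // timestamp (us) of the previous per-process query

    nvmlInit_v2();
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    for (int iter = 0; iter < 10; ++iter) {
        sleep(1);  // one sampling period

        // Per-device utilization: nvmlUtilization_t.gpu / .memory, in percent
        nvmlUtilization_t util;
        nvmlDeviceGetUtilizationRates(dev, &util);
        printf("device: gpu=%u%% mem=%u%%\n", util.gpu, util.memory);

        // Per-process utilization for samples newer than lastSeen
        nvmlProcessUtilizationSample_t samples[16];
        unsigned int count = 16;
        if (nvmlDeviceGetProcessUtilization(dev, samples, &count, lastSeen) == NVML_SUCCESS) {
            for (unsigned int i = 0; i < count; ++i) {
                printf("pid %u: sm=%u%% mem=%u%% enc=%u%% dec=%u%%\n",
                       samples[i].pid, samples[i].smUtil, samples[i].memUtil,
                       samples[i].encUtil, samples[i].decUtil);
                if (samples[i].timeStamp > lastSeen)
                    lastSeen = samples[i].timeStamp;
            }
        }
    }

    nvmlShutdown();
    return 0;
}
```

The program is linked against libnvidia-ml; the numbers quoted above are the values it prints for the two dummy processes.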

However, when processes 0 and 1 run concurrently on the GPU within a sampling period, an oddity occurs: nvmlDeviceGetProcessUtilization reports 50% GPU utilization for either process 0 or process 1, and 0% for the other. This happens whether or not the two kernels from the two processes overlap within the sampling period (in the overlapping case, GPU context switches occur and dilate the execution time of both kernels).

We even used nvidia-smi pmon as a reference and observed a similar phenomenon (below), where one process is assigned a value of 50% and the other is shown as not available (-):

# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    0      76715     C    17     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    33     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    50     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    50     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    50     0     -     -   dummy-proc-0
    0      76716     C     -     -     -     -   dummy-proc-1
    0      76715     C    50     0     -     -   dummy-proc-0

We have replicated this on 2080 Super and 3080 Ti GPUs using CUDA runtime/driver APIs 11.8 and 12.2.

So the question is: can nvmlDeviceGetProcessUtilization be used to query per-process utilization at all? What is the correct way to monitor the per-process utilization in percent?

Question 2: In the documentation, nvmlDeviceGetProcessUtilization is listed under section 2.22, vGPU APIs. Does this mean that, for some reason, the function only applies to virtual GPUs (which we are not using at all)?

Question 3: Both nvidia-smi pmon and nvmlProcessUtilizationSample_t use the term “smUtil”. Does this simply refer to the GPU utilization as in nvmlUtilization_t?
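For reference, these are the fields in question, paraphrased from nvml.h (the comments are ours, summarizing the header; exact wording may vary across driver versions):

```c
// Paraphrased from nvml.h; comments summarize the header's descriptions.
typedef struct nvmlProcessUtilizationSample_st
{
    unsigned int       pid;        // process ID
    unsigned long long timeStamp;  // CPU timestamp (microseconds) of the sample
    unsigned int       smUtil;     // "SM (3D/Compute)" utilization, in percent
    unsigned int       memUtil;    // frame buffer memory utilization, in percent
    unsigned int       encUtil;    // encoder utilization, in percent
    unsigned int       decUtil;    // decoder utilization, in percent
} nvmlProcessUtilizationSample_t;

typedef struct nvmlUtilization_st
{
    unsigned int gpu;     // percent of time one or more kernels was executing on the GPU
    unsigned int memory;  // percent of time device memory was being read from or written to
} nvmlUtilization_t;
```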

As far as I know, the term SM utilization otherwise only applies to Hopper GPUs with GPM support, where it means “percentage of SMs that were busy” (according to this), as opposed to GPU utilization, which is the “percent of time over the past sample period during which one or more kernels was executing on the GPU”. So we would like some clarification from the NVML developers.

Thanks!

  1. Per-process GPU utilization is not accurate.

  2. That documentation is wrong.

  3. They mean graphics utilization here. See https://github.com/NVIDIA/nvidia-settings/blob/1e202d03434e698e70a6fd2264d07fcf6c68ffa1/src/nvml.h#L1379

They mean graphics utilization here.

Could you elaborate a bit on this part? The comment says “SM (3D/Compute) Util Value”. Doesn’t that indicate contributions from both compute and graphics?

Either or. NVIDIA’s documentation is all over the place. The documentation for the nvmlUtilization_t struct (or wherever it is) says something about running a CUDA kernel, but the utilization value is also affected by 3D applications (e.g. games).

I have a JavaFX-based app using NVML where you can observe this more easily: GitHub - BlueGoliath/Envious-FX: A JavaFX based Nvidia GPU monitoring utility using the native NVML library. Graphics and compute both make up the “graphics” utilization shown by the “Graphics Utilization” tiles and monitors.

Process utilization is calculated only for a single running process on the GPU. It is currently not supported for concurrently running processes.

So what are the other process values? Garbage data? Do they have any meaning at all? Is there any chance of this getting fixed?
