I’m having an issue where the output of nvidia-smi doesn’t seem to match the amount of work being done on the machine. I am running the same software on two different servers, but on one server I’m getting the following results, with a very odd 0% GPU-Util reading and very low power usage:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro RTX 6000 Off | 00000000:3B:00.0 Off | 0 |
| N/A 50C P0 72W / 250W | 5799MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Quadro RTX 6000 Off | 00000000:AF:00.0 Off | 0 |
| N/A 61C P0 82W / 250W | 1792MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Quadro RTX 6000 Off | 00000000:D8:00.0 Off | 0 |
| N/A 30C P0 50W / 250W | 0MiB / 23040MiB | 4% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2993017 C audiovisualizer 1072MiB |
| 0 N/A N/A 2993018 C audiovisualizer 1246MiB |
| 0 N/A N/A 2993019 C audiovisualizer 974MiB |
| 0 N/A N/A 2993020 C audiovisualizer 1216MiB |
| 0 N/A N/A 2993021 C audiovisualizer 1240MiB |
| 1 N/A N/A 2993022 C audiovisualizer 1658MiB |
+---------------------------------------------------------------------------------------+
On the other server, running the same software (but a different driver version), I’m getting this output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Quadro R... On | 00000000:3B:00.0 Off | 0 |
| N/A 48C P0 110W / 250W | 5714MiB / 22698MiB | 98% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Quadro R... On | 00000000:AF:00.0 Off | 0 |
| N/A 63C P0 176W / 250W | 1361MiB / 22698MiB | 92% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA Quadro R... On | 00000000:D8:00.0 Off | 0 |
| N/A 26C P8 13W / 250W | 0MiB / 22698MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 647652 C audiovisualizer 1348MiB |
| 0 N/A N/A 648903 C audiovisualizer 982MiB |
| 0 N/A N/A 648908 C audiovisualizer 1294MiB |
| 0 N/A N/A 648911 C audiovisualizer 1070MiB |
| 0 N/A N/A 648917 C audiovisualizer 1070MiB |
| 1 N/A N/A 1348854 C audiovisualizer 1216MiB |
+-----------------------------------------------------------------------------+
This looks far more correct for the anticipated load on the GPUs. I’m only using the first two GPUs on each server, so the 0% utilization for the third GPU is correct. On the first server, though, the third GPU is always stuck at 4% utilization even though I’m not using it, and the first two GPUs, which I am using, are stuck at 0% utilization.
The software’s own output is correct on both servers, i.e., the software on the server reporting 0% GPU utilization is working correctly.
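To rule out a one-off sample in the table view, the same readings could be polled through nvidia-smi's CSV query mode. Below is a minimal sketch of that, not something either server is currently running: the helper names `parse_gpu_csv` and `query_gpus` and the choice of fields are my own assumptions for illustration.

```python
import csv
import io
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`; this selection is an assumption
# chosen to mirror the columns that differ between the two servers.
FIELDS = ["index", "utilization.gpu", "power.draw", "pstate", "persistence_mode"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:  # skip blank lines
            continue
        rows.append(dict(zip(FIELDS, (v.strip() for v in rec))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed readings (hypothetical helper)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    # Take a few samples one second apart to see whether the values ever move.
    for _ in range(5):
        for gpu in query_gpus():
            print(gpu)
        time.sleep(1)
```

Running this on both servers while the workload is active would show whether utilization and power draw stay flat on the 535.86.10 machine across several samples, alongside the persistence-mode setting, which differs between the two outputs above (Off versus On).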
Any ideas on why nvidia-smi is reporting incorrect values?