Idle power usage problem (P8) after Debian driver dist-upgrade 470→525 [RTX 3090]

Hi,

I just upgraded my Debian server from bullseye (11) to bookworm (12).
This resulted in the new default Debian NVIDIA driver being used.

Debian 11: 470.182.03 → Debian 12: 525.105.17

With the new driver, idle power no longer drops back to P8 at around 7 Watts; instead the card now draws about 20 Watts at idle.
After a reboot it stays at 7 Watts at first, but once I load my LLM model it won't idle at 7 Watts anymore.

The RTX 3090 is attached to a headless system via an external USB PCIe 2.0 x1 breakout adapter from a x16 slot; no X.org or nouveau kernel driver is running, and persistence mode is enabled.
It has an LLM model loaded but not in use (0% utilization).

Any ideas how I could fix this without going back to the old driver?
Which system dumps or other nvidia-smi outputs are needed to analyze this driver issue further?

Update: When I disable the LLM and then run “nvidia-smi -r”, power drops back to 7-8 Watts.
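
For completeness, a minimal sketch of that sequence, assuming GPU index 0 (nvidia-smi -r needs root and only works once no process is using the GPU anymore):

    # Stop the LLM first so nothing holds the GPU, then reset it
    sudo nvidia-smi -r -i 0
    # Verify that power draw and performance state have dropped again
    nvidia-smi -i 0 --query-gpu=power.draw,pstate --format=csv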

    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled

nvidia-smi outputs (old 470 driver at 7 W first, then new 525 driver at 20 W):

Sat Jun 17 22:40:38 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0 Off |                  N/A |
|  0%   39C    P8     7W / 350W |  18520MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     58715      C   python                          16952MiB |
+-----------------------------------------------------------------------------+
Sun Jun 18 18:47:44 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0 Off |                  N/A |
|  0%   56C    P8    20W / 350W |  16955MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     35226      C   python                          16952MiB |
+-----------------------------------------------------------------------------+

I have the same problem, but I noticed that it could go down to 5 W on the desktop version. See my post here:

Would be nice to get to the bottom of this. Wasting 3x as much power at idle is not good.


Yes, that's why my workaround at the moment is to reset the card via nvidia-smi …
I also tried installing the latest stable production branch; same problem:

Wed Jun 21 01:33:34 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:03:00.0 Off |                  N/A |
|  0%   53C    P8              19W / 350W |  16878MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18129      C   python                                    16870MiB |
+---------------------------------------------------------------------------------------+

I gave up and went back to the desktop version. This power usage makes no sense, but I'm fairly certain it's related to xorg.conf. If the reset works for you, maybe you can write a script that executes something on the GPU and then does a reset as a workaround. I'll try to gather more info and contact NVIDIA directly.
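
Something like this, as a rough sketch (untested; it assumes the model has already been unloaded, since the reset fails while any process still holds the GPU, and that GPU index 0 is the right card):

    #!/bin/sh
    # Rough workaround: reset GPU 0 when it is idle but stuck above ~10 W.
    # Assumes nothing is using the GPU at that point, otherwise nvidia-smi -r fails.
    UTIL=$(nvidia-smi -i 0 --query-gpu=utilization.gpu --format=csv,noheader,nounits)
    POWER=$(nvidia-smi -i 0 --query-gpu=power.draw --format=csv,noheader,nounits)
    if [ "$UTIL" -eq 0 ] && [ "${POWER%.*}" -gt 10 ]; then
        sudo nvidia-smi -r -i 0
    fi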

I switched back to 470.182.03 for now, which is working fine and stable.
530.41.03 and 535.54.03 have the same idle power issues.
My server also crashed when I loaded different types of PyTorch models, so I had to hard reboot.
On another day the GPU fell off the bus:

[ 5893.142189] NVRM: Xid (PCI:0000:03:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 5893.142202] NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.
[ 5893.142233] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
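
For reference, this is roughly how I pull those driver messages out of the kernel log, plus the report the driver asks for (it writes nvidia-bug-report.log.gz into the current directory):

    # Kernel messages from the NVIDIA driver (Xid errors, "fallen off the bus")
    sudo dmesg | grep -E "NVRM|Xid"
    # Full diagnostic bundle requested in the message above, run as root
    sudo nvidia-bug-report.sh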

Not to mention that I can't hot-plug the GPU via the USB cable (it carries a PCIe x1 link), so it is not possible to plug it in while my server is running; that requires yet another reboot.

I hope NVIDIA will address those problems soon.

These idle power issues are also present on data center cards and setups; see the screenshots below:

I have to resurrect this thread, as I came back to check how much the data center GPUs consume at idle, though that might be a different problem. Try looking into the P-states: they seem to be stuck in P0, which is a performance state, unlike P8, which is the lowest tier, i.e. power saving. I'll keep an eye on this thread to see where it goes, if it goes anywhere at all.
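
A quick way to watch this, as a sketch (refreshes every 5 seconds; adjust the interval as needed):

    # Print performance state, power draw and utilization every 5 seconds
    nvidia-smi --query-gpu=index,name,pstate,power.draw,utilization.gpu --format=csv -l 5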