Idle power usage on the RTX 3060 LHR is stuck at 10-20 W after running an app, e.g. FFmpeg using the NVENC ASIC, or anything using CUDA. It does not return to 4-5 W, and the card heats up to around 50 °C. But sometimes a card does get into the low-power idle state (4-5 W); I am not sure why.
To test: run something on the GPU, then stop it. Notice 15-25 W while idle, then run modprobe -r nvidia_drm; modprobe nvidia_drm to reset the GPUs, and notice the power drops back to roughly 10 W, with one lucky GPU at 4 W.
How can I make all the GPUs idle at 4-5 W after running a workload?
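For reference, a minimal sketch of the reproduction steps (the FFmpeg command and input.mp4 are just placeholders for any NVENC/CUDA workload):

# exercise NVENC for a few seconds, then stop it (Ctrl-C)
ffmpeg -i input.mp4 -c:v h264_nvenc -f null -

# idle power now sits at 15-25 W
nvidia-smi --query-gpu=index,pstate,power.draw --format=csv

# reload the DRM module to reset the GPUs
modprobe -r nvidia_drm && modprobe nvidia_drm

# idle power drops back to ~10 W, with one lucky GPU at 4 W
nvidia-smi --query-gpu=index,pstate,power.draw --format=csv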
Hi @vans554 and welcome back to the developer forums!
Can you share a bit more detail on your setup? For example:
What kind of enclosure are you using?
What brand(s) are the GPUs?
Which Linux distribution is this running on?
When you observe the status described above, what are the fans of the respective GPUs doing?
How long does the above status persist?
With this additional information I can reach out internally to find out whether this is known behavior or an unusual situation.
The above nvidia-smi output indicates that all the GPUs are correctly in the P8 idle state, which means the lowest realistic power state is reached. But the additional ~6 W does not justify the extra 10 °C of temperature. So my suspicion is that the fans are running at higher RPMs and causing the higher idle power consumption.
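One quick way to check that would be a single query of the standard nvidia-smi fields (just a sketch, adjust to your setup):

nvidia-smi --query-gpu=index,pstate,power.draw,fan.speed,temperature.gpu --format=csv

If fan.speed stays at 0% while the draw is still 10-14 W, the fans can probably be ruled out as the cause.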
Which Linux distribution is this running on? Ubuntu 22.04
When you observe the status described above, what are the fans of the respective GPUs doing? Not spinning / nvidia-smi has them at 0%
How long does the above status persist? Forever
NOTE: if I run a task that uses the NVENC ASIC plus a task that uses the CUDA cores, then kill both tasks, the idle power is even higher, around 20 W, and it never drops.
NOTE 2: If I remove GPUs (leaving one) or point a leaf blower at them (which brings temps down to 30 °C), the wattage does not go down. It seems that one particular GPU can always reach 4 W in the low-power state, whether it is slotted solo or together with the others.
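For completeness, this is how the idle draw can be sampled continuously (plain nvidia-smi, polled once per second):

# log power state and draw once per second to confirm it never drops back to 4-5 W
nvidia-smi --query-gpu=timestamp,index,pstate,power.draw,temperature.gpu --format=csv -l 1

Redirecting that to a file makes it easy to confirm over longer periods that the draw never falls back to 4-5 W.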
I was hoping you might have a homogeneous set of GPUs; the mixed setup of different manufacturers and different GPUs (LHR vs. non-LHR) will make it difficult to resolve this. I will see if I can find internal resources with more information.
Do you ever see the “misbehaving” GPUs in a state with lower idle power, for example right after boot?
Have you already checked with EVGA or Zotac support? It might be worth contacting them to see whether this is a known issue on their side.
I plugged a single GPU into the main slot, rebooted, and kept temps well down by putting a giant fan in front of it.
As you can see below, there is no difference.
As soon as we boot up
root@node1:~# nvidia-smi
Thu Jun 30 12:57:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:C3:00.0 Off | N/A |
| 0% 33C P0 41W / 170W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
5-10 seconds after boot
root@node1:~# nvidia-smi
Thu Jun 30 12:57:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:C3:00.0 Off | N/A |
| 0% 32C P8 14W / 170W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
30 minutes after boot
root@node1:~# nvidia-smi
Thu Jun 30 12:57:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:C3:00.0 Off | N/A |
| 0% 32C P8 13W / 170W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Sadly, I could not get any suggestions beyond what I already stated: it might be related to manufacturer specifics of the GPUs, and they may simply be stuck at 10-14 W idle power consumption. The GPU is in the lowest power state, so it should consume less, but finding the reason for this without access to the hardware is not possible. So if this is a big concern for you, you should contact the OEM or the point of sale.