Idle power usage stuck at 10-20 watts after running an app

Idle power usage on an RTX 3060 LHR is stuck at 10-20 watts after running an app, e.g. FFmpeg using the NVENC ASIC or a CUDA workload. It does not return to 4-5 watts and the card heats up to around 50 °C. Sometimes a card does drop into the low-power idle state (4-5 watts), but I am not sure why.

To reproduce: run something on the GPU, then stop it. The cards sit at 15-25 W while idle; after running modprobe -r nvidia_drm; modprobe nvidia_drm to reset the GPUs, power drops back to around 10 W, with one lucky GPU at 4 W.
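
A minimal sketch of that reset step, assuming no other process still holds the driver open when the modules are reloaded (the query fields are standard nvidia-smi properties):

# observe the stuck idle power after the workload has exited
nvidia-smi --query-gpu=index,pstate,power.draw --format=csv

# reload the DRM module to reset the GPUs (fails if anything is still using the driver)
modprobe -r nvidia_drm; modprobe nvidia_drm

# power draw drops back to ~10 W, with one lucky GPU at 4 W
nvidia-smi --query-gpu=index,pstate,power.draw --format=csv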

How can I make all the GPUs idle at 4-5 W after running a workload?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:18:00.0 Off |                  N/A |
|  0%   50C    P8    14W / 170W |      1MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:51:00.0 Off |                  N/A |
|  0%   49C    P8    10W / 170W |      1MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:8A:00.0 Off |                  N/A |
|  0%   40C    P8     4W / 170W |      1MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:C3:00.0 Off |                  N/A |
|  0%   49C    P8    10W / 170W |      1MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Hi @vans554 and welcome back to the developer forums!

Can you share a bit more detail on your setup? For example:

  • What kind of enclosure are you using?
  • What brand(s) are the GPUs?
  • Which Linux distribution is this running on?
  • When you observe the status above, what are the fans of the respective GPUs doing?
  • How long does the above status stay as is?

With this additional information I can reach out internally to see whether this is a known behavior or some unusual situation.

The above nvidia-smi output indicates that all the GPUs are correctly in the P8 idle state, which means the lowest realistic power state is reached. But the additional ~6 W alone does not justify the extra 10 °C of temperature. So my suspicion is that the fans are running at higher RPM and causing the higher idle power consumption.
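
For reference, power state, power draw, temperature and fan speed can be polled together like this (all standard nvidia-smi query fields), which would show whether the fans are actually spinning while the cards sit at 10-14 W:

nvidia-smi --query-gpu=index,pstate,power.draw,temperature.gpu,fan.speed --format=csv -l 5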

Thanks!

  • What kind of enclosure are you using?
    Air enclosure

  • What brand(s) are the GPUs?

18:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: ZOTAC International (MCO) Ltd. GA106 [GeForce RTX 3060 Lite Hash Rate]
    Physical Slot: 5

51:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: eVga.com. Corp. Device 3658
    Physical Slot: 3

8a:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: ZOTAC International (MCO) Ltd. GA106 [GeForce RTX 3060 Lite Hash Rate]
    Physical Slot: 1

c3:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: eVga.com. Corp. Device 3658
    Physical Slot: 7
  • Which Linux distribution is this running on?
    Ubuntu 22.04

  • When you observe the status above, what are the fans of the respective GPUs doing?
    Not spinning; nvidia-smi shows them at 0%.

  • How long does the above status stay as is?
    Forever.

NOTE: If I run a task that uses the NVENC ASIC plus a task that uses the CUDA cores, then kill both tasks, the idle power is even higher, around 20 W, and it never drops.
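
The exact workloads aren't listed here; something along these lines would exercise the NVENC ASIC and the CUDA cores respectively (the synthetic test source and the bandwidthTest binary from the CUDA samples are just placeholder workloads):

# NVENC: encode a synthetic test pattern with the h264_nvenc encoder
ffmpeg -y -f lavfi -i testsrc2=size=1920x1080:rate=30 -t 30 -c:v h264_nvenc /tmp/nvenc_test.mp4

# CUDA cores: any small CUDA program will do, e.g. bandwidthTest from the CUDA samples
./bandwidthTest

# after killing both tasks, idle draw stays around 20 W instead of dropping to 4-5 W
nvidia-smi --query-gpu=index,pstate,power.draw --format=csv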

NOTE 2: If I remove GPUs (leaving one) or point a leaf blower at them (which brings temps down to 30 °C), the wattage does not go down. The one GPU that can reach 4 W in the lower power state always does so, whether it is slotted solo or with the others.

Thanks for the details!

I was hoping you might have a homogeneous set of GPUs; the mixed setup of different manufacturers and different models (LHR vs. non-LHR) will make it difficult to resolve this. I will see if I can find internal resources with more information.

Do you ever see the “misbehaving” GPUs in a state with lower idle power, for example right after boot?

Have you checked with EVGA or Zotac support already? It might be worth contacting them to see if this is a known issue on their side.

I plugged a single GPU into the main slot, rebooted, and kept temps down by putting a giant fan in front of it.
As you can see, there is no difference.

As soon as we boot up

root@node1:~# nvidia-smi
Thu Jun 30 12:57:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:C3:00.0 Off |                  N/A |
|  0%   33C    P0    41W / 170W |      1MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

5-10 seconds after boot

root@node1:~# nvidia-smi
Thu Jun 30 12:57:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:C3:00.0 Off |                  N/A |
|  0%   32C    P8    14W / 170W |      1MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

30 minutes after boot

root@node1:~# nvidia-smi
Thu Jun 30 12:57:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:C3:00.0 Off |                  N/A |
|  0%   32C    P8    13W / 170W |      1MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@MarkusHoHo Any idea what could be wrong?

Hello again,

Sadly, I could not get any suggestions beyond what I stated: it might be related to manufacturer specifics of the GPUs, and they may simply be stuck with 10-14 W idle power consumption. The GPU is in the lowest power state, so it should consume less, but finding the reason without access to the hardware is not possible. So if this is a big concern for you, you should contact the OEM or the point of sale.

It is a common problem on Linux. See #9951 (OpenEncodeSessionEx failed: out of memory (10): (no details)) – FFmpeg


I would rather say it is a common problem of NVIDIA on Linux.