575: 3090 idling at over 100 Watts

I’ve recently installed Ubuntu Server (24.04) on an AMD computer and moved two GPUs I’d previously used on a different server to it. Since installing the OS (standard installation), I’ve:

- Removed snap’s docker
- Installed real docker
- Installed nvidia drivers (nvidia-driver-575-server)
- Installed nvidia docker runtime
- Tested that the GPU works (note: had a 1660 in it for installation)
- Swapped out the 1660 for a 3090 Ti and 4500 Ada

After all of it, I’m now in the situation where my 4500 Ada is idling at around 36 Watts and the 3090 is idling at over 100 Watts.

I can’t find anything about how to solve this online, aside from “reinstall the drivers!” or “upgrade the drivers!”

I’ve now done a large number of iterations of apt remove --purge '^nvidia.*' followed by apt install nvidia-driver-575-server or apt install nvidia-driver-570-server (just to try a previous version), and even tried them with restarts in between.

Here’s nvidia-smi:

% nvidia-smi | sed 's:^:    :'
Thu Jul 17 20:50:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off |   00000000:04:00.0 Off |                  Off |
|  0%   61C    P0            103W /  450W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 4500 Ada Gene...    Off |   00000000:0A:00.0 Off |                  Off |
| 30%   55C    P0             35W /  210W |       0MiB /  24570MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Here’s the first output from top:

top - 20:52:18 up 12 min,  1 user,  load average: 0.09, 0.03, 0.00
Tasks: 382 total,   1 running, 381 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem :  64200.0 total,  62541.5 free,   1304.3 used,    967.0 buff/cache     
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.  62895.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
   1983 christo+  20   0   11904   5376   3328 R   8.3   0.0   0:00.02 top      
      1 root      20   0   22088  12272   9200 S   0.0   0.0   0:00.81 systemd  

In the boot above, there was no display connected at boot or any time since, the GPU has not been used since boot, the CPU has not really been used… It’s as idle and pristine as I can get it. And yet it’s drawing over a hundred Watts. It didn’t do this in the previous server.

How on Earth do I debug this / fix this?

# lspci -v | grep -E '(3090|4500)' -A24 | sed 's:^:    :'
04:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: eVga.com. Corp. GA102 [GeForce RTX 3090 Ti]
	Flags: bus master, fast devsel, latency 0, IRQ 39, IOMMU group 22
	Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Memory at e0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	Expansion ROM at fa000000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00] Lane Margining at the Receiver <?>
	Capabilities: [e00] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

--
0a:00.0 VGA compatible controller: NVIDIA Corporation AD104GL [RTX 4500 Ada Generation] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Dell AD104GL [RTX 4500 Ada Generation]
	Flags: bus master, fast devsel, latency 0, IRQ 104, IOMMU group 25
	Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
	Memory at b0000000 (64-bit, prefetchable) [size=256M]
	Memory at c0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at f000 [size=128]
	Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] Secondary PCI Express
	Capabilities: [bb0] Physical Resizable BAR
	Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00] Lane Margining at the Receiver <?>
	Capabilities: [e00] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

What power setting do you have configured under the nvidia-settings utility?

I don’t have nvidia-settings installed, nor did I on the previous computer. So, whatever the defaults are.

Since I posted this, I used the GPU briefly, and while the RAM was resident and it showed one process using the GPU (but 0% utilization because the program was idle) it showed an idle power consumption of 18W. But, as soon as the program was closed and the RAM and GPU utilization went back to 0, the power went back up to 100+ Watts. I’m not seeing this usage on the power meter I have the computer connected to, but it’s noisy enough I need a lot more time to make sure…

What does “nvidia-smi -q” show regarding clocks, with the cards idle?

Both are showing they are in maximum performance level (P0) when idle, so some setting is forcing this.

They’re both in P0…

% nvidia-smi -q | sed 's:^:    :'

==============NVSMI LOG==============

Timestamp                                 : Thu Jul 17 22:27:41 2025
Driver Version                            : 575.57.08
CUDA Version                              : 12.9

Attached GPUs                             : 2
GPU 00000000:04:00.0
    Product Name                          : NVIDIA GeForce RTX 3090 Ti
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Requested functionality has been deprecated
    Display Attached                      : No
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-e715a0fe-6458-3fac-4d86-9918ed100286
    Minor Number                          : 0
    VBIOS Version                         : 94.02.A0.00.5B
    MultiGPU Board                        : No
    Board ID                              : 0x400
    Board Part Number                     : N/A
    GPU Part Number                       : 2203-350-A1
    FRU Part Number                       : N/A
    Platform Info
        Chassis Serial Number             : N/A
        Slot Number                       : N/A
        Tray Index                        : N/A
        Host ID                           : N/A
        Peer Type                         : N/A
        Module Id                         : 1
        GPU Fabric GUID                   : N/A
    Inforom Version
        Image Version                     : G002.0000.00.03
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : Requested functionality has been deprecated
        Drain and Reset Recommended       : Requested functionality has been deprecated
    GPU Recovery Action                   : None
    GSP Firmware Version                  : 575.57.08
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x04
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x0
        Device Id                         : 0x220310DE
        Bus Id                            : 00000000:04:00.0
        Sub System Id                     : 0x49853842
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 4x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 350 KB/s
        Rx Throughput                     : 300 KB/s
        Atomic Caps Outbound              : N/A
        Atomic Caps Inbound               : N/A
    Fan Speed                             : 0 %
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Clocks Event Reasons Counters
        SW Power Capping                  : 0 us
        Sync Boost                        : 0 us
        SW Thermal Slowdown               : 0 us
        HW Thermal Slowdown               : 0 us
        HW Power Braking                  : 0 us
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 24564 MiB
        Reserved                          : 452 MiB
        Used                              : 0 MiB
        Free                              : 24113 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 1 MiB
        Free                              : 255 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        GPU                               : 2 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    DRAM Encryption Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 59 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 97 C
        GPU Slowdown Temp                 : 94 C
        GPU Max Operating Temp            : 92 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Average Power Draw                : 103.73 W
        Instantaneous Power Draw          : 107.23 W
        Current Power Limit               : 450.00 W
        Requested Power Limit             : 450.00 W
        Default Power Limit               : 450.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 480.00 W
    GPU Memory Power Readings 
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
    Module Power Readings
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Power Smoothing                       : N/A
    Workload Power Profiles
        Requested Profiles                : N/A
        Enforced Profiles                 : N/A
    Clocks
        Graphics                          : 1920 MHz
        SM                                : 1920 MHz
        Memory                            : 10501 MHz
        Video                             : 1680 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2115 MHz
        SM                                : 2115 MHz
        Memory                            : 10501 MHz
        Video                             : 1965 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : Requested functionality has been deprecated
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
            Route Recovery in progress    : N/A
            Route Unhealthy               : N/A
            Access Timeout Recovery       : N/A
    Processes                             : None
    Capabilities
        EGM                               : disabled

GPU 00000000:0A:00.0
    Product Name                          : NVIDIA RTX 4500 Ada Generation
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ada Lovelace
    Display Mode                          : Requested functionality has been deprecated
    Display Attached                      : No
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : <redacted>
    GPU UUID                              : GPU-714e9031-9e44-884e-beae-19042d032e7d
    Minor Number                          : 1
    VBIOS Version                         : 95.04.63.00.07
    MultiGPU Board                        : No
    Board ID                              : 0xa00
    Board Part Number                     : 900-5G132-0160-000
    GPU Part Number                       : 27B1-875-A1
    FRU Part Number                       : N/A
    Platform Info
        Chassis Serial Number             : N/A
        Slot Number                       : N/A
        Tray Index                        : N/A
        Host ID                           : N/A
        Peer Type                         : N/A
        Module Id                         : 1
        GPU Fabric GUID                   : N/A
    Inforom Version
        Image Version                     : G132.0560.00.02
        OEM Object                        : 2.1
        ECC Object                        : 6.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : Requested functionality has been deprecated
        Drain and Reset Recommended       : Requested functionality has been deprecated
    GPU Recovery Action                   : None
    GSP Firmware Version                  : 575.57.08
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x0A
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x0
        Device Id                         : 0x27B110DE
        Bus Id                            : 00000000:0A:00.0
        Sub System Id                     : 0x180C1028
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 350 KB/s
        Rx Throughput                     : 350 KB/s
        Atomic Caps Outbound              : N/A
        Atomic Caps Inbound               : N/A
    Fan Speed                             : 30 %
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Clocks Event Reasons Counters
        SW Power Capping                  : 0 us
        Sync Boost                        : 0 us
        SW Thermal Slowdown               : 0 us
        HW Thermal Slowdown               : 0 us
        HW Power Braking                  : 0 us
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 24570 MiB
        Reserved                          : 483 MiB
        Used                              : 0 MiB
        Free                              : 24088 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        GPU                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    DRAM Encryption Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 96 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 54 C
        GPU T.Limit Temp                  : 36 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
    GPU Power Readings
        Average Power Draw                : 35.42 W
        Instantaneous Power Draw          : 35.60 W
        Current Power Limit               : 210.00 W
        Requested Power Limit             : 210.00 W
        Default Power Limit               : 210.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 210.00 W
    GPU Memory Power Readings 
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
    Module Power Readings
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Power Smoothing                       : N/A
    Workload Power Profiles
        Requested Profiles                : N/A
        Enforced Profiles                 : N/A
    Clocks
        Graphics                          : 2580 MHz
        SM                                : 2580 MHz
        Memory                            : 9001 MHz
        Video                             : 2100 MHz
    Applications Clocks
        Graphics                          : 2580 MHz
        Memory                            : 9001 MHz
    Default Applications Clocks
        Graphics                          : 2580 MHz
        Memory                            : 9001 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 3105 MHz
        SM                                : 3105 MHz
        Memory                            : 9001 MHz
        Video                             : 2415 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : Requested functionality has been deprecated
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
            Route Recovery in progress    : N/A
            Route Unhealthy               : N/A
            Access Timeout Recovery       : N/A
    Processes                             : None
    Capabilities
        EGM                               : disabled

Yes, and the GPU and Memory clocks are both running near maximum.

The reason I asked about nvidia-settings is that one setting is “Adaptive”, which should allow the clocks to idle at a much lower speed. If this is available as a separate package, it may be worth installing and trying. I’m only familiar with installing the “.run” drivers, where everything’s installed.
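
To put a number on “near maximum”, here is a rough sketch that pulls the Clocks and Max Clocks sections out of `nvidia-smi -q` text and reports current clocks as a fraction of the maximum. It assumes the indentation shown in the output above (4 spaces for section names, 8 for values); the function names and the `EXCERPT` constant are made up for illustration, and the excerpt is copied from the 3090 Ti block in this thread.

```python
import re

# Excerpt of the `nvidia-smi -q` output quoted in this thread (3090 Ti only).
EXCERPT = """\
GPU 00000000:04:00.0
    Performance State                     : P0
    Clocks
        Graphics                          : 1920 MHz
        SM                                : 1920 MHz
        Memory                            : 10501 MHz
        Video                             : 1680 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2115 MHz
        SM                                : 2115 MHz
        Memory                            : 10501 MHz
        Video                             : 1965 MHz
"""

WANTED = ("Clocks", "Max Clocks")


def parse_clock_sections(text):
    """Return {bus_id: {"Clocks": {...}, "Max Clocks": {...}}} with values in MHz."""
    gpus = {}
    gpu = section = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if line.startswith("GPU "):
            # Unindented "GPU <bus id>" line starts a new device block.
            gpu = stripped.split()[1]
            gpus[gpu] = {name: {} for name in WANTED}
            section = None
            continue
        if gpu is None:
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent == 4:
            # A 4-space line without a colon is a section header.
            section = stripped if ":" not in stripped else None
        elif indent == 8 and section in WANTED:
            match = re.match(r"(\w+)\s*:\s*(\d+) MHz$", stripped)
            if match:
                gpus[gpu][section][match.group(1)] = int(match.group(2))
    return gpus


def clock_ratios(text):
    """Current clock divided by max clock, per GPU and per clock domain."""
    ratios = {}
    for gpu, sections in parse_clock_sections(text).items():
        current, maximum = sections["Clocks"], sections["Max Clocks"]
        ratios[gpu] = {name: current[name] / maximum[name]
                       for name in current if maximum.get(name)}
    return ratios


if __name__ == "__main__":
    for gpu, domains in clock_ratios(EXCERPT).items():
        for name, ratio in domains.items():
            print(f"{gpu} {name}: {ratio:.0%} of max")
```

Feeding it the full `nvidia-smi -q` text instead of the excerpt should give one entry per card, provided the layout matches.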

Ok, several things…

From my perspective, this is a bug. This is a new installation of a system with the drivers installed and this is behaving exceptionally poorly with no information as to why…

However, I just removed the apt-installed drivers and installed NVIDIA-Linux-x86_64-570.172.08.run, and now:

daystrom% nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

It looks like Ubuntu has reverted back to nouveau since that installation…
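
One way to confirm the nouveau suspicion rather than guess is to look at which kernel modules are actually loaded. The sketch below reads `/proc/modules` (standard on Linux) and reports whether `nvidia` or `nouveau` is present. The helper names are hypothetical, and the module list is just the usual set the NVIDIA driver loads.

```python
from pathlib import Path

# Module names normally associated with the two drivers.
GPU_MODULES = {"nvidia", "nvidia_drm", "nvidia_modeset", "nvidia_uvm", "nouveau"}


def loaded_gpu_modules(proc_modules_text):
    """Return the sorted GPU-related module names found in /proc/modules text."""
    # The first whitespace-separated field on each line is the module name.
    names = {line.split()[0]
             for line in proc_modules_text.splitlines() if line.strip()}
    return sorted(names & GPU_MODULES)


def describe(modules):
    """Summarise which driver appears to own the GPUs."""
    if "nouveau" in modules and "nvidia" not in modules:
        return "nouveau is loaded and nvidia is not"
    if "nvidia" in modules and "nouveau" not in modules:
        return "nvidia is loaded and nouveau is not"
    if not modules:
        return "neither nvidia nor nouveau is loaded"
    return "mixed state: " + ", ".join(modules)


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        print(describe(loaded_gpu_modules(path.read_text())))
    else:
        print("/proc/modules not available on this system")
```

The "Kernel driver in use" line from `lspci -v`, as in the listing earlier in the thread, gives the same answer per device.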

Sorry to hear that. Did the nvidia-settings package for your original driver not do anything?

After uninstalling the .run driver, I reinstalled 575-server from apt, and also installed nvidia-settings. However, nvidia-settings requires you to be running an X-server and this is headless (so it won’t do anything).

I’ve made some other… weird discoveries.

If I open nvtop, it will be at 100W and fast clocks as soon as I launch it, but in 5-10 seconds it’ll drop down to 210 MHz and 17W (and 7W for the 4500) and stay there.

Device 0 [NVIDIA GeForce RTX 3090 Ti]     PCIe GEN 1@ 4x RX: 400.0 KiB/s TX: 450.0 KiB/s
GPU 210MHz  MEM 405MHz  TEMP  45°C FAN   0% POW  17 / 450 W
GPU[                                     0%] MEM[                      0.441Gi/23.988Gi]

Device 1 [NVIDIA RTX 4500 Ada Generation] PCIe GEN 1@16x RX: 300.0 KiB/s TX: 300.0 KiB/s
GPU 210MHz  MEM 405MHz  TEMP  39°C FAN  30% POW   7 / 210 W
GPU[                                     0%] MEM[                      0.471Gi/23.994Gi]

If, in another window, I run nvidia-smi, it will put the GPU into the high power mode and it’ll jump back to 100W. If I watch nvidia-smi in that other terminal, then I can watch nvtop fluctuate power modes repeatedly. As soon as I ctrl-C the watch then I can see nvtop still running in the other window drop down to the low power state and stay there.

In all of these tests, there are zero processes running with the GPU, nor has anything been used since boot.
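
For anyone who wants to log these readings rather than eyeball them, here is a minimal sketch. It assumes the `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` interface; the function names and the `SAMPLE` text are hypothetical, and only the 103 W / P0 and 7 W readings echo numbers from this thread. Note that each call launches nvidia-smi, which, as described above, may itself change the power state it reports.

```python
import csv
import io
import subprocess

# Property names accepted by `nvidia-smi --query-gpu=`.
FIELDS = ["index", "pstate", "power.draw", "clocks.sm", "clocks.mem"]


def parse_query_output(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = [value.strip() for value in record]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs a working driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_query_output(out)


# Hypothetical sample text; the clock values are invented for illustration.
SAMPLE = "0, P0, 103.00, 1500, 9500\n1, P8, 7.00, 210, 405\n"

if __name__ == "__main__":
    for gpu in parse_query_output(SAMPLE):
        print(gpu["index"], gpu["pstate"], gpu["power.draw"], "W")
```

Calling `query_gpus()` in a loop with a sleep gives a crude log, with the caveat that the polling is not a passive observer here.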

I am extremely confused…

I wasn’t aware you were headless.

Normal behaviour is that any application that accesses a card initially primes it to a high-performance state; after a period with no activity, the card drops its clock speeds and its PCIe generation and link width to reduce power.

Your situation seems to crop up quite a bit. Some people have had success reducing monitor refresh rates. An example post is here.

I’m saying I’m having the opposite experience.

When nothing is happening, it’s in a high power state.

As soon as anything looks at it, it goes to a lower power state.

nvidia-smi, being a single scan, doesn’t reliably seem to do it (but when I put it in watch it sometimes does).

nvtop will coax it into a low power state.

If I start a Jupyter notebook with GPU access (even if all it does is look at it - not even use it or allocate memory), it’ll go to a low power state.

But when I close those and nothing is happening, it’ll go to a high power state.

Or at least that’s what it looks like.

If I’m doing nothing at all, and nothing is using the GPU, has memory resident on the GPU, nor has the GPU even open, then the GPU is in a high power state (which is shown by nvidia-smi and by nvtop for the first few seconds).

If I have nvtop running (and it’s stable in low power / low speed) and I run nvidia-smi once, I can watch in nvtop as both GPUs spike their clocks (and power) up then back down.

Another piece of information: last night (when I posted this and we had our first back and forths) I left the computer running all night with the 575 driver running. During that time the whole system averaged 178 Watts. This morning, when I posted my reply about installing the .run driver and how nouveau was running, the system dropped down to an average of 130 Watts. It has since gone back up since I got home and changed back to the non-.run driver (where the Nvidia tools actually work now).
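
To put those two averages in perspective, a quick back-of-the-envelope calculation, sketched in Python. It assumes each average would hold for a full 24 hours, which the thread does not establish (both were measured over shorter windows); the function name is made up for illustration.

```python
def daily_energy_kwh(avg_watts, hours=24.0):
    """Energy in kWh for a constant average draw over the given hours."""
    return avg_watts * hours / 1000.0


nvidia_avg_w = 178.0   # overnight average with the 575 driver loaded
nouveau_avg_w = 130.0  # average while nouveau was loaded instead

extra_w = nvidia_avg_w - nouveau_avg_w         # difference in average draw
extra_kwh_per_day = daily_energy_kwh(extra_w)  # assumes the gap persists all day

if __name__ == "__main__":
    print(f"extra draw: {extra_w:.0f} W, about {extra_kwh_per_day:.2f} kWh per day")
```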

One thing I’m now wondering: is nvidia-smi actually causing a power increase which it then reports? But then (1) why would this not have happened on my other computer with these same two GPUs and (2) why would nvtop show the high usage when it first loads, then drop back down (and stay low, even if I set nvtop to refresh every 0.1 seconds)?

any application that accesses a card primes the card to high performance state

Does this mean any access? Even nvidia-smi and nvtop?

Some people have had success reducing monitor refresh rates

As a reminder: headless. Every DP/HDMI port on the back of both GPUs is empty.

Another really, really weird thing I just found out…

When nothing is running (system is idle, nothing has the GPU open, nothing is resident on the GPU, and nvtop and nvidia-smi are not running and haven’t been for a while) the system uses about 175 Watts.

As soon as I open nvtop, the system’s power draw ramps up to 250 Watts for 10 seconds then drops down to 125 Watts.

As soon as I close nvtop, the system goes right back to 175 Watts.

Nothing else changes in this time. Nothing else is started, nothing else is closed, etc.

Yes. Here’s an output from “nvidia-smi dmon -d 2” on an idle machine:

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    25    24     -     0     0     0     0  4006   734
    0    26    25     -     0     0     0     0  4006  1506
    0    26    26     -     0     0     0     0  4006  1506
    0    26    26     -     0     0     0     0  4006  1506
    0    26    26     -     0     0     0     0  4006  1506
    0    26    26     -     0     0     0     0  4006  1506
    0    23    26     -     0     0     0     0  4006   885
    0    23    25     -     0     0     0     0  3802   885
    0    23    26     -     0     0     0     0  3802   885
    0    10    25     -     0     0     0     0   810   708
    0     9    24     -     0     0     0     0   405   164
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139

Each line is two seconds apart.
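
A small sketch of how that output could be read programmatically: it parses the dmon text above and reports when the card settled at or below a chosen power level. The 10 W threshold, the function names, and the idea of defining "settled" by power alone are assumptions for illustration, not anything nvidia-smi itself defines.

```python
DMON_SAMPLE = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    25    24     -     0     0     0     0  4006   734
    0    26    25     -     0     0     0     0  4006  1506
    0    26    26     -     0     0     0     0  4006  1506
    0    26    26     -     0     0     0     0  4006  1506
    0    26    26     -     0     0     0     0  4006  1506
    0    26    26     -     0     0     0     0  4006  1506
    0    23    26     -     0     0     0     0  4006   885
    0    23    25     -     0     0     0     0  3802   885
    0    23    26     -     0     0     0     0  3802   885
    0    10    25     -     0     0     0     0   810   708
    0     9    24     -     0     0     0     0   405   164
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
    0     9    24     -     0     0     0     0   405   139
"""


def parse_dmon(text):
    """Turn dmon text into a list of dicts keyed by the first header line."""
    names = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            if names is None:  # first comment line holds column names
                names = line.lstrip("#").split()
            continue           # second comment line holds units; skip it
        values = [None if v == "-" else int(v) for v in line.split()]
        rows.append(dict(zip(names, values)))
    return rows


def settle_seconds(rows, max_watts, interval_s):
    """Seconds after the first sample from which power stays <= max_watts."""
    settled_from = None
    for i, row in enumerate(rows):
        if row["pwr"] <= max_watts:
            if settled_from is None:
                settled_from = i
        else:
            settled_from = None  # went back up; start over
    return None if settled_from is None else settled_from * interval_s


if __name__ == "__main__":
    rows = parse_dmon(DMON_SAMPLE)
    print(settle_seconds(rows, max_watts=10, interval_s=2))
```

With the sample above and a 10 W threshold, the card counts as settled from the tenth sample onward, i.e. 18 seconds after the first reading.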

Interesting, ok so that explains one part of it.

I just did a screen capture to show you some things because it’s a bit hard to fully explain but you can see quickly (and maybe you’ll pick up on something else on the screen that I missed): here is the screen recording.

The only anomaly I see there is when you close nvtop and the power increases 50W. The bumps in clock speed at various points are normal, as the utilities initialise.

What do you see if you start nvtop, wait for clocks to settle, then run “nvidia-smi dmon -d 2” and after clocks have stabilised again, close nvtop, leaving nvidia-smi running?

Interesting. I’m going to do two separate versions of that:

Version 1:

  • Power: 175 W
  • Start nvidia-smi dmon -d 2
  • It reports 100 Watts and high speed; power meter records 255ish
  • Wait for nvidia-smi to stabilize
  • Power meter stabilizes to 125 W
  • Start nvtop
  • Clocks remain low, power remains low, all outputs unchanged
  • Stopped nvidia-smi. nvtop and power meter remain same.
  • Stopped nvtop
  • Power meter goes back to 175 W

Version 2:

  • Power: 175 W
  • Started nvtop
  • Rise and then fall exactly like in video; waited until clocks/power stabilized
  • Started nvidia-smi dmon -d 2
  • No change in clocks or powers
  • Stopped nvtop
  • No change to system power, or nvidia-smi output
  • Stopped nvidia-smi
  • System power went back up to 175W
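
Both runs fit one simple pattern, sketched below as a tiny model: settled wall power is about 125 W while at least one monitoring tool has the GPUs open, and about 175 W when none does. This only restates the observations; it is not an explanation of the cause, it ignores the brief spike to roughly 250 W when the first tool starts, and the names in the code are made up for illustration.

```python
IDLE_NO_CLIENT_W = 175    # observed when nothing has the GPUs open
IDLE_WITH_CLIENT_W = 125  # observed once clocks settle with a tool open


def replay(events):
    """Return the settled wall power predicted after each (action, tool) event."""
    open_clients = set()
    readings = []
    for action, tool in events:
        if action == "start":
            open_clients.add(tool)
        else:
            open_clients.discard(tool)
        readings.append(IDLE_WITH_CLIENT_W if open_clients else IDLE_NO_CLIENT_W)
    return readings


VERSION_1 = [
    ("start", "nvidia-smi dmon"),
    ("start", "nvtop"),
    ("stop", "nvidia-smi dmon"),
    ("stop", "nvtop"),
]
VERSION_2 = [
    ("start", "nvtop"),
    ("start", "nvidia-smi dmon"),
    ("stop", "nvtop"),
    ("stop", "nvidia-smi dmon"),
]

if __name__ == "__main__":
    print(replay(VERSION_1))
    print(replay(VERSION_2))
```

In both orderings the model only returns to 175 W after the last tool is closed, which matches what the power meter showed.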

So really all we know for sure is the overall power consumption increases, but we don’t know what part of the system is contributing the extra. I’m not convinced the GPUs are the issue.

That being said, when I had installed the .run drivers (which didn’t work, so the nouveau drivers were loaded), the power draw was at that low level as well. I’ll do a test later with a FLIR camera and the GPU to see if it actually is the GPU.