I’ve recently installed Ubuntu Server (24.04) on an AMD computer and moved two GPUs I’d previously used on a different server to it. Since installing the OS (standard installation), I’ve:
- Removed snap’s docker
- Installed real docker
- Installed nvidia drivers (nvidia-driver-575-server)
- Installed nvidia docker runtime
- Tested that the GPU works (note: had a 1660 in it for installation)
- Swapped out the 1660 for a 3090 Ti and 4500 Ada
After all of that, I’m now in the situation where my 4500 Ada is idling at around 36 Watts and the 3090 Ti is idling at over 100 Watts.
I can’t find anything about how to solve this online, aside from “reinstall the drivers!” or “upgrade the drivers!”
I’ve now done a large number of iterations of apt remove --purge '^nvidia.*' followed by apt install nvidia-driver-575-server or apt install nvidia-driver-570-server (just to try a previous version), and even tried them with restarts in between.
Here’s nvidia-smi:
% nvidia-smi | sed 's:^: :'
Thu Jul 17 20:50:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Ti Off | 00000000:04:00.0 Off | Off |
| 0% 61C P0 103W / 450W | 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX 4500 Ada Gene... Off | 00000000:0A:00.0 Off | Off |
| 30% 55C P0 35W / 210W | 0MiB / 24570MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
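One thing that stands out in the table above is that both cards report Perf P0 (the highest performance state) at 0% utilization. As a minimal sketch for watching only those fields, the snippet below filters nvidia-smi's CSV output for GPUs that sit in P0 while idle. The function name flag_idle_p0 is hypothetical, the sample rows are transcribed from the table above, and the full name of the second card is an assumption taken from the lspci output, since nvidia-smi truncates it.

```shell
#!/usr/bin/env bash
# flag_idle_p0 (hypothetical helper): reads CSV rows of the form
#   index, name, pstate, power.draw, utilization.gpu
# on stdin and prints every GPU that is in P0 at 0% utilization.
flag_idle_p0() {
  awk -F', *' '$3 == "P0" && $5 + 0 == 0 {
    printf "GPU %s (%s): %s W in P0 while idle\n", $1, $2, $4
  }'
}

# Live usage (needs the NVIDIA driver loaded):
#   nvidia-smi --query-gpu=index,name,pstate,power.draw,utilization.gpu \
#              --format=csv,noheader,nounits | flag_idle_p0

# Sample rows transcribed from the nvidia-smi table in this post.
# Assumption: the second card's full name, which the table truncates.
printf '%s\n' \
  '0, NVIDIA GeForce RTX 3090 Ti, P0, 103, 0' \
  '1, NVIDIA RTX 4500 Ada Generation, P0, 35, 0' | flag_idle_p0
```

With the sample rows, both GPUs are flagged, which matches what the table shows. Running the live command in a loop (for example under watch) would show whether either card ever drops out of P0.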
Here’s the first output from top:
top - 20:52:18 up 12 min, 1 user, load average: 0.09, 0.03, 0.00
Tasks: 382 total, 1 running, 381 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64200.0 total, 62541.5 free, 1304.3 used, 967.0 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 62895.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1983 christo+ 20 0 11904 5376 3328 R 8.3 0.0 0:00.02 top
1 root 20 0 22088 12272 9200 S 0.0 0.0 0:00.81 systemd
In the boot above, no display was connected at boot or at any time since, the GPUs have not been used since boot, and the CPU has barely been used. It’s as idle and pristine as I can get it. And yet the 3090 Ti is drawing over a hundred Watts. It didn’t do this in the previous server.
How on Earth do I debug this / fix this?
# lspci -v | grep -E '(3090|4500)' -A24 | sed 's:^: :'
04:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090 Ti] (rev a1) (prog-if 00 [VGA controller])
Subsystem: eVga.com. Corp. GA102 [GeForce RTX 3090 Ti]
Flags: bus master, fast devsel, latency 0, IRQ 39, IOMMU group 22
Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
Expansion ROM at fa000000 [virtual] [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
0a:00.0 VGA compatible controller: NVIDIA Corporation AD104GL [RTX 4500 Ada Generation] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Dell AD104GL [RTX 4500 Ada Generation]
Flags: bus master, fast devsel, latency 0, IRQ 104, IOMMU group 25
Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
Memory at b0000000 (64-bit, prefetchable) [size=256M]
Memory at c0000000 (64-bit, prefetchable) [size=32M]
I/O ports at f000 [size=128]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [bb0] Physical Resizable BAR
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver <?>
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
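The lspci -v output above lists the PCIe capabilities but not the negotiated link state. As a rough sketch (not a known fix), the snippet below pulls the bus addresses of the NVIDIA VGA controllers out of lspci output so each one can be inspected with lspci -vv. The function name nvidia_gpu_addrs is hypothetical, and whether the link or ASPM state has anything to do with the idle power is only a guess on my part.

```shell
#!/usr/bin/env bash
# nvidia_gpu_addrs (hypothetical helper): reads plain `lspci` lines on stdin
# and prints the bus address of every NVIDIA VGA controller.
nvidia_gpu_addrs() {
  awk '/VGA compatible controller: NVIDIA/ { print $1 }'
}

# Live usage (root is needed for lspci to decode the full capability list):
#   lspci | nvidia_gpu_addrs | while read -r addr; do
#     echo "== $addr"
#     sudo lspci -vv -s "$addr" | grep -E 'LnkCap:|LnkCtl:|LnkSta:'
#   done

# Device lines copied from the lspci output in this post.
printf '%s\n' \
  '04:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090 Ti] (rev a1) (prog-if 00 [VGA controller])' \
  '0a:00.0 VGA compatible controller: NVIDIA Corporation AD104GL [RTX 4500 Ada Generation] (rev a1) (prog-if 00 [VGA controller])' \
  | nvidia_gpu_addrs
```

The LnkCtl line reports whether ASPM is enabled and LnkSta reports the current link speed and width, so comparing the two cards (and, if possible, the old server) might show a difference worth chasing.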