Is it possible that this bug is still present in 555.52.04?
My GeForce RTX 4080 Max-Q eventually gets stuck here:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.04 Driver Version: 555.52.04 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4080 ... Off | 00000000:01:00.0 On | N/A |
| N/A 43C P0 ERR! / 150W | 1633MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
It might start during boot and even set the limit to my 175W maximum.
at 23:25:08 ❯ sudo systemctl status nvidia-powerd
[sudo] password for crashdummy:
● nvidia-powerd.service - nvidia-powerd service
Loaded: loaded (/etc/systemd/system/nvidia-powerd.service; enabled; preset: enabled)
Drop-In: /usr/lib/systemd/system/service.d
└─10-timeout-abort.conf
Active: active (running) since Fri 2024-06-07 23:24:47 CEST; 23s ago
Main PID: 2478 (nvidia-powerd)
Tasks: 3 (limit: 76636)
Memory: 532.0K (peak: 1.0M)
CPU: 26ms
CGroup: /system.slice/nvidia-powerd.service
└─2478 /usr/bin/nvidia-powerd
Jun 07 23:24:47 crashtux systemd[1]: Started nvidia-powerd.service - nvidia-powerd service.
Jun 07 23:24:47 crashtux /usr/bin/nvidia-powerd[2478]: nvidia-powerd version:1.0(build 1)
Jun 07 23:24:47 crashtux /usr/bin/nvidia-powerd[2478]: Dbus Connection is established
But after a while it appears to get stuck at its last setting:
| NVIDIA-SMI 555.52.04 Driver Version: 555.52.04 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4080 ... Off | 00000000:01:00.0 On | N/A |
| N/A 54C P0 80W / 155W | 1656MiB / 12282MiB | 0% Default |
| | | N/A |
After another reboot:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.04 Driver Version: 555.52.04 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4080 ... Off | 00000000:01:00.0 On | N/A |
| N/A 55C P0 41W / 175W | 1935MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
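The three readings above show the cap at 150W, 155W and 175W at different times. To see when it stops changing, the draw and cap could be polled with nvidia-smi's CSV query mode instead of reading the table by eye. This is only a rough sketch: the helper names are made up for illustration, and I am assuming that a failed reading (like the ERR! above) shows up as some non-numeric token in CSV mode, which is mapped to None here.

```python
import subprocess

# nvidia-smi query mode; power.draw and power.limit are reported in watts.
QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_power_line(line):
    """Parse one CSV line like '41.23, 175.00' into (draw_w, limit_w).

    Assumption: any field that is not a number (e.g. '[N/A]') means the
    reading failed, so it is returned as None instead of raising.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")

    def to_watts(text):
        try:
            return float(text)
        except ValueError:
            return None

    return to_watts(fields[0]), to_watts(fields[1])


def read_power():
    """Run nvidia-smi once and return one (draw_w, limit_w) tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_power_line(l) for l in out.splitlines() if l.strip()]
```

Calling read_power() every few seconds and logging the result with a timestamp should make it visible whether the cap freezes at the last value nvidia-powerd managed to set.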
Afterwards my logs are flooded with:
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: error setting power limit
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: Error setting GPU limit: 175000.
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: error setting power limit
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: Error setting GPU limit: 175000.
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: error setting power limit
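Since the journal is flooded, a small summary is easier to share than the raw lines. Below is a minimal sketch that counts the nvidia-powerd messages and the limit values they mention. The function name is hypothetical, and I am assuming the value 175000 is in milliwatts (it matches my 175W maximum); nothing in the log itself confirms the unit.

```python
import re
from collections import Counter

# Matches journal lines of the form "... nvidia-powerd[PID]: message"
LINE_RE = re.compile(r"nvidia-powerd\[(?P<pid>\d+)\]: (?P<msg>.*)$")
# Extracts the numeric value from "Error setting GPU limit: 175000."
LIMIT_RE = re.compile(r"Error setting GPU limit: (?P<value>\d+)")

# Lines copied from the log excerpt above.
SAMPLE = """\
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: error setting power limit
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: Error setting GPU limit: 175000.
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: error setting power limit
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: Error setting GPU limit: 175000.
Jun 07 18:00:26 crashtux /usr/bin/nvidia-powerd[32650]: error setting power limit
"""


def summarize_powerd_errors(journal_text):
    """Return (message_counts, limit_counts) for nvidia-powerd journal lines.

    limit_counts is keyed by the raw integer from the log
    (assumed to be milliwatts, e.g. 175000 for 175W).
    """
    messages = Counter()
    limits = Counter()
    for line in journal_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # not an nvidia-powerd line
        msg = m.group("msg").strip()
        lim = LIMIT_RE.match(msg)
        if lim:
            limits[int(lim.group("value"))] += 1
            messages["Error setting GPU limit"] += 1
        else:
            messages[msg] += 1
    return messages, limits
```

To run it on the real journal, the text from `journalctl -u nvidia-powerd -b --no-pager` can be passed to summarize_powerd_errors() in place of SAMPLE.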
The dmesg output doesn't look like nvidia is doing anything crazy.
$ sudo dmesg | grep -i nvidia
[ 8.821580] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input18
[ 8.822338] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input19
[ 8.822561] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input20
[ 8.833622] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input21
[ 11.410580] nvidia: module license 'NVIDIA' taints kernel.
[ 11.410586] nvidia: module license taints kernel.
[ 11.592208] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[ 11.593568] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
[ 11.593745] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 11.641046] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 555.52.04 Tue Jun 4 13:54:58 UTC 2024
[ 11.705867] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 11.814803] nvidia-uvm: Loaded the UVM driver, major device number 506.
[ 11.849925] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 555.52.04 Tue Jun 4 13:21:08 UTC 2024
[ 11.853667] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 13.472530] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DP-0
[ 13.481596] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DP-0
[ 13.511857] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 13.532462] nvidia 0000:01:00.0: [drm] fb1: nvidia-drmdrmfb frame buffer device