Nvidia-powerd crashing my laptop with CPU Hotplug

Hello there.

I’ve noticed the following behavior recently: whenever I plug the power cord back into my laptop, nvidia-powerd aborts with “Error, malformed CPU data.” and takes the whole machine down with it.

Relevant Setup data:

  • ASUS TUF Dash F15 FX517ZR, Nvidia RTX 3070 Max-Q
  • 12th Gen Intel i7-12650H
  • Arch Linux, Kernel 6.6.3-zen1-1-zen
  • CPU hotplug set up with laptop-mode-tools. Per /etc/laptop-mode/conf.d/cpuhotplug.conf, when on battery it takes cores 2 to 11 offline, keeping only cpu0 and cpu1 (first core + HT) and the last 4 cores, which are the efficiency cores (12, 13, 14, 15), for better battery life.
  • optimus-manager, so the nvidia GPU is only used when needed.

laptop-mode-tools takes the desired cores offline by issuing echo 0 > /sys/devices/system/cpu/cpuY/online (replace Y with the core index number).
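
For reference, that sysfs write can be sketched as a small shell helper. This is only an illustration, not how laptop-mode-tools implements it: the function name set_cpu_range and the CPU_SYSFS_ROOT override are made up for this example, and the override exists only so the helper can be tried against a scratch directory instead of the real /sys/devices/system/cpu.

```shell
# Sketch only: toggle a range of CPUs through the sysfs hotplug files.
# CPU_SYSFS_ROOT is a hypothetical override for dry runs; the real
# location is /sys/devices/system/cpu and writing there needs root.
CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# set_cpu_range STATE FIRST LAST
# Writes STATE (0 = offline, 1 = online) to cpuN/online for N = FIRST..LAST.
set_cpu_range() {
    local state first last n file
    state=$1; first=$2; last=$3
    n=$first
    while [ "$n" -le "$last" ]; do
        file="$CPU_SYSFS_ROOT/cpu$n/online"
        # Skip CPUs that expose no 'online' file (often the case for cpu0).
        if [ -e "$file" ]; then
            echo "$state" > "$file"
        fi
        n=$((n + 1))
    done
}
```

With the core layout from the setup list, set_cpu_range 0 2 11 would correspond to going on battery and set_cpu_range 1 2 11 to plugging the cord back in.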

The thing is: the error does not reproduce when I unplug the power cord and those CPUs are taken offline. But when I plug the power back in, nvidia-powerd crashes and the machine hard-reboots, so I lose everything I had open.

Relevant logs:

Nov 29 22:03:19 sandworm /usr/bin/nvidia-powerd[31906]: nvidia-powerd version:1.0(build 1)
Nov 29 22:03:20 sandworm /usr/bin/nvidia-powerd[31906]: Error, malformed CPU data.
Nov 29 22:03:20 sandworm nvidia-powerd[31906]: terminate called after throwing an instance of 'std::runtime_error'
Nov 29 22:03:20 sandworm nvidia-powerd[31906]:   what():  cpuid_error
Nov 29 22:03:20 sandworm systemd[1]: Started Process Core Dump (PID 31913/UID 0).
░░ Subject: A start job for unit systemd-coredump@1-31913-0.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit systemd-coredump@1-31913-0.service has finished successfully.
░░ 
░░ The job identifier is 5548.
Nov 29 22:03:21 sandworm systemd-coredump[31914]: [🡕] Process 31906 (nvidia-powerd) of user 0 dumped core.
                                                  
        Module nvidia-powerd without build-id.
        Stack trace of thread 31912:
        #0  0x00007f38c962783c n/a (libc.so.6 + 0x8e83c)
        #1  0x00007f38c95d7668 raise (libc.so.6 + 0x3e668)
        #2  0x00007f38c95bf4b8 abort (libc.so.6 + 0x264b8)
        #3  0x000000000041c6b5 n/a (nvidia-powerd + 0x1c6b5)
        #4  0x000000000041b036 n/a (nvidia-powerd + 0x1b036)
        #5  0x000000000041b071 n/a (nvidia-powerd + 0x1b071)
        #6  0x000000000041af13 n/a (nvidia-powerd + 0x1af13)
        #7  0x000000000040d9ff n/a (nvidia-powerd + 0xd9ff)
        #8  0x000000000040dd5f n/a (nvidia-powerd + 0xdd5f)
        #9  0x0000000000405322 n/a (nvidia-powerd + 0x5322)
        #10 0x00007f38c96259eb n/a (libc.so.6 + 0x8c9eb)
        #11 0x00007f38c96a97cc n/a (libc.so.6 + 0x1107cc)
                      
        Stack trace of thread 31911:
        #0  0x00007f38c98f14c6 n/a (ld-linux-x86-64.so.2 + 0x214c6)
        #1  0x00007f38c98d713b n/a (ld-linux-x86-64.so.2 + 0x713b)
        #2  0x00007f38c98d86b1 n/a (ld-linux-x86-64.so.2 + 0x86b1)
        #3  0x00007f38c98d2715 n/a (ld-linux-x86-64.so.2 + 0x2715)
        #4  0x00007f38c98d14e1 _dl_catch_exception (ld-linux-x86-64.so.2 + 0x14e1)
        #5  0x00007f38c98d2b75 n/a (ld-linux-x86-64.so.2 + 0x2b75)
        #6  0x00007f38c98dc0b1 n/a (ld-linux-x86-64.so.2 + 0xc0b1)
        #7  0x00007f38c98d14e1 _dl_catch_exception (ld-linux-x86-64.so.2 + 0x14e1)
        #8  0x00007f38c98db81a n/a (ld-linux-x86-64.so.2 + 0xb81a)
        #9  0x00007f38c98d14e1 _dl_catch_exception (ld-linux-x86-64.so.2 + 0x14e1)
        #10 0x00007f38c98dbbec n/a (ld-linux-x86-64.so.2 + 0xbbec)
        #11 0x00007f38c96219ec n/a (libc.so.6 + 0x889ec)
        #12 0x00007f38c98d14e1 _dl_catch_exception (ld-linux-x86-64.so.2 + 0x14e1)
        #13 0x00007f38c98d1603 n/a (ld-linux-x86-64.so.2 + 0x1603)
        #14 0x00007f38c96214f7 n/a (libc.so.6 + 0x884f7)
        #15 0x00007f38c9621aa1 dlopen (libc.so.6 + 0x88aa1)
        #16 0x0000000000406eb5 n/a (nvidia-powerd + 0x6eb5)
        #17 0x0000000000406a64 n/a (nvidia-powerd + 0x6a64)
        #18 0x00007f38c96259eb n/a (libc.so.6 + 0x8c9eb)
        #19 0x00007f38c96a97cc n/a (libc.so.6 + 0x1107cc)
                      
        Stack trace of thread 31906:
        #0  0x00007f38c96a53af ioctl (libc.so.6 + 0x10c3af)
        #1  0x0000000000410969 n/a (nvidia-powerd + 0x10969)
        #2  0x0000000000411a72 n/a (nvidia-powerd + 0x11a72)
        #3  0x000000000041296c n/a (nvidia-powerd + 0x1296c)
        #4  0x0000000000403cb7 n/a (nvidia-powerd + 0x3cb7)
        #5  0x0000000000402eca n/a (nvidia-powerd + 0x2eca)
        #6  0x000000000040344c n/a (nvidia-powerd + 0x344c)
        #7  0x0000000000402d1a n/a (nvidia-powerd + 0x2d1a)
        #8  0x000000000040277b n/a (nvidia-powerd + 0x277b)
        #9  0x00007f38c95c0cd0 n/a (libc.so.6 + 0x27cd0)
        #10 0x00007f38c95c0d8a __libc_start_main (libc.so.6 + 0x27d8a)
        #11 0x0000000000402915 n/a (nvidia-powerd + 0x2915)
        ELF object binary architecture: AMD x86-64

░░ Subject: Process 31906 (nvidia-powerd) dumped core
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ Documentation: man:core(5)

And it happened again today after waking the laptop from suspend.
The computer worked for 2 minutes (I was only web browsing, so no nvidia activity) and then nvidia-powerd crashed my laptop again…


░░ Subject: A start job for unit asusd.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit asusd.service has finished successfully.
░░ 
░░ The job identifier is 2872.
Dec 01 19:09:54 sandworm wpa_supplicant[3266]: wlo1: CTRL-EVENT-REGDOM-CHANGE init=DRIVER type=WORLD
Dec 01 19:09:54 sandworm systemd-coredump[4016]: [🡕] Process 3026 (nvidia-powerd) of user 0 dumped core.
                                                 
                                                 Module nvidia-powerd without build-id.
                                                 Stack trace of thread 3991:
                                                 #0  0x00007f9cbc5ac83c n/a (libc.so.6 + 0x8e83c)
                                                 #1  0x00007f9cbc55c668 raise (libc.so.6 + 0x3e668)
                                                 #2  0x00007f9cbc5444b8 abort (libc.so.6 + 0x264b8)
                                                 #3  0x000000000041c6b5 n/a (nvidia-powerd + 0x1c6b5)
                                                 #4  0x000000000041b036 n/a (nvidia-powerd + 0x1b036)
                                                 #5  0x000000000041b071 n/a (nvidia-powerd + 0x1b071)
                                                 #6  0x000000000041af13 n/a (nvidia-powerd + 0x1af13)
                                                 #7  0x000000000040d9ff n/a (nvidia-powerd + 0xd9ff)
                                                 #8  0x000000000040dd5f n/a (nvidia-powerd + 0xdd5f)
                                                 #9  0x0000000000405322 n/a (nvidia-powerd + 0x5322)
                                                 #10 0x00007f9cbc5aa9eb n/a (libc.so.6 + 0x8c9eb)
                                                 #11 0x00007f9cbc62e7cc n/a (libc.so.6 + 0x1107cc)
                                                 
                                                 Stack trace of thread 3990:
                                                 #0  0x00007f9cbc62e7bd n/a (libc.so.6 + 0x1107bd)
                                                 #1  0x00000000012e2c80 n/a (n/a + 0x0)
                                                 ELF object binary architecture: AMD x86-64
░░ Subject: Process 3026 (nvidia-powerd) dumped core
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ Documentation: man:core(5)
░░ 
░░ Process 3026 (nvidia-powerd) crashed and dumped core.
░░ 
░░ This usually indicates a programming error in the crashing program and
░░ should be reported to its vendor as a bug.
Dec 01 19:09:54 sandworm systemd[1]: systemd-coredump@0-4007-0.service: Deactivated successfully.
░░ Subject: Unit succeeded

I don’t even know where to start investigating this…

If the reboot isn’t triggered by you, it’s a firmware issue. If it’s “just” freezing, it’s a kernel or driver issue. The backtrace of nvidia-powerd doesn’t help, as it’s just a user-space process. Are there any kernel backtraces to be found? If so, which nvidia modules are involved?
“cpuid_error” together with the symptoms you describe points to a kernel bug triggering a firmware bug, imho.

For investigating, maybe take xorg out of the equation.

You mean, try wayland?

It currently produces some weird black artifacts on KDE and Sway, and the Hyprland build patched for nvidia loses a considerable amount of gaming performance (about 20%), so while it might work, nvidia needs to put some more love into Wayland here.

As for the crash, there is no Call Trace, no module errors, and no additional firmware/MCU logs here. The laptop just powers off, much like pulling the plug on a server. The last log entry is from nvidia-powerd.
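
One way to double-check the kernel log for traces is to filter it for common crash markers. The sketch below is an assumption-laden helper, not an official tool: the name kernel_trace_lines is made up, and the marker list is a non-exhaustive guess at strings the kernel and the nvidia driver typically print.

```shell
# Sketch only: print the lines of a kernel log (read from stdin) that
# contain typical crash markers. The pattern list is an assumption and
# is not exhaustive. Exits non-zero when nothing matches.
kernel_trace_lines() {
    grep -E 'Call Trace:|BUG:|Oops|Kernel panic|Hardware Error|NVRM: Xid'
}
```

It could be fed the kernel messages of the boot that ended in the power-off, for example journalctl -k -b -1 --no-pager | kernel_trace_lines, which requires a persistent journal. No output means none of the markers were logged before the machine went down.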

I’ve mitigated this by issuing systemctl stop nvidia-powerd when I activate the hotplug that detaches the cores. When the CPU cores get re-attached and nvidia-powerd isn’t running, there is no crash, so the SIGABRT in this software is definitely doing something nasty here.
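
The mitigation could be scripted roughly as follows. This is a sketch under assumptions: the function name and the SYSTEMCTL and CPU_SYSFS_ROOT overrides are hypothetical (they exist for dry runs), and the unit is assumed to be named nvidia-powerd.service. It deliberately leaves the daemon stopped afterwards, since nothing here establishes whether starting it again once the cores are back is safe.

```shell
# Sketch only: stop nvidia-powerd before changing which CPUs are online.
# SYSTEMCTL and CPU_SYSFS_ROOT are hypothetical overrides for dry runs.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# hotplug_with_powerd_stopped STATE FIRST LAST
# STATE is 0 (offline) or 1 (online); FIRST..LAST is the CPU index range.
hotplug_with_powerd_stopped() {
    local state first last n file
    state=$1; first=$2; last=$3
    # Stop the daemon before touching any core; give up if that fails.
    "$SYSTEMCTL" stop nvidia-powerd.service || return 1
    n=$first
    while [ "$n" -le "$last" ]; do
        file="$CPU_SYSFS_ROOT/cpu$n/online"
        if [ -e "$file" ]; then
            echo "$state" > "$file"
        fi
        n=$((n + 1))
    done
    # The daemon is intentionally not restarted here (see note above).
}
```

Run as root, hotplug_with_powerd_stopped 1 2 11 would bring cores 2 to 11 back while the daemon is down.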

My current Xorg conf is:

[nwildner@sandworm xorg.conf.d]$ cat 10-optimus-manager.conf 
Section "Files"
	ModulePath "/usr/lib/nvidia"
	ModulePath "/usr/lib32/nvidia"
	ModulePath "/usr/lib32/nvidia/xorg/modules"
	ModulePath "/usr/lib32/xorg/modules"
	ModulePath "/usr/lib64/nvidia/xorg/modules"
	ModulePath "/usr/lib64/nvidia/xorg"
	ModulePath "/usr/lib64/xorg/modules"
EndSection

Section "ServerLayout"
	Identifier "layout"
	Screen 0 "integrated"
	Inactive "nvidia"
	Option "AllowNVIDIAGPUScreens"
EndSection

Section "Device"
	Identifier "integrated"
	Driver "modesetting"
	BusID "PCI:0:0:2:0"
	Option "DRI" "3"
EndSection

Section "Screen"
	Identifier "integrated"
	Device "integrated"
EndSection

Section "Device"
	Identifier "nvidia"
	Driver "nvidia"
	BusID "PCI:0:1:0:0"
	Option "Coolbits" "28"
EndSection

Section "Screen"
	Identifier "nvidia"
	Device "nvidia"
EndSection

Section "ServerFlags"
	Option "IgnoreABI" "1"
EndSection

I just meant to stop any graphics to maybe catch a kernel backtrace on the text console. Since it powers off directly, this wouldn’t help, I guess.