Hi,
I’m using a ThinkPad W541, which has an NVIDIA Quadro K1100M as well as an onboard graphics chip. This laptop runs Arch Linux, and I use nvidia-xrun to start an Xorg instance that uses the GPU. This setup worked in the past, but I attempted an update yesterday and have been getting segfaults whenever I run nvidia-xrun since; the server fails to start.
I should mention that the update didn’t go smoothly: I aborted it partway, which deleted many libraries and left the system in an unusable state. I was able to rerun the update from a USB boot, which fixed those issues. I also ran a SMART test and checked dmesg for errors from the SSD, but everything seems fine.
Running Xorg with the intel driver and the onboard chip also works fine. I reinstalled the drivers as well as the xorg-server, but this had no impact.
My Xorg version is 1.20.13-3, the Linux version is 5.19.1 and the driver version is 470.141.03-1.
Here is one of these crashes:
[ 157.457] (II) LoadModule: "glx"
[ 157.458] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[ 157.466] (II) Module glx: vendor="X.Org Foundation"
[ 157.466] compiled for 1.20.13, module version = 1.0.0
[ 157.466] ABI class: X.Org Server Extension, version 10.0
[ 157.466] (II) LoadModule: "nvidia"
[ 157.466] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[ 157.476] (EE)
[ 157.476] (EE) Backtrace:
[ 157.476] (EE) 0: /usr/lib/Xorg (xorg_backtrace+0x5b) [0x55fa8adcf72b]
[ 157.476] (EE) 1: /usr/lib/Xorg (0x55fa8ac89000+0x151385) [0x55fa8adda385]
[ 157.476] (EE) 2: /usr/lib/libc.so.6 (0x7ff4910c3000+0x38a40) [0x7ff4910fba40]
[ 157.476] (EE) 3: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7ff48fa00000+0xe6060) [0x7ff48fae6060]
[ 157.476] (EE) 4: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7ff48fa00000+0x3dc6c) [0x7ff48fa3dc6c]
[ 157.476] (EE) 5: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7ff48fa00000+0x4c79f6) [0x7ff48fec79f6]
[ 157.476] (EE)
[ 157.476] (EE) Segmentation fault at address 0x7ff48fa3dc00
[ 157.476] (EE)
Fatal server error:
[ 157.476] (EE) Caught signal 11 (Segmentation fault). Server aborting
[ 157.476] (EE)
[ 157.476] (EE)
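For anyone reading a backtrace like this, a quick way to see which object most frames point into is to count them per file. This is only a rough sketch; the function name backtrace_modules is made up for illustration, and it assumes frame lines have the same shape as in the log above.

```shell
# Print each distinct object file named in the (EE) backtrace frames of an
# Xorg log, preceded by the number of frames that point into it.
# Frame lines are assumed to look like:
#   [ 157.476] (EE) 3: /path/to/module.so (0x...+0x...) [0x...]
backtrace_modules() {
    grep -E '\(EE\) [0-9]+: /' "$1" \
        | sed -E 's/.*\(EE\) [0-9]+: ([^ ]+) .*/\1/' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}
```

Run it as backtrace_modules followed by the path of the Xorg log in question. For the crash above, the top entry would be nvidia_drv.so, which matches the frames closest to the fault.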
I’d be thankful for any advice on how to fix this issue.
That’s a low-level issue: at some point during boot the GPU vanishes from the bus and the kernel config, and the driver unloads. Do you have some kind of obscure udev rule? Any systemd unit fiddling with the GPU?
grep 10de /lib/udev/rules.d/*
I don’t recall writing any special rules. The grep command does not return anything. But there is one file in the directory, I assume it’s installed by one of the driver packages:
# Make sure device nodes are present even when the DDX is not started for the Wayland/EGLStream case
KERNEL=="nvidia", RUN+="/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $$(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'"
KERNEL=="nvidia", RUN+="/usr/bin/bash -c 'for i in $$(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \ -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia$${i} c $$(grep nvidia-frontend /proc/devices | cut -d \ -f 1) $${i}; done'"
KERNEL=="nvidia_modeset", RUN+="/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidia-modeset c $$(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 254'"
KERNEL=="nvidia_uvm", RUN+="/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidia-uvm c $$(grep nvidia-uvm /proc/devices | cut -d \ -f 1) 0'"
KERNEL=="nvidia_uvm", RUN+="/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidia-uvm-tools c $$(grep nvidia-uvm /proc/devices | cut -d \ -f 1) 1'"
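Note that a grep for 10de (NVIDIA's PCI vendor ID) will not find rules like the ones above, because they match on the kernel device name instead. A slightly wider search could look like the sketch below. The function name find_gpu_rules is hypothetical, and which directories to pass is an assumption; /lib/udev/rules.d and /etc/udev/rules.d are the usual locations.

```shell
# List files under the given directories that mention the PCI vendor ID 10de
# or the string "nvidia" (case-insensitive). Missing directories are skipped.
find_gpu_rules() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        grep -rilE '10de|nvidia' "$dir" 2>/dev/null
    done
    return 0
}
```

For example: find_gpu_rules /lib/udev/rules.d /etc/udev/rules.d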
Any systemd unit fiddling with the gpu?
Yes, I think that could be nvidia-xrun-pm.service. I enabled it hoping that it might fix the issue. I have disabled it, rebooted and generated a new log after running nvidia-xrun again. nvidia-bug-report.log.gz (937.5 KB)
I believe this might be the work of nvidia-xrun. It probably disables the graphics card after the server fails, to avoid issues when attempting to use the intel driver.
This time I used startx instead of nvidia-xrun, and the card still showed up in lspci -k after the server failed. nvidia-bug-report.log.gz (975.3 KB)
Odd behaviour of nvidia-xrun. Though the GPU is now there, it’s inaccessible; check nvidia-smi.
Please set kernel parameter
ibt=off
to check if this is some issue with mitigations.
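How the parameter gets added depends on the bootloader, so that part is left out here. After rebooting, it is worth confirming that it actually reached the kernel. A minimal sketch, with a made-up helper name has_kernel_param, that reads /proc/cmdline by default:

```shell
# Succeed if the parameter ($1) appears as a complete word in a kernel
# command line file ($2, defaulting to /proc/cmdline).
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example: has_kernel_param ibt=off && echo "ibt=off is active"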
I was able to make some progress and get the system working again.
Prior to the update my kernel version was 5.18.6 and the driver version was 470.129.06-1. This combination still runs on my system.
However, starting with 5.19 the dkms install seems to fail. There seems to be some kind of incompatibility with the newer kernel versions. Additionally, version 470.141.03-1 of the driver causes segfaults on both 5.18.6 and 5.19.1. So in my case: I downgraded the kernel to 5.18.16 and the driver to 470.129.06-1.
I am unsure if I should mark this as a solution since the latest driver version still causes segfaults on my system.
IDK, there seems to be something wrong with the build process; the make.log contains tens of thousands of warnings like this one:
/var/lib/dkms/nvidia/470.141.03/build/nvidia-modeset.o: warning: objtool: _nv002593kms+0x16f: 'naked' return found in RETHUNK build
Maybe also check with the Arch maintainers.
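To get a feel for how widespread these warnings are, the log can be summarised per object file. This is just a sketch: the function name objtool_warning_summary is hypothetical, and it assumes each warning line starts with the object path followed by ": warning: objtool:", as in the line quoted above.

```shell
# Summarise objtool warnings in a build log: the total count first, then
# one line per object file with the number of warnings attributed to it.
objtool_warning_summary() {
    printf 'total: %s\n' "$(grep -c 'warning: objtool:' "$1")"
    grep 'warning: objtool:' "$1" \
        | sed -E 's/^([^:]+): warning: objtool:.*/\1/' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}
```

Run it as objtool_warning_summary followed by the path to the make.log; the log presumably sits in the same build directory as the object file named in the warning, but that location is an assumption.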
Thanks for pointing that out. I deleted the old build files and the cached data and rebuilt the driver packages. I probably had a faulty build since my system got corrupted during the last update and was using that faulty build. I am now able to use the latest version of the driver together with the latest version of the linux kernel.