470.141.03-1 xorg 1.20.13-3 segfault after update

Hi,
I’m using a ThinkPad W541, which has an NVIDIA Quadro K1100M as well as an onboard graphics chip. This laptop runs Arch Linux and I use nvidia-xrun to start an Xorg instance that uses the GPU. This setup worked in the past, but I attempted an update yesterday and have been getting segfaults when running nvidia-xrun ever since; the server fails to start.

I should mention that the update didn’t go smoothly: I aborted it, which deleted many libraries and left the system in an unusable state. I was able to rerun the update from a USB boot, which fixed those issues. I also performed a SMART test and checked dmesg for errors from the SSD, but everything seems fine.

Running Xorg with the intel driver and the onboard chip also works fine. I reinstalled the drivers as well as the xorg-server, but this had no impact.

My Xorg version is 1.20.13-3, the Linux version is 5.19.1, and the driver version is 470.141.03-1.

Here is one of these crashes:

[   157.457] (II) LoadModule: "glx"
[   157.458] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[   157.466] (II) Module glx: vendor="X.Org Foundation"
[   157.466] 	compiled for 1.20.13, module version = 1.0.0
[   157.466] 	ABI class: X.Org Server Extension, version 10.0
[   157.466] (II) LoadModule: "nvidia"
[   157.466] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[   157.476] (EE) 
[   157.476] (EE) Backtrace:
[   157.476] (EE) 0: /usr/lib/Xorg (xorg_backtrace+0x5b) [0x55fa8adcf72b]
[   157.476] (EE) 1: /usr/lib/Xorg (0x55fa8ac89000+0x151385) [0x55fa8adda385]
[   157.476] (EE) 2: /usr/lib/libc.so.6 (0x7ff4910c3000+0x38a40) [0x7ff4910fba40]
[   157.476] (EE) 3: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7ff48fa00000+0xe6060) [0x7ff48fae6060]
[   157.476] (EE) 4: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7ff48fa00000+0x3dc6c) [0x7ff48fa3dc6c]
[   157.476] (EE) 5: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7ff48fa00000+0x4c79f6) [0x7ff48fec79f6]
[   157.476] (EE) 
[   157.476] (EE) Segmentation fault at address 0x7ff48fa3dc00
[   157.476] (EE) 
Fatal server error:
[   157.476] (EE) Caught signal 11 (Segmentation fault). Server aborting
[   157.476] (EE) 
[   157.476] (EE) 

I’d be thankful for any advice on how to fix this issue.

nvidia-bug-report.log.old.gz (920.9 KB)

That’s a low-level issue: at some point during boot the GPU vanishes from the bus and the kernel config, and the driver unloads. Do you have some kind of obscure udev rule? Any systemd unit fiddling with the GPU?
grep 10de /lib/udev/rules.d/*

Do you have some kind of obscure udev rule?

I don’t recall writing any special rules. The grep command does not return anything. But there is one file in the directory, I assume it’s installed by one of the driver packages:

# Make sure device nodes are present even when the DDX is not started for the Wayland/EGLStream case
KERNEL=="nvidia", RUN+="/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $$(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'"
KERNEL=="nvidia", RUN+="/usr/bin/bash -c 'for i in $$(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \  -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia$${i} c $$(grep nvidia-frontend /proc/devices | cut -d \  -f 1) $${i}; done'"
KERNEL=="nvidia_modeset", RUN+="/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidia-modeset c $$(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 254'"
KERNEL=="nvidia_uvm", RUN+="/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidia-uvm c $$(grep nvidia-uvm /proc/devices | cut -d \  -f 1) 0'"
KERNEL=="nvidia_uvm", RUN+="/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidia-uvm-tools c $$(grep nvidia-uvm /proc/devices | cut -d \  -f 1) 1'"
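
Note that the rules above match on the kernel device name (KERNEL=="nvidia") rather than on the PCI vendor ID, and that the grep only covered /lib/udev/rules.d, so an empty result for 10de does not rule out a custom rule elsewhere. A minimal sketch of a wider search, assuming bash and a grep that supports -r; find_udev_matches is a hypothetical helper name, and the directory list is the usual udev layout rather than something confirmed in this thread:

```shell
#!/usr/bin/env bash
# find_udev_matches is a hypothetical helper, not part of any package.
# It prints the names of rule files whose content matches PATTERN.
find_udev_matches() {
    local pattern="$1"; shift
    local dir
    for dir in "$@"; do
        # Skip directories that do not exist on this system.
        [ -d "$dir" ] || continue
        # -r recurse, -i ignore case, -l file names only, -E extended regex.
        grep -rilE -- "$pattern" "$dir" 2>/dev/null || true
    done
    return 0
}

# Search both the admin and the package rule directories (assumed locations)
# for the NVIDIA vendor ID or the driver name.
find_udev_matches '10de|nvidia' /etc/udev/rules.d /usr/lib/udev/rules.d
```

Any file this prints is worth reading by hand; a match only means the file mentions the GPU, not that it removes it from the bus.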

Any systemd unit fiddling with the gpu?

Yes, I think that could be nvidia-xrun-pm.service. I enabled it hoping that it might fix the issue. I have disabled it, rebooted and generated a new log after running nvidia-xrun again.
nvidia-bug-report.log.gz (937.5 KB)

Still gone after about 12 seconds.

I believe this might be the work of nvidia-xrun. It probably disables the graphics card after the server fails, to avoid issues when attempting to use the intel driver.

This time I used startx instead of nvidia-xrun. The card still showed up in lspci -k after the server failed.
nvidia-bug-report.log.gz (975.3 KB)

Odd behaviour of nvidia-xrun. Though the GPU is now there, it’s inaccessible; check nvidia-smi.
Please set kernel parameter
ibt=off
to check if this is some issue with mitigations.
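
After adding the parameter to the bootloader entry and rebooting, it is worth confirming that the running kernel actually received it. A minimal sketch, assuming bash; has_kernel_param is a hypothetical helper name, and the optional second argument exists only so the check can be pointed at a file other than /proc/cmdline:

```shell
#!/usr/bin/env bash
# has_kernel_param is a hypothetical helper, not an existing tool.
# Returns 0 if PARAM appears as a whole word on the kernel command line.
has_kernel_param() {
    local param="$1"
    local cmdline_file="${2:-/proc/cmdline}"
    local line=""
    # Treat an unreadable command-line file as "parameter not set".
    [ -r "$cmdline_file" ] || return 1
    # The command line is a single line; tolerate a missing trailing newline.
    IFS= read -r line < "$cmdline_file" || true
    # Pad with spaces so that only complete, space-separated words match.
    case " $line " in
        *" $param "*) return 0 ;;
    esac
    return 1
}

if has_kernel_param "ibt=off"; then
    echo "ibt=off is active on the running kernel"
else
    echo "ibt=off is not on the running kernel's command line"
fi
```

Matching whole words avoids a false positive from a longer parameter that merely contains the same text.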

nvidia-smi segfaults

Aug 17 23:31:19 odessa systemd-coredump[762]: [🡕] Process 755 (nvidia-smi) of user 1000 dumped core.
                                              
                                              Module linux-vdso.so.1 with build-id 0b50b426a9c09c7a3dcbd9ef2237f69a3a99a80f
                                              Module libm.so.6 with build-id efeea58692a42176201df89f034aa4295a77ce74
                                              Module libcuda.so.1 with build-id 7e71ecb76a4807e6b07773ebefde70b9c92d3d06
                                              Module libnvidia-ml.so.1 without build-id.
                                              Module ld-linux-x86-64.so.2 with build-id 5492655bffbf172ed8a07f285f760ead38f09404
                                              Module librt.so.1 with build-id 837ea1d121976e9fa94acdf79939b387a32531db
                                              Module libc.so.6 with build-id 7d4293a9bbe1f068ab7ae807c2d9377395eb5b41
                                              Module libdl.so.2 with build-id 2b416df8fd62af5dc5e987b11d99a5d0f772b440
                                              Module libpthread.so.0 with build-id b966d4b239433c89ca36c0938381e2d92cd47639
                                              Module nvidia-smi without build-id.
                                              Stack trace of thread 755:
                                              #0  0x00007f9c87e3cae0 n/a (libcuda.so.1 + 0x43cae0)
                                              #1  0x00007f9c89959917 __cxa_finalize (libc.so.6 + 0x3a917)
                                              #2  0x00007f9c87b46a16 n/a (libcuda.so.1 + 0x146a16)
                                              #3  0x00007f9c87ef184d n/a (libcuda.so.1 + 0x4f184d)
                                              #4  0x00007f9c89b5605a n/a (ld-linux-x86-64.so.2 + 0x205a)
                                              #5  0x00007f9c89a6ed4e _dl_catch_exception (libc.so.6 + 0x14fd4e)
                                              #6  0x00007f9c89a6ee03 _dl_catch_error (libc.so.6 + 0x14fe03)
                                              #7  0x00007f9c899a133f n/a (libc.so.6 + 0x8233f)
                                              #8  0x00007f9c899a1096 dlclose (libc.so.6 + 0x82096)
                                              #9  0x00007f9c892a7489 n/a (libnvidia-ml.so.1 + 0xa7489)
                                              #10 0x00007f9c8921f799 n/a (libnvidia-ml.so.1 + 0x1f799)
                                              #11 0x00007f9c89237d28 nvmlSystemGetCudaDriverVersion_v2 (libnvidia-ml.so.1 + 0x37d28)
                                              #12 0x00000000004112c8 n/a (nvidia-smi + 0x112c8)
                                              #13 0x0000000000412416 n/a (nvidia-smi + 0x12416)
                                              #14 0x0000000000406dd7 n/a (nvidia-smi + 0x6dd7)
                                              #15 0x00007f9c899422d0 n/a (libc.so.6 + 0x232d0)
                                              #16 0x00007f9c8994238a __libc_start_main (libc.so.6 + 0x2338a)
                                              #17 0x00000000004074bd n/a (nvidia-smi + 0x74bd)
                                              ELF object binary architecture: AMD x86-64

I disabled ibt but it seems to have no impact.
nvidia-bug-report.log.gz (974.4 KB)

I was able to make some progress and get the system working again.
Prior to the update my kernel version was 5.18.6 and the driver version was 470.129.06-1. This combination still runs on my system.
However, starting with 5.19 the dkms install seems to fail. There seems to be some kind of incompatibility with the newer kernel versions. Additionally, version 470.141.03-1 of the driver causes segfaults on both 5.18.6 and 5.19.1. So in my case: I downgraded the kernel to 5.18.16 and the driver to 470.129.06-1.

I am unsure if I should mark this as a solution since the latest driver version still causes segfaults on my system.

IDK, there seems to be something wrong with the build process; the make.log contains tens of thousands of warnings like
/var/lib/dkms/nvidia/470.141.03/build/nvidia-modeset.o: warning: objtool: _nv002593kms+0x16f: 'naked' return found in RETHUNK build
Maybe also check with the Arch maintainers.
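
To get a quick picture of how noisy a DKMS build was, the objtool warnings can be counted and grouped by message. A rough sketch, assuming bash and that make.log sits in the build directory shown in the warning above; both function names are hypothetical:

```shell
#!/usr/bin/env bash
# Both helpers are hypothetical; they only read a DKMS make.log.

# Print the number of objtool warning lines (0 if the log is unreadable).
count_objtool_warnings() {
    local log="$1"
    [ -r "$log" ] || { echo 0; return 0; }
    # grep -c prints the count; it exits non-zero when the count is 0.
    grep -c 'warning: objtool:' "$log" || true
}

# Group warnings by message text, most frequent first.
summarize_objtool_warnings() {
    local log="$1"
    [ -r "$log" ] || return 0
    # Drop the object file and the symbol+offset, keep only the message.
    sed -n 's/.*warning: objtool: [^:]*: //p' "$log" | sort | uniq -c | sort -rn
}

# Assumed log location, derived from the path in the warning above.
log="/var/lib/dkms/nvidia/470.141.03/build/make.log"
echo "objtool warnings: $(count_objtool_warnings "$log")"
summarize_objtool_warnings "$log"
```

A large count on its own does not prove the module is broken; it is a hint that the build is worth repeating from a clean state.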


Thanks for pointing that out. I deleted the old build files and the cached data and rebuilt the driver packages. My system got corrupted during the last update, so I had probably been using a faulty build ever since. I am now able to use the latest version of the driver together with the latest version of the Linux kernel.
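
The exact commands are not listed above. For anyone landing here with the same symptoms, one way to force a clean rebuild of just the kernel module is sketched below, assuming bash and the dkms CLI. This is not necessarily what was done in this thread, it does not cover rebuilding the driver packages themselves, and dkms_clean_rebuild is a hypothetical helper name. It only prints the commands unless DRY_RUN=0 is set:

```shell
#!/usr/bin/env bash
# dkms_clean_rebuild is a hypothetical helper. By default it prints the
# dkms commands instead of running them; set DRY_RUN=0 (as root) to run them.
dkms_clean_rebuild() {
    local module="$1"
    local version="$2"
    local kernel="${3:-$(uname -r)}"
    local run="echo"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run=""
    fi
    # Remove every built copy of the module, including stale build output.
    $run dkms remove "$module/$version" --all
    # Build and install again for the selected kernel.
    $run dkms install "$module/$version" -k "$kernel"
    # Show the resulting state for a final check.
    $run dkms status "$module/$version"
}

# Module name and version taken from the make.log path earlier in the thread.
dkms_clean_rebuild nvidia 470.141.03
```

Keeping the dry run as the default makes it easy to review the three commands before touching the installed module.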
