Hi, I have two RTX 4090s installed in my workstation. One GPU is installed directly on the motherboard and the other is connected through a PCIe 4.0 x16 extension cable, because my motherboard doesn’t have enough space to fit both cards directly, although it does have multiple PCIe slots. Since then, my system (Ubuntu 20.04) crashes when I use DP (DataParallel) in PyTorch, but DDP (DistributedDataParallel) works fine.
Recently, I updated my driver to 525.105.17 because my Linux kernel was automatically updated. Since that upgrade, the system has crashed multiple times while idle, without me doing anything. I found the following messages in /var/log/syslog, and some of them are related to the GPU/PCI:
(base) fgldlb@fgldlb:~$ cat /var/log/syslog | grep 21:0 | grep failed
Apr 1 21:06:31 localhost systemd-udevd[743]: controlC1: Process '/usr/sbin/alsactl -E HOME=/run/alsa -E XDG_RUNTIME_DIR=/run/alsa/runtime restore 1' failed with exit code 99.
Apr 1 21:06:31 localhost kernel: [ 1.914357] pci 0000:01:00.0: BAR 7: failed to assign [mem size 0x00100000 64bit]
Apr 1 21:06:31 localhost kernel: [ 1.914362] pci 0000:01:00.0: BAR 10: failed to assign [mem size 0x00100000 64bit]
Apr 1 21:06:31 localhost kernel: [ 1.914368] pci 0000:01:00.1: BAR 7: failed to assign [mem size 0x00100000 64bit]
Apr 1 21:06:31 localhost kernel: [ 1.914372] pci 0000:01:00.1: BAR 10: failed to assign [mem size 0x00100000 64bit]
Apr 1 21:06:31 localhost kernel: [ 6.488626] nvidia: module verification failed: signature and/or required key missing - tainting kernel
Apr 1 21:06:31 localhost systemd[1]: Starting GRUB failed boot detection...
Apr 1 21:06:31 localhost systemd[1]: Finished GRUB failed boot detection.
Apr 1 21:06:31 localhost udisksd[1247]: failed to load module mdraid: libbd_mdraid.so.2: cannot open shared object file: No such file or directory
Apr 1 21:06:31 localhost frpc[1297]: 2023/04/01 21:06:31 #033[1;33m[W] [service.go:105] login to server failed: dial tcp: lookup my.url on 127.0.0.53:53: server misbehaving#033[0m
Apr 1 21:06:31 localhost colord[1371]: failed to get edid data: EDID length is too small
Apr 1 21:06:31 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:31 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:31 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:31 localhost colord[1371]: message repeated 5 times: [ failed to get session [pid 1212]: No data available]
Apr 1 21:06:31 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:32 localhost colord[1371]: message repeated 17 times: [ failed to get session [pid 1212]: No data available]
Apr 1 21:06:32 localhost NetworkManager[1215]: <warn> [1680354392.2972] Error: failed to open /run/network/ifstate
Apr 1 21:06:36 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:39 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:39 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:41 localhost indicator-keybo[1880]: gtk_icon_theme_get_for_screen: assertion 'GDK_IS_SCREEN (screen)' failed
Apr 1 21:06:42 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:43 localhost colord[1371]: failed to get session [pid 1212]: No data available
Apr 1 21:06:43 localhost colord[1371]: message repeated 2 times: [ failed to get session [pid 1212]: No data available]
Apr 1 21:07:01 localhost pulseaudio[1575]: GetManagedObjects() failed: org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.bluez': timed out (service_start_timeout=25000ms)
Apr 1 21:08:29 localhost pulseaudio[2003]: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
(base) fgldlb@fgldlb:~$ cat /var/log/syslog | grep 21:0 | grep Error
Apr 1 21:06:31 localhost kernel: [ 1.957768] ERST: Error Record Serialization Table (ERST) support is initialized.
Apr 1 21:06:31 localhost kernel: [ 2.239363] RAS: Correctable Errors collector initialized.
Apr 1 21:06:32 localhost NetworkManager[1215]: <warn> [1680354392.2972] Error: failed to open /run/network/ifstate
Apr 1 21:06:41 localhost at-spi-bus-laun[1853]: Failed to register client: GDBus.Error:org.freedesktop.DBus.Error.UnknownMethod: No such method “RegisterClient”
Apr 1 21:06:41 localhost indicator-sound[1882]: media-player-list-greeter.vala:51: Unable to get active entry: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name com.canonical.UnityGreeter was not provided by any .service files
Apr 1 21:07:01 localhost pulseaudio[1575]: GetManagedObjects() failed: org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.bluez': timed out (service_start_timeout=25000ms)
Apr 1 21:08:29 localhost pulseaudio[2003]: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
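Most of the lines above come from unrelated services (colord, pulseaudio, NetworkManager), so a narrower filter may be easier to read. Below is a minimal sketch, assuming that the relevant entries are kernel messages mentioning a PCI device, the nvidia module, NVRM, or Xid. The pattern list and the names `GPU_PATTERN`, `filter_gpu_lines`, and `print_gpu_lines` are my own assumptions for illustration, not something taken from the driver documentation.

```python
import re

# Assumed pattern: kernel log lines that mention a PCI device address,
# the nvidia module, NVRM, or Xid. Adjust to whatever your logs contain.
GPU_PATTERN = re.compile(r"kernel: .*(pci \d{4}:|NVRM|nvidia|Xid)", re.IGNORECASE)


def filter_gpu_lines(lines):
    """Return only the lines that look like GPU- or PCI-related kernel messages."""
    return [line.rstrip("\n") for line in lines if GPU_PATTERN.search(line)]


def print_gpu_lines(path="/var/log/syslog"):
    """Print the matching lines from a syslog-style file (hypothetical helper)."""
    # errors="replace" avoids a crash if the log contains undecodable bytes.
    with open(path, errors="replace") as log_file:
        for line in filter_gpu_lines(log_file):
            print(line)
```

Calling `print_gpu_lines()` would then show only the kernel lines such as the `BAR 7: failed to assign` and `nvidia: module verification failed` entries, which seem to be the ones worth looking at first.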
I’m not sure what the problem is. Is it the PCIe extension cable? The wrong driver or system version? A compatibility issue between my motherboard and the GPUs? And how can I fix it?
Here are my logs:
nvidia-bug-report.log.gz (1.1 MB)
nvidia-uninstall.log (1.9 KB)
nvidia-installer.log (35.6 KB)
Thanks.