Previously, all four GPUs worked, and the computer could boot into a desktop while using the Nvidia drivers. I’m not sure what changed to cause the problem. In an effort to fix the issue, I’ve reinstalled Linux a few times, first with Linux Mint 20 and now with Ubuntu 20.04.
Booting brings up a mostly blank screen with a few lines like:
[ 6.735192] EDAC skx: ECC is disabled on imc 0
[ 6.807128] EDAC skx: ECC is disabled on imc 0
But that message doesn’t seem to be relevant.
I can reach a terminal with Alt + F6, with safe mode from GRUB, and via SSH. After running sudo prime-select intel, the computer boots into the GUI successfully, but it does not when using the Nvidia drivers. It currently has nvidia-driver-440 installed, and I think that 435 and 390 had the same issue.
The display is connected with an HDMI cable to the top GPU. The fans for all four GPUs are clearly spinning, but only three show up:
$ nvidia-smi
Sat Jun 20 11:53:56 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:19:00.0 Off |                  N/A |
| 30%   33C    P0    66W / 300W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 30%   34C    P0    59W / 300W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:67:00.0 Off |                  N/A |
| 74%   33C    P0    62W / 300W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
All four GPUs show up in lspci:
$ lspci | grep -i vga
19:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
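To pin down which card is missing, the two listings can be compared by PCI bus address. The following is only a minimal Python sketch, assuming the lspci and nvidia-smi outputs have been saved as text; the function name missing_from_smi and the variable names are hypothetical, and the sample data is copied from the output above.

```python
import re

# Bus address at the start of an lspci VGA line, e.g. "19:00.0 VGA ...".
# An optional PCI domain prefix such as "0000:" is tolerated.
LSPCI_RE = re.compile(
    r"^(?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+VGA",
    re.IGNORECASE | re.MULTILINE,
)
# Bus-Id column of nvidia-smi, e.g. "00000000:19:00.0".
SMI_RE = re.compile(
    r"\b[0-9a-f]{8}:([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\b",
    re.IGNORECASE,
)


def missing_from_smi(lspci_text, smi_text):
    """Return bus addresses that lspci lists as VGA devices but nvidia-smi omits."""
    in_lspci = {addr.lower() for addr in LSPCI_RE.findall(lspci_text)}
    in_smi = {addr.lower() for addr in SMI_RE.findall(smi_text)}
    return sorted(in_lspci - in_smi)


# Sample data taken from the post (only the lines that carry bus addresses).
lspci_out = "\n".join([
    "19:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)",
    "1a:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)",
    "67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)",
    "68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)",
])
smi_out = "\n".join([
    "| 0 GeForce RTX 208... Off | 00000000:19:00.0 Off | N/A |",
    "| 1 GeForce RTX 208... Off | 00000000:1A:00.0 Off | N/A |",
    "| 2 GeForce RTX 208... Off | 00000000:67:00.0 Off | N/A |",
])

print(missing_from_smi(lspci_out, smi_out))  # prints ['68:00.0']
```

With the output pasted above, the card at 68:00.0 is the one nvidia-smi does not report.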
Here’s some additional debugging info. lsmod:
$ lsmod | grep -i nvidia
nvidia_uvm 970752 0
nvidia_drm 49152 0
nvidia_modeset 1114112 1 nvidia_drm
nvidia 20430848 2 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 1 nvidia_drm
ipmi_msghandler 106496 2 ipmi_devintf,nvidia
drm 491520 3 drm_kms_helper,nvidia_drm
i2c_nvidia_gpu 16384 0
Just for clarity:
$ prime-select query
nvidia
And modprobe lines:
$ grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/lib/modprobe.d/nvidia-kms.conf:# This file was generated by nvidia-prime
/lib/modprobe.d/nvidia-kms.conf:options nvidia-drm modeset=0
And dmesg:
$ dmesg | grep -i nvidia
[ 3.349400] nvidia: loading out-of-tree module taints kernel.
[ 3.349405] nvidia: module license 'NVIDIA' taints kernel.
[ 3.395822] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[ 3.396327] nvidia 0000:19:00.0: enabling device (0100 -> 0103)
[ 3.396405] nvidia 0000:19:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 3.495800] nvidia 0000:1a:00.0: enabling device (0100 -> 0103)
[ 3.495833] nvidia 0000:1a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 3.542828] audit: type=1400 audit(1592678884.870:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=769 comm="apparmor_parser"
[ 3.542830] audit: type=1400 audit(1592678884.870:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=769 comm="apparmor_parser"
[ 3.596459] nvidia 0000:67:00.0: enabling device (0100 -> 0103)
[ 3.596489] nvidia 0000:67:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 3.695708] nvidia 0000:68:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 3.795123] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.64 Fri Feb 21 01:17:26 UTC 2020
[ 3.818174] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.64 Fri Feb 21 00:43:19 UTC 2020
[ 3.821329] [drm] [nvidia-drm] [GPU ID 0x00001900] Loading driver
[ 3.821331] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:19:00.0 on minor 0
[ 3.821384] [drm] [nvidia-drm] [GPU ID 0x00001a00] Loading driver
[ 3.821385] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:1a:00.0 on minor 1
[ 3.821443] [drm] [nvidia-drm] [GPU ID 0x00006700] Loading driver
[ 3.821444] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:67:00.0 on minor 2
[ 3.821487] [drm] [nvidia-drm] [GPU ID 0x00006800] Loading driver
[ 3.821488] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:68:00.0 on minor 3
[ 3.845029] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 5.915496] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:08.0/0000:19:00.1/sound/card1/input32
[ 5.915536] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:08.0/0000:19:00.1/sound/card1/input33
[ 5.915578] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:08.0/0000:19:00.1/sound/card1/input34
[ 5.915630] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:08.0/0000:19:00.1/sound/card1/input35
[ 5.915670] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.1/sound/card2/input28
[ 5.915829] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.1/sound/card2/input29
[ 5.915866] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.1/sound/card4/input24
[ 5.915887] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.1/sound/card2/input30
[ 5.915934] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.1/sound/card2/input31
[ 5.916000] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.1/sound/card4/input25
[ 5.916041] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.1/sound/card4/input26
[ 5.916111] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.1/sound/card4/input27
[ 5.916133] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:08.0/0000:67:00.1/sound/card3/input36
[ 5.916238] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:08.0/0000:67:00.1/sound/card3/input37
[ 5.916265] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:08.0/0000:67:00.1/sound/card3/input38
[ 5.916302] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:08.0/0000:67:00.1/sound/card3/input39
[ 5.951073] caller os_map_kernel_space.part.0+0x73/0x80 [nvidia] mapping multiple BARs
And dmesg, filtering for NVRM:
$ dmesg | grep NVRM
[ 3.795123] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.64 Fri Feb 21 01:17:26 UTC 2020
[ 6.322606] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[ 6.322631] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3
[ 324.486104] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[ 324.486140] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3
[ 354.688266] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 354.688299] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3
And the nearby messages in dmesg seem potentially relevant:
[ 350.769617] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
[ 350.769745] caller os_map_kernel_space.part.0+0x73/0x80 [nvidia] mapping multiple BARs
[ 354.688266] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[ 354.688299] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3
Does the “device minor number 3” mean that it’s probably the fourth GPU that’s not showing up in nvidia-smi?
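The NVRM failure lines carry both a PCI address and a minor number, so the two can be read off together rather than inferred from ordering. Below is a rough Python sketch that tallies those failures from saved dmesg text; the function name failed_adapters is hypothetical, and the regular expression only assumes the message format seen in the log lines above.

```python
import re
from collections import Counter

# Matches lines such as:
#   NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3
FAIL_RE = re.compile(
    r"NVRM: GPU ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"rm_init_adapter failed, device minor number (\d+)",
    re.IGNORECASE,
)


def failed_adapters(dmesg_text):
    """Count rm_init_adapter failures per (PCI address, minor number) pair."""
    return Counter(
        (addr.lower(), int(minor)) for addr, minor in FAIL_RE.findall(dmesg_text)
    )


# Sample lines copied from the dmesg output in this post.
dmesg_out = "\n".join([
    "[ 6.322631] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3",
    "[ 324.486140] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3",
    "[ 354.688299] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3",
])

for (addr, minor), count in failed_adapters(dmesg_out).items():
    print(f"{addr} (minor {minor}): {count} failed init attempts")
```

In this log every failure names 0000:68:00.0, which is also the bus address that lspci lists but nvidia-smi does not. That supports reading minor number 3 as the missing card here, though the sketch only reports what the log says and does not establish how minor numbers are assigned in general.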
I’ve attached an nvidia-bug-report.log file. Any help or debugging tips would be appreciated! I don’t need to be able to boot into a GUI, but I do want to be able to use all four GPUs…