Blank screen on boot and one of four GPUs missing in nvidia-smi

Previously, all four GPUs worked and the computer booted into a desktop while using the Nvidia drivers. I’m not sure what changed to cause the problem. In an effort to fix it, I’ve reinstalled Linux a few times, including Linux Mint 20 and now Ubuntu 20.04.

Booting brings up a mostly blank screen with a few lines like:

[   6.735192] EDAC skx: ECC is disabled on imc 0
[   6.807128] EDAC skx: ECC is disabled on imc 0

But that message doesn’t seem to be relevant.

I can reach a terminal with Alt + F6, with safe mode from GRUB, and via SSH. After running sudo prime-select intel, the computer boots into the GUI successfully, but it won’t while using the Nvidia drivers. It currently has nvidia-driver-440 installed, and I believe 435 and 390 had the same issue.
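
(For reference, these are the prime-select commands used to switch back and forth, listed here in case the exact invocation matters:)

$ sudo prime-select intel    # switch to the integrated GPU; the GUI boots fine afterwards
$ sudo prime-select nvidia   # switch back to the Nvidia driver; this brings back the blank screen
$ prime-select query         # confirm which profile is currently selected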

The display is connected to the top GPU with an HDMI cable. The fans on all four GPUs are clearly spinning, but only three of them show up in nvidia-smi:

$ nvidia-smi 
Sat Jun 20 11:53:56 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:19:00.0 Off |                  N/A |
| 30%   33C    P0    66W / 300W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 30%   34C    P0    59W / 300W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:67:00.0 Off |                  N/A |
| 74%   33C    P0    62W / 300W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

All four GPUs show up in lspci:

$ lspci | grep -i vga
19:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
67:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] (rev a1)

Here’s some additional debugging info, starting with lsmod:

$ lsmod | grep -i nvidia
nvidia_uvm            970752  0
nvidia_drm             49152  0
nvidia_modeset       1114112  1 nvidia_drm
nvidia              20430848  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  1 nvidia_drm
ipmi_msghandler       106496  2 ipmi_devintf,nvidia
drm                   491520  3 drm_kms_helper,nvidia_drm
i2c_nvidia_gpu         16384  0

Just for clarity:

$ prime-select query
nvidia

And the modprobe configuration:

$ grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/lib/modprobe.d/nvidia-kms.conf:# This file was generated by nvidia-prime
/lib/modprobe.d/nvidia-kms.conf:options nvidia-drm modeset=0

And dmesg:

$ dmesg | grep -i nvidia
[    3.349400] nvidia: loading out-of-tree module taints kernel.
[    3.349405] nvidia: module license 'NVIDIA' taints kernel.
[    3.395822] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[    3.396327] nvidia 0000:19:00.0: enabling device (0100 -> 0103)
[    3.396405] nvidia 0000:19:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    3.495800] nvidia 0000:1a:00.0: enabling device (0100 -> 0103)
[    3.495833] nvidia 0000:1a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    3.542828] audit: type=1400 audit(1592678884.870:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=769 comm="apparmor_parser"
[    3.542830] audit: type=1400 audit(1592678884.870:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=769 comm="apparmor_parser"
[    3.596459] nvidia 0000:67:00.0: enabling device (0100 -> 0103)
[    3.596489] nvidia 0000:67:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    3.695708] nvidia 0000:68:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    3.795123] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.64  Fri Feb 21 01:17:26 UTC 2020
[    3.818174] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.64  Fri Feb 21 00:43:19 UTC 2020
[    3.821329] [drm] [nvidia-drm] [GPU ID 0x00001900] Loading driver
[    3.821331] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:19:00.0 on minor 0
[    3.821384] [drm] [nvidia-drm] [GPU ID 0x00001a00] Loading driver
[    3.821385] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:1a:00.0 on minor 1
[    3.821443] [drm] [nvidia-drm] [GPU ID 0x00006700] Loading driver
[    3.821444] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:67:00.0 on minor 2
[    3.821487] [drm] [nvidia-drm] [GPU ID 0x00006800] Loading driver
[    3.821488] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:68:00.0 on minor 3
[    3.845029] nvidia-uvm: Loaded the UVM driver, major device number 510.
[    5.915496] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:08.0/0000:19:00.1/sound/card1/input32
[    5.915536] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:08.0/0000:19:00.1/sound/card1/input33
[    5.915578] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:08.0/0000:19:00.1/sound/card1/input34
[    5.915630] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:08.0/0000:19:00.1/sound/card1/input35
[    5.915670] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.1/sound/card2/input28
[    5.915829] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.1/sound/card2/input29
[    5.915866] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.1/sound/card4/input24
[    5.915887] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.1/sound/card2/input30
[    5.915934] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:16/0000:16:00.0/0000:17:00.0/0000:18:10.0/0000:1a:00.1/sound/card2/input31
[    5.916000] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.1/sound/card4/input25
[    5.916041] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.1/sound/card4/input26
[    5.916111] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:10.0/0000:68:00.1/sound/card4/input27
[    5.916133] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:08.0/0000:67:00.1/sound/card3/input36
[    5.916238] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:08.0/0000:67:00.1/sound/card3/input37
[    5.916265] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:08.0/0000:67:00.1/sound/card3/input38
[    5.916302] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:08.0/0000:67:00.1/sound/card3/input39
[    5.951073] caller os_map_kernel_space.part.0+0x73/0x80 [nvidia] mapping multiple BARs

And dmesg filtered for NVRM:

$ dmesg | grep NVRM
[    3.795123] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.64  Fri Feb 21 01:17:26 UTC 2020
[    6.322606] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[    6.322631] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3
[  324.486104] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[  324.486140] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3
[  354.688266] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  354.688299] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3

And the nearby messages in dmesg seem potentially relevant:

[  350.769617] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
[  350.769745] caller os_map_kernel_space.part.0+0x73/0x80 [nvidia] mapping multiple BARs
[  354.688266] NVRM: GPU 0000:68:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  354.688299] NVRM: GPU 0000:68:00.0: rm_init_adapter failed, device minor number 3

Does the “device minor number 3” mean that it’s probably the fourth GPU that’s not showing up in nvidia-smi?
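
(Looking at the nvidia-drm lines in dmesg above, minor 3 seems to map to bus ID 0000:68:00.0, which is also the GPU that fails RmInitAdapter. Commands like these should confirm the mapping; listing them here in case they’re useful:)

$ dmesg | grep -i "nvidia-drm.*minor"                     # "on minor N" lines pair minor numbers with bus IDs
$ ls /proc/driver/nvidia/gpus/                            # one directory per detected GPU, named by PCI bus ID
$ nvidia-smi --query-gpu=index,pci.bus_id --format=csv    # indices of the GPUs that did initialize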

I’ve attached an nvidia-bug-report.log file. Any help or debugging tips would be appreciated! I don’t need to be able to boot into a GUI, but I do want to be able to use all four GPUs…

nvidia-bug-report.log (2.2 MB)
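
(The log was generated with the nvidia-bug-report.sh script that ships with the driver, roughly like this:)

$ sudo nvidia-bug-report.sh    # writes nvidia-bug-report.log.gz in the current directory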

Update: After removing the bottom GPU, only two GPUs showed up in nvidia-smi:

$ nvidia-smi
Sat Jun 20 14:48:59 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 31%   36C    P0    60W / 300W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:67:00.0 Off |                  N/A |
| 32%   35C    P0    62W / 300W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

So, apparently GPU ID 0x00001900 is the bottom PCI slot, and GPU ID 0x00006800 is the slot that had the bad GPU.
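
(A note for anyone doing the same slot-by-slot elimination: querying each GPU’s UUID and bus ID with nvidia-smi should make it easier to track which physical card is which between swaps; the serial field may read N/A on GeForce boards:)

$ nvidia-smi --query-gpu=index,name,pci.bus_id,uuid,serial --format=csv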

After taking out the top GPU and moving the GPU that used to be in the bottom slot into the top slot, all three remaining GPUs appear in nvidia-smi, dmesg no longer has any NVRM “RmInitAdapter failed!” messages, and the GUI comes up!

Does this mean that I should send the bad GPU back and ask for a replacement, or is there something else I should try first?

Yes, send it back to your vendor, it’s broken.
The “RmInitAdapter failed” message means the GPU couldn’t be initialized. Reseat it and check whether it works in another system; if not, have it replaced, because it’s broken.

Thanks! Will do. The bad GPU also fails in the bottom slot:

[    7.990795] snd_hda_intel 0000:00:1f.3: No response from codec, disabling MSI: last cmd=0x008360a7
[    8.177213] NVRM: GPU 0000:19:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[    8.177235] NVRM: GPU 0000:19:00.0: rm_init_adapter failed, device minor number 0
[    8.428065] IPv6: ADDRCONF(NETDEV_CHANGE): wlx34e8948adb4e: link becomes ready
[    9.002804] snd_hda_intel 0000:00:1f.3: No response from codec, resetting bus: last cmd=0x008360a7
[   12.323334] rfkill: input handler disabled
[   67.559242] NVRM: GPU 0000:19:00.0: RmInitAdapter failed! (0x26:0xffff:1227)
[   67.559275] NVRM: GPU 0000:19:00.0: rm_init_adapter failed, device minor number 0