I was previously having trouble with Rocky and Vulkan (Rocky Linux 9, Vulkan troubles), which was driver-related and managed to resolve itself. Now I’m seeing a problem similar to: Problems with NVIDIA RTX A5000's on Rocky Linux 9.3
To reiterate, there are 2x A5000s in a Boxx workstation (motherboard: a Taichi Z790). Until now (driver version 550.54.15) things have been working smoothly. The cards have NVLink, and according to the motherboard documentation, the two slots I have the cards in are PCIe 5.0 but drop back to x8 when both slots are populated. The board also supports something called ‘CrossFire’, which I can find no way to turn on/off, but the marketing assures me it is better than sliced bread.
On startup, I get the dreaded black screen where the gdm login screen should be. Looking at the X logs I see:
[ 68377.085] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:1:0:0. Please
[ 68377.085] (EE) NVIDIA(GPU-0): check your system's kernel log for additional error
[ 68377.085] (EE) NVIDIA(GPU-0): messages and refer to Chapter 8: Common Problems in the
[ 68377.085] (EE) NVIDIA(GPU-0): README for additional information.
[ 68377.085] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[ 68377.085] (EE) NVIDIA(0): Failing initialization of X screen
[ 68377.744] (EE) Screen(s) found, but none have a usable configuration.
[ 68377.744] (EE)
[ 68377.744] (EE) no screens found(EE)
I have tried all sorts of xorg.conf manipulations, such as specifying the PCI index. I even deleted it and put a minimal one in its place, with no luck.
Looking at the kernel messages, everything appears to be OK:
[ 4.404950] nvidia: loading out-of-tree module taints kernel.
[ 4.404956] nvidia: module license 'NVIDIA' taints kernel.
[ 4.414450] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 4.457216] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 4.458080] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4.486132] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input12
[ 4.486180] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input13
[ 4.486224] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input14
[ 4.486255] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input15
[ 4.486274] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card3/input16
[ 4.486351] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card3/input17
[ 4.486372] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card3/input18
[ 4.486395] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card3/input19
[ 4.502434] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 4.545344] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.15 Tue Mar 5 22:23:56 UTC 2024
[ 4.548273] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.54.15 Tue Mar 5 21:59:57 UTC 2024
[ 4.551253] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 4.551254] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 4.551275] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[ 4.551276] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:00.0 on minor 1
Both cards are recognized, etc. I don’t see anything too worrisome but also don’t have extensive experience with the idiosyncrasies of Rocky / RHEL.
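Since the X log points at the kernel log, here is a minimal sketch of one way to filter a long dmesg dump down to the NVRM lines that are not just the driver's load banner. The function name and the "skip the banner" heuristic are my own assumptions, not anything from the NVIDIA README.

```python
import re

# Heuristic only (an assumption, not an NVIDIA-documented rule):
# keep every "NVRM:" line except the "loading NVIDIA UNIX ..." version banner.
NVRM_LINE = re.compile(r"NVRM: (?!loading NVIDIA UNIX)")


def suspicious_nvrm_lines(log_text):
    """Return the kernel log lines emitted by NVRM, minus the load banner."""
    return [line for line in log_text.splitlines() if NVRM_LINE.search(line)]
```

Fed the output of dmesg or journalctl -k -b as a string, it returns the matching lines. On the excerpt above it returns an empty list, since the only NVRM line there is the load banner.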
I have resorted to the often-cited Chapter 8 and started messing around with the kernel parameters, but haven’t found anything quite satisfying yet. In one case I apparently damaged things so badly that I couldn’t even ssh back in or boot into recovery from grub, and had to do my fun-filled manual foo from a USB stick to bring it back to life.
My question - any other suggestions for tracking this down? Like I said, setting the kernel parameters seems to be an OK approach, but I’m still not loving it since I don’t want to lose contact with the machine again :/
(fun edit) Checking to see if SLI was activated via nvidia-settings -q all | grep -i sli locked the machine up! Weee!