Rocky Linux trouble with 2x A5000

I was previously having trouble with Rocky and Vulkan (see: Rocky Linux 9, Vulkan troubles), which was driver related and eventually resolved itself. Now I'm seeing a problem similar to: Problems with NVIDIA RTX A5000's on Rocky Linux 9.3

To reiterate: there are 2x A5000s in a Boxx workstation (motherboard: a Z790 Taichi). Until now (driver version 550.54.15) things have been working smoothly. The boards are connected with NVLink, and according to the motherboard documentation the two slots the cards sit in are PCIe 5.0, but drop back to x8 when both are populated, and support something called 'CrossFire', which I can find no way to turn on or off, but which the marketing assures me is better than sliced bread.
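For what it's worth, the negotiated link width for each card can be checked with something along these lines (01:00.0 and 02:00.0 being the addresses that show up in the logs below); LnkCap is what the slot advertises, LnkSta is what was actually negotiated:

sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'
sudo lspci -vv -s 02:00.0 | grep -E 'LnkCap|LnkSta'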

On startup, I get the dreaded black screen where the gdm login screen should be. Looking at the X logs I see:

[ 68377.085] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:1:0:0.  Please
[ 68377.085] (EE) NVIDIA(GPU-0):     check your system's kernel log for additional error
[ 68377.085] (EE) NVIDIA(GPU-0):     messages and refer to Chapter 8: Common Problems in the
[ 68377.085] (EE) NVIDIA(GPU-0):     README for additional information.
[ 68377.085] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[ 68377.085] (EE) NVIDIA(0): Failing initialization of X screen
[ 68377.744] (EE) Screen(s) found, but none have a usable configuration.
[ 68377.744] (EE) 
[ 68377.744] (EE) no screens found(EE) 

I have tried all sorts of xorg.conf manipulations: specifying the PCI BusID, and so on. I even deleted it and put a minimal one in its place; no luck.
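For reference, the "minimal one" was roughly this shape (BusID taken from the address in the X log above; treat this as a sketch rather than the exact file):

Section "Device"
    Identifier "nvidia0"
    Driver     "nvidia"
    BusID      "PCI:1:0:0"
EndSection

Section "Screen"
    Identifier "nvidia"
    Device     "nvidia0"
EndSection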

Looking at the kernel messages, everything appears to be OK:

[    4.404950] nvidia: loading out-of-tree module taints kernel.
[    4.404956] nvidia: module license 'NVIDIA' taints kernel.
[    4.414450] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    4.457216] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[    4.458080] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    4.486132] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input12
[    4.486180] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input13
[    4.486224] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input14
[    4.486255] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card2/input15
[    4.486274] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card3/input16
[    4.486351] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card3/input17
[    4.486372] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card3/input18
[    4.486395] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.1/0000:02:00.1/sound/card3/input19
[    4.502434] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    4.545344] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.54.15  Tue Mar  5 22:23:56 UTC 2024
[    4.548273] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.54.15  Tue Mar  5 21:59:57 UTC 2024
[    4.551253] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    4.551254] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[    4.551275] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[    4.551276] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:00.0 on minor 1

Both cards are recognized, etc. I don’t see anything too worrisome but also don’t have extensive experience with the idiosyncrasies of Rocky / RHEL.

I have resorted to the often-cited Chapter 8 and started experimenting with kernel flags, but haven't found anything satisfying yet. In one case I apparently broke things so badly that I couldn't ssh back in or boot into recovery from GRUB, and had to do my fun-filled manual foo from a USB stick to bring the machine back to life.

My question: any other suggestions for tracking this down? Like I said, adjusting kernel parameters seems like a reasonable approach, but I'm not loving it, since I don't want to lose contact with the machine again :/

(fun edit) Checking to see if SLI was activated via nvidia-settings -q all | grep -i sli locked the machine up! Weee!

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
Please post the output of

sudo cat /sys/module/nvidia_drm/parameters/modeset

I'm sorry - it was sitting here on my desktop, unsent.

I get the ‘N’ from modeset, and

nvidia-bug-report.log.gz (1.0 MB)
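(Noting for my own records: my understanding is that if modeset needed to be switched on, it would be a modprobe option plus an initramfs rebuild, something like the below. I haven't changed it yet:)

echo "options nvidia-drm modeset=1" | sudo tee /etc/modprobe.d/nvidia-drm.conf
sudo dracut -f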

There are no errors in the log, and it's running KDE, not GNOME. Wrong logfile sent?

Yeah - I noticed that - I generated it after it started magically ‘working’ again. Of course, like taking the car to the mechanic and all that. The errors in the original message were there, I swear.

So - I'll keep an eye out for this to happen again and post it here. I can tell you that I went in and experimented with several kernel launch parameters from the 'Chapter 8' document. Each one crippled the machine to the point where I had to boot via USB, clean up the grub2 boot configuration, change it, and reboot.

After none of them seemed to work, I reverted the grub.conf to its original state, rebuilt grub, rebooted, and it's been running fine since (48 hours now, fingers crossed).
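(For next time: rather than hand-editing grub.conf and rebuilding, the RHEL/Rocky-friendly way to toggle kernel arguments appears to be grubby, which updates every installed kernel entry and is easy to revert. The argument below is only a placeholder example:)

# add a test parameter to all installed kernels
sudo grubby --update-kernel=ALL --args="nvidia-drm.modeset=0"
# remove it again if the boot goes sideways
sudo grubby --update-kernel=ALL --remove-args="nvidia-drm.modeset=0"
# sanity-check the cmdline of the default kernel
sudo grubby --info=DEFAULT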

So - I suppose 'it's just me' for right now, but this happened about a month ago and then spontaneously reappeared last week. It did all happen when I switched from GNOME to see how well KDE would perform, which made me panic that something KDE did had 'broken' everything. Grr - sorry for the false-alarm-ish, zero-detail log.

OK, as of today only one of the A5000s is being 'seen' by nvidia-smi, though both still show up in lspci. I rebooted a bit ago and there was much chaos, e.g. not even completing the boot sequence. I haven't changed any of the boot parameters. So this is very much 'where the problem was' when I originally posted; hopefully @generix you can get some use out of this logfile -
nvidia-bug-report.log.gz (120.4 KB)

I need to scan through it too, just to learn more things about stuff.
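For the record, "seen" here means roughly this: lspci still lists both boards (plus their HDMI audio functions), while the driver only enumerates one of them:

sudo lspci -nnk -d 10de:     # what the PCI bus sees
nvidia-smi -L                # what the driver sees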

Furthermore - I was just running some GPU-intensive things, even in 'single card' mode, and when I went to run nvtop to check resource usage, the box froze solid. Like, icy solid.

One GPU is constantly failing:
[ 17.390971] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1589)
Please try reseating it in its slot and check if it works in another system. If not, check the warranty status and replace it.

OK - yes, I reseated them and swapped them between the two slots. I haven't swapped in another known-good card since, well, they're not terribly available here in the lab, but I'll do my best. Is it possible to tell -which- slot is mapped to that card?
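(Partially answering my own question for the archives: it should be possible to match the failing bus address to a card's serial number, and, when the BIOS populates it, to a physical slot designation, with something like this, at least on boots where both cards come up:)

nvidia-smi --query-gpu=index,name,pci.bus_id,serial --format=csv
sudo dmidecode -t slot | grep -E 'Designation|Bus Address'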

This appears to be something else. I grabbed another card and cycled it through the slots. I was always able to boot reliably with any single one of the three cards installed. With both cards installed and no NVLink bridge I could mostly boot; with the NVLink bridge I got it to boot a few times, but even after removing the bridge I still got a few no-boots. It boots to a point where I get some complaints about iwlwifi WRT: Invalid buffer destination and freezes after that. I see some discussions in Ubuntu groups (I'm running Rocky 9) where changing NVIDIA driver versions helped with that and allowed booting to continue.

When I boot into a rescue USB, dmesg | grep iwlwifi gives me no such complaints, fwiw.

Looking through the boot logs, this line:

[    4.877028] nvidia-nvlink: Nvlink Core is being initialized, major device number 235

comes immediately after the iwlwifi kernel module load, so I'm guessing it has something to do with the hang there?
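(To test that guess, and it is only a guess, the iwlwifi module could be kept out of the picture for a boot or two via a kernel argument, either one-off from the grub menu or persistently until removed:)

# one-off: append this to the kernel line from the grub menu ('e' at boot)
modprobe.blacklist=iwlwifi
# or keep it across reboots until removed:
sudo grubby --update-kernel=ALL --args="modprobe.blacklist=iwlwifi"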

Each time I've been able to get it to boot, I've pulled an nvidia-bug-report to compare later. It's a fascinating little adventure.

I have been installing the drivers via the .run files to keep the whole distribution-packaging question out of the equation, but maybe that's a fool's errand?
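(If I do give up on the .run route, the packaged alternative would be NVIDIA's cuda repo for RHEL 9, roughly the following as I understand their docs; the dkms stream rebuilds the kernel module against whatever kernel Rocky ships:)

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install nvidia-driver:latest-dkms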

Did you also try using an earlier driver, e.g. 535?

I did previously, but not systematically. I don't recall any of these problems with the 530 series of drivers.

I have temporarily put my A6000 back in this box because, well, there's work to be done. When I get back in town next week I'll put the A5000s back in and try again. It's running on 535 right now, as an experiment. I had a hiccup, ran the nvidia-bug-report.sh script, and it caused a kernel panic. An uninstall/reinstall and it's back in business for now.
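(For anyone following along, the uninstall/reinstall cycle with the .run installers is basically this, with the display manager stopped first; the filename version is a placeholder:)

sudo systemctl isolate multi-user.target
sudo nvidia-installer --uninstall
sudo sh ./NVIDIA-Linux-x86_64-<version>.run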