Problems with NVIDIA RTX A5000s on Rocky Linux 9.3

Greetings,

I am having an issue where, right after I log in to my user account, the screen goes blank and the GNOME desktop never loads. I don’t see any cursor movement or anything else displayed. The monitor still detects a signal and its backlight stays on, but the Linux desktop never appears. I can see all of the boot text, the BIOS, the GRUB boot screen, and the Rocky Linux graphical login screen where I select an account. But as soon as I type my password and hit Enter, the screen goes blank and nothing ever loads.

I have 4 RTX A500s in this desktop (no SLI), with the following driver and Linux versions:

NVIDIA-SMI 545.23.08
Driver Version: 545.23.08
CUDA Version: 12.3
Rocky Linux 9.3 (Blue Onyx)
Linux Kernel: 5.14.0-362.18.1.el9_3.x86_64

Will attach debug logs. Any assistance in troubleshooting this would be much appreciated!

nvidia-bug-report (1).log (1.7 MB)

Some other things I have tried for troubleshooting:

-Tried a different monitor and different DisplayPort cables, with the exact same symptoms. The monitor detects a signal, but the screen is all black as soon as the GNOME desktop tries to load.

-Tried rebooting with the DP cable on different GPUs; only the first, topmost GPU displays anything before logging in to the local Linux account.

-Tried forcing Xorg and Wayland in /etc/gdm/custom.conf

When I look at the gdm service I get the following:
systemd[1]: Starting GNOME Display Manager…
systemd[1]: Started GNOME Display Manager.

However, once I try to log in and check the gdm service status again, I get the following:
Gdm: GdmDisplay: Session never registered, failing

The log is from a system with dual RTX 6000 Ada; the RTX 500 is a mobile GPU.
Please try setting the kernel parameter nvidia-drm.modeset=1

So sorry for the confusion, I posted the wrong log file. Correct one attached.

Also meant to put RTX A5000 not 500.
nvidia-bug-report.log (3.7 MB)

First, please make sure

sudo cat /sys/module/nvidia_drm/parameters/modeset

returns “N”
Then please delete
/etc/X11/xorg.conf.d/10-nvidia.conf
and create a new /etc/X11/xorg.conf only containing

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    BusID          "PCI:2:0:0"
    Option         "BaseMosaic" "true"
EndSection

then reboot. In case it still doesn’t work, please create a new nvidia-bug-report.log.
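
One thing to watch with the BusID line: xorg.conf takes the bus, device, and function numbers in decimal, while lspci prints them in hexadecimal. A minimal sketch of the conversion follows. lspci_to_busid is a made-up helper name, and the PCI domain is simply dropped, which assumes the card sits in domain 0000.

```shell
# Convert an lspci-style address (hexadecimal, e.g. "42:00.0" or
# "0000:42:00.0") into the decimal "PCI:bus:device:function" form.
# lspci_to_busid is a hypothetical helper, not part of any package.
lspci_to_busid() (
    addr=$1
    # Drop a leading PCI domain such as "0000:" when present.
    case $addr in
        *:*:*) addr=${addr#*:} ;;
    esac
    bus=${addr%%:*}
    rest=${addr#*:}
    dev=${rest%%.*}
    fn=${rest#*.}
    # printf reads "0x.." arguments as hexadecimal and prints decimal.
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
)
```

For example, lspci_to_busid 02:00.0 prints PCI:2:0:0, and lspci_to_busid 0000:42:00.0 prints PCI:66:0:0.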

Hi generix,

I verified that sudo cat /sys/module/nvidia_drm/parameters/modeset returned a “N” value.

I was also able to delete /etc/X11/xorg.conf.d/10-nvidia.conf and I created a new file called “/etc/X11/xorg.conf” with only the parameters you provided.

Unfortunately, it seemed to make the issue slightly worse: the Rocky Linux graphical login screen no longer comes up, and there is just a blinking cursor when the GNOME Display Manager tries to launch. I can press ctrl + alt + d and a text login comes up. But it seems like X sessions are being created and failing over and over.

Mar 25 12:13:53 D02271957 systemd[1]: Starting GNOME Display Manager…
Mar 25 12:13:53 D02271957 systemd[1]: Started GNOME Display Manager.
Mar 25 12:14:23 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:14:23 D02271957 gdm[3109]: Gdm: Child process -3305 was already dead.
Mar 25 12:14:23 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:14:23 D02271957 gdm[3109]: Gdm: Child process -3305 was already dead.
Mar 25 12:15:03 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:15:03 D02271957 gdm[3109]: Gdm: Child process -3638 was already dead.
Mar 25 12:15:03 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:15:03 D02271957 gdm[3109]: Gdm: Child process -3638 was already dead.
Mar 25 12:15:32 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:15:32 D02271957 gdm[3109]: Gdm: Child process -3694 was already dead.
Mar 25 12:15:32 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:15:32 D02271957 gdm[3109]: Gdm: Child process -3694 was already dead.
Mar 25 12:16:01 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:16:01 D02271957 gdm[3109]: Gdm: Child process -3723 was already dead.
Mar 25 12:16:01 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:16:01 D02271957 gdm[3109]: Gdm: Child process -3723 was already dead.
Mar 25 12:16:31 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:16:31 D02271957 gdm[3109]: Gdm: Child process -3780 was already dead.
Mar 25 12:16:31 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:16:31 D02271957 gdm[3109]: Gdm: Child process -3780 was already dead.
Mar 25 12:17:00 D02271957 gdm[3109]: Gdm: GdmDisplay: Session never registered, failing
Mar 25 12:17:00 D02271957 gdm[3109]: Gdm: GdmLocalDisplayFactory: maximum number of X display failures reached: check X server log for errors
Mar 25 12:17:00 D02271957 gdm[3109]: Gdm: Child process -3803 was already dead.
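
The last gdm message points at the X server log. A quick way to pull only the error lines out of it is sketched below. xorg_errors is a hypothetical helper name, and the log path is an assumption that depends on the setup: /var/log/Xorg.0.log is a common location, but some configurations write the log under ~/.local/share/xorg/ instead.

```shell
# Print the lines marked "(EE)" (errors) from an X server log file.
# Note: the marker legend near the top of a real Xorg log also
# contains "(EE)", so expect that line in the output as well.
xorg_errors() {
    grep -F '(EE)' "$1"
}
```

It would be run as, for instance, xorg_errors /var/log/Xorg.0.log.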

New log uploaded.
nvidia-bug-report-new-mar25.log (5.0 MB)

(EE) NVIDIA(GPU-0): Failed to initialize DMA.

Please set kernel parameter iommu=off

Hi generix,

Would I set this in the BIOS, or is there a configuration file to edit?

Thanks.

https://access.redhat.com/documentation/de-de/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/configuring-kernel-command-line-parameters_managing-monitoring-and-updating-the-kernel#changing-kernel-command-line-parameters-for-all-boot-entries_configuring-kernel-command-line-parameters

Just to confirm, I would do grubby --update-kernel=/boot/vmlinuz-$(uname -r) --args="iommu=off"

I am seeing some references saying that the parameter is actually amd_iommu=, and I do have an AMD CPU in this system, so I wasn’t sure which one to try.

Thank you.

You should rather use --update-kernel=ALL, otherwise this gets lost on a kernel update.
iommu=off is the all-off-no-questions-asked parameter. amd_iommu is a fine-tuning parameter, only turning off the AMD hardware IOMMU.
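
Put together, that would look roughly like the sketch below. The grubby calls are left as comments because they need root and change the boot configuration. has_kernel_arg is a hypothetical helper for checking the result after a reboot; it matches whole parameters, so amd_iommu=off does not count as iommu=off.

```shell
# Apply the parameter to every installed kernel entry (needs root):
#   sudo grubby --update-kernel=ALL --args="iommu=off"
# Inspect the resulting boot entries:
#   sudo grubby --info=ALL

# Succeed if a kernel command line (the text of /proc/cmdline) contains
# exactly the given parameter. has_kernel_arg is a hypothetical helper.
has_kernel_arg() (
    set -f                      # no filename expansion while splitting
    cmdline=$1
    want=$2
    for arg in $cmdline; do
        [ "$arg" = "$want" ] && exit 0
    done
    exit 1
)
```

After rebooting, has_kernel_arg "$(cat /proc/cmdline)" iommu=off should succeed if the running kernel picked the parameter up.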

Got it, appreciate the update!

Just to understand the underlying issue, is this related to the following issue with AMD and SME?
https://download.nvidia.com/XFree86/Linux-x86_64/450.57/README/dma_issues.html

And turning off IOMMU will correct this? I do see related IO errors in dmesg:

nvidia 0000:42:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000e address=0xffb15000 flags=0x0000]
[ 88.932848] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0025 address=0xffb15000 flags=0x0000]
[ 88.932854] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0xffb15000 flags=0x0000]
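
To see at a glance which devices are logging these faults, something like the following sketch could be used. page_fault_devices is a hypothetical helper name, and the pattern assumes the message layout shown in the lines above.

```shell
# Read kernel log text on stdin and print each PCI address that logged
# an AMD-Vi IO_PAGE_FAULT, one per line, sorted and de-duplicated.
# page_fault_devices is a hypothetical helper name.
page_fault_devices() {
    grep -F 'IO_PAGE_FAULT' |
        sed -n 's/.* \([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{2\}:[0-9a-fA-F]\{2\}\.[0-7]\): AMD-Vi:.*/\1/p' |
        sort -u
}
```

It reads from a pipe, for example: sudo dmesg | page_fault_devices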

No, that was fixed long ago.
In order for Mosaic to work, the GPUs need to be able to communicate with each other. If this is not correctly configured by the mainboard’s BIOS, the IOMMU will block it, as it isolates the slots, unless SLI capable.
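
As a diagnostic aid only, the kernel exposes how the IOMMU has grouped devices under /sys/kernel/iommu_groups. The sketch below lists the groups. list_iommu_groups is a hypothetical helper name, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# Print every device in each IOMMU group as "group <n>: <pci address>".
# The optional argument overrides the sysfs directory to scan.
# list_iommu_groups is a hypothetical helper name.
list_iommu_groups() (
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        # Nothing matched: typically the IOMMU is off or unsupported.
        [ -e "$dev" ] || continue
        group=${dev%/devices/*}
        group=${group##*/}
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
)
```

Running list_iommu_groups with no arguments and looking for the four GPU addresses shows which group each card landed in. Empty output usually means the IOMMU is disabled.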

Hi generix,

I was able to disable IOMMU through the BIOS and this resolved the issue!

Thank you so much for all the help. You have been very helpful.
