Dual NVIDIA P600 BaseMosaic freezes system if booting from UEFI with 390.42 during X restart

Testing dual P600s on a Dell 5820 and a Dell 7810 with five displays, the card driving the fifth display will randomly fail (no pattern found as of right now) and the BaseMosaic configuration will revert to a single-display setup.

See attached nvidia-bug-report.log.gz

When this happens we’ll see the following in /var/log/messages:

...
... kernel: NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
... kernel: NVRM: rm_init_adapter failed for device bearing minor number 1
...

Does RmInitAdapter suggest this is faulty hardware, power, or motherboard/PCI related?

And the following in /var/log/Xorg.0.log:

...
(EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:3:0:0.  Please
(EE) NVIDIA(GPU-0):     check your system's kernel log for additional error
(EE) NVIDIA(GPU-0):     messages and refer to Chapter 8: Common Problems in the
(EE) NVIDIA(GPU-0):     README for additional information.
(EE) NVIDIA(GPU-0): Failed to initialize one NVIDIA graphics device!
(WW) NVIDIA(GPU-0): Failed to initialize Base Mosaic configuration.  Reason: One
(WW) NVIDIA(GPU-0):     GPU failed to initialize; Only one GPU will be used for
(WW) NVIDIA(GPU-0):     this X screen.
...

Interestingly, the PCI ID listed in the Xorg.0.log file is the card with the active displays, not the card with the display that fails to light up.

Logging out (restarting X) will sometimes bring the BaseMosaic config back. However, after a couple of login/logout attempts the same issue occurs and the same messages from above appear.

This is possibly related to https://devtalk.nvidia.com/default/topic/1027427/nvidia-384-98-and-display-orientation-chaning-when-x-restarts-layout-less-xorg-conf/

A couple of things we’ve tried so far:

  • nvidia-drm.modeset=1 when there are multiple GPUs will cause the machine to freeze once the GUI loads
  • nomodeset was required during installation (via a USB anaconda kickstart), otherwise all displays would go black and enter power-saving mode at some point while the system was loading
  • Not setting nomodeset appears to have the same effect as setting nomodeset
  • The same two P600 cards (the only ones we have) were tested in a Dell 5820 and a Dell 7810. The 5820 has a tendency to freeze, requiring a hard power cycle; the 7810 is a bit more forgiving when it comes to this issue and doesn't freeze like the 5820
  • So far it doesn't appear to be WM related. We've seen the same issue happen when logging out of gnome-session and fvwm.
  • Using a different xorg.conf (different modeline) and two NVS 510s, we do not see any issues with the 5820 or the 7810, which might suggest this is a P600- or P-series-related issue

nvidia-bug-report.log.gz (139 KB)

Forgot to note: if this might suggest faulty hardware, does NVIDIA have any type of diagnostic/stress-test utilities that would push hardware-related issues to the surface? I believe a couple of years ago NVIDIA had something, but it’s either behind a login (debug tools) or I’m not remembering the exact program/application.

Both cards by themselves work fine; it seems like the driver is failing at tying them together. It's consistent across two different systems, which would point to a driver/timing issue. Did you check what happens when you connect 4 cards to the Dell P600 and only one to the Nvidia P600? Or a 2+3 combo?

4 cards? Do you mean 4 displays? If so, yes: on Friday I tested only one card (PNY) with 4 displays and didn’t see any issues. Just adding the second card (Dell) with nothing attached to it, I started to see issues. However, this is with a simple xorg.conf (only a Device section), so the driver might be touching the “unused” video card in some manner.

I plan to write an xorg.conf that only calls out the card with displays attached to see if that makes any difference. Later I’ll run the same tests with the Dell card: both cards in the system, but only the Dell card in use and connected to displays.
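For reference, the single-card xorg.conf I have in mind would be a sketch along these lines (the BusID here assumes the PCI:3:0:0 from the Xorg log above; the actual bus ID for the card with displays attached would come from lspci):

```
Section "Device"
    Identifier "Card0"
    Driver     "nvidia"
    # Pin the driver to one specific card so it does not touch the second GPU.
    # BusID format is PCI:bus:device:function; PCI:3:0:0 is an assumption
    # taken from the Xorg.0.log snippet earlier in the thread.
    BusID      "PCI:3:0:0"
EndSection
```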

I can attempt a test where we have a different combination of displays (2+3, 3+2, 1+4) connected to each card.

Sorry, I didn’t answer the question: the comments from the first post would be a 1 (Dell) + 4 (PNY) setup.

We’ll try a 4+1.

Some additional notes from today’s testing:

  • Both the 7810 and 5820 will freeze
  • nvidia-drm.modeset=1 is required when dealing with a single card; otherwise the system becomes very unstable
  • As noted above, with nvidia-drm.modeset=1 and multiple GPUs, the 7810 (and I believe the 5820) will freeze the system once the GUI loads. Verified by ping/ssh to the machine once "frozen."
  • It appears the system was semi-stable (still saw crashes, still saw displays not lighting up "randomly") when using a 2+3 or 3+2 setup vs. a 4+1 or 1+4 setup. Could this be a power issue? According to Dell's docs they're both 75W slots, and 40W (P600) plus extra for the four displays should be under 75W per slot

I’m beginning to believe this is a modeset issue, if only because the system becomes unstable when we don’t set it.

Is there any way to disable modesetting altogether? Unless there’s a special use case for it, I see zero benefit from the features it provides.

Reading NVIDIA’s README, nvidia-drm.modeset is experimental. Is this still the case, or is this old documentation that needs to be updated? http://us.download.nvidia.com/XFree86/Linux-x86_64/384.111/README/kms.html

Would blacklisting nvidia-drm and nvidia-modeset be any different from setting the nomodeset or nvidia-drm.modeset=0 (the default) kernel parameters?
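For concreteness, the two approaches being compared would look something like this (a sketch; the file paths are the usual RHEL locations, and whether the blacklist variant is even viable is exactly the question):

```
# Kernel-parameter approach: append to GRUB_CMDLINE_LINUX in /etc/default/grub,
# then regenerate grub.cfg with grub2-mkconfig:
#   nomodeset              disables kernel modesetting globally
#   nvidia-drm.modeset=0   leaves nvidia-drm KMS off (the default)

# Module-blacklist approach: e.g. /etc/modprobe.d/blacklist-nvidia-kms.conf
# (note: "blacklist" only blocks alias-based autoloading, not explicit loads
# or dependency loads, so it may not behave like the kernel parameters):
blacklist nvidia-drm
blacklist nvidia-modeset
```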

Both Dells are workstation models specifically designed for your setup, running two graphics cards with up to 600W combined power draw, so the two Quadros should be a piece of cake for them.
I think you’re a bit confused because everything is just ‘modesetting’. Modesetting in general is just the process of setting the right mode for your monitor, i.e. resolution and clocks (programming the display engine of the GPU). This is needed, otherwise you won’t see anything; therefore you can’t blacklist the nvidia-modeset and nvidia-drm modules.
Now there are two kinds of modesetting: the old user-space modesetting and the newer kernel modesetting (KMS). All these parameters switch between those.
nomodeset turns off KMS in the kernel altogether.
nvidia-drm.modeset=1 turns on DRM KMS in the nvidia driver.
So both parameters can’t be used at once.
With nvidia-drm.modeset=0, the module nvidia-modeset takes care of modesetting.
With nvidia-drm.modeset=1, nvidia-drm takes over parts of this, alongside other things.
The DRM KMS implementation is still considered experimental because some things are simply not implemented or subject to change, while other things are stable and widely used.
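As a quick sanity check of which mode is in effect on a given boot, something like this should work (assuming a standard Linux sysfs layout; the parameter file only exists once nvidia-drm is loaded):

```shell
# Kernel command line actually used this boot;
# look for nomodeset or nvidia-drm.modeset= here
cat /proc/cmdline

# Whether nvidia-drm KMS is active: prints Y for modeset=1, N for modeset=0,
# or a fallback message if the module isn't loaded
cat /sys/module/nvidia_drm/parameters/modeset 2>/dev/null || echo "nvidia-drm not loaded"
```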

Back to your problem:
You said you were even having problems with just a single card installed? Do you have logs from that?

Thanks for the reply about modeset.

Yes, at first we did have issues with a single card; however, setting nvidia-drm.modeset=1 when using one card appears to stabilize both systems (5820/7810). If you want logs from the system without nvidia-drm.modeset set (or set to 0), we can provide that; however, it appears we have the single-GPU case covered/worked around.

However, when you have two GPUs configured with a BaseMosaic xorg.conf, it appears the system crashes once X loads, or maybe once nvidia-modeset loads.

On the 5820 with two P600s and a BaseMosaic config the machine will kernel panic and auto restart.

See the attached bug report with dmesg and backtrace from the abrt logs. If you think any additional information is needed from the abrt directory or the vmcore from the kernel panic, let me know; it’s ~250MB with a core file.

It appears this is nvidia-modeset crashing?
abrt-backtrace.txt (4.03 KB)
vmcore-dmesg.txt (101 KB)
dual-cards-BaseMosaic-crash-nvidia-bug-report.log.gz (213 KB)

Also, the bug report from Comment #1 is a dual P600 with a BaseMosaic config without nvidia-drm.modeset=1. This will at least load X; however, after a couple of logout attempts it will freeze the system. No kernel panic or auto-restart like Comment #9.

So there are four cases:

  1. Single GPU + modeset=0, not working
  2. Single GPU + modeset=1, working
  3. BaseMosaic + modeset=0, not really working
  4. BaseMosaic + modeset=1, kernel crash
Case 4 is probably where modeset=1 becomes experimental: dual-GPU crashing in KMS.
So you will have to go back to case 1 and see why that’s not working. Did you check with drivers 387 or 390?

Yeah, those are the correct cases, assuming it’s not hardware related, which it doesn’t appear to be.

One change to #1: it’s not working in the same way #3 isn’t; it’s also “not really working.”

However, that aside we’ll start 387 testing and 390 if needed.

Maybe we’ll just need to include 387 (or the short-lived driver branch) in our testing until the transition from older GPU support to newer is complete.

Sorry for the delay. No luck with 387 or 390. See the attached nvidia-bug-report and abrt’s dmesg and backtrace for 390; they seem similar to 384’s. I didn’t capture an nvidia-bug-report or backtrace from 387, or note whether it kernel panicked.

However, based on Dell’s recommendation we rolled back to 7.3 and tried 375.20. This too would kernel panic, though I never saw it freeze the system; at this point I’m convinced that, freeze or kernel panic plus auto-restart, it’s the same issue. See the attached 375 dmesg and backtrace.

All of these use a similar xorg.conf: dual P600, BaseMosaic, 5 heads (1/4)
390-vmcore-dmesg.txt (101 KB)
390-nvidia-bug-report.log.gz (129 KB)
390-backtrace.txt (4.24 KB)
375-backtrace.txt (3.89 KB)

Also, during testing with 390 I did see an Xid error 56, “Display Engine error”.

We see the same issues with the newly released 390.25 elrepo packages on RHEL 7.4.
See the attached debug logs: one from a freeze without nvidia-drm.modeset, and one with nvidia-drm.modeset=1 set.
390.25-freeze-nvidia-bug-report.log.gz (143 KB)
390.25-nvidia-drm.moderset=1-nvidia-bug-report.log.gz (145 KB)

The interesting thing about 390.25 is that when you set nvidia-drm.modeset=1, the machine doesn’t totally freeze when X attempts to load; you can still ssh in. Once in, nothing looked out of place other than Xorg.x.log showing that it didn’t find any displays, was only using one card for the BaseMosaic config, or found an invalid config. I might be able to get a better nvidia-debug-log next week.

However, one thing to note (this is a guess): anything that interacts with the modeset driver will trigger a freeze. So once the screens go black (around the same time X attempts to load), ssh in and run reboot or ‘telinit 3’: freeze.

Forgot to post these logs from last week. I’d like to hope the after-freeze log looks something like Comment #15’s 390.25-freeze-nvidia-bug-report.log.gz; however, as Comment #16 says, I misspoke about the machine totally freezing with multiple GPUs and nvidia-drm.modeset. You can still ssh in, but it will freeze later.

See:

post-Xload-390.25-nvidia-bug-report.log.gz: captured after the displays went black, around where X would start during startup.

after-freeze-390.25-nvidia-bug-report.log.gz: captured after attempting to reboot via ssh, at which point the machine would freeze until hard power cycled.
after-freeze-390.25-nvidia-bug-report.log.gz (148 KB)
post-Xload-390.25-nvidia-bug-report.log.gz (148 KB)

I’m wondering if NVIDIA can reproduce this, or if it’s being tracked internally for either issue?

Appears we have some more data points and a possible temporary workaround for this issue.

We recently bought an HP Z4 for testing, and our logout/login test does not freeze if you’re not using UEFI. The RAID card we had installed had issues that forced us to perform a legacy BIOS installation.

We have since fixed the RAID card and we can now perform UEFI installations. Running the same test the freeze returns.

Any idea what factors could be in play, since we only see this issue with UEFI?

It’s possible that legacy booting will become our default for multi-GPU systems; however, given the recent news that BIOS booting will be removed by 2020, I don’t think this should be the long-term solution.

See the following nvidia bug report from the test in Comment #19, with an updated kernel and the latest long-lived driver, 390.42.
390.42-nvidia-bug-report.log.gz (145 KB)