If NVIDIA developers are reading this forum: we have an internal Red Hat support case, 02067698, for this issue. Dell and HP are also tracking this problem internally under the same Red Hat support case number.
See the attached bug report from RHEL 7.5 and driver 390.48.
I'm also including the backtraces from /var/log/messages that nvidia-bug-report excludes.
nvidia-bug-report.log.gz (146 KB)
messages.txt (19.6 KB)
Hi dereksybau8,
Can you please ssh to the system remotely and collect an nvidia bug report as soon as the issue hits? Also, what errors do you see in dmesg when the issue hits?
I think you are facing multiple issues, so let's discuss them one at a time. We want to reproduce this issue internally, so we need step-by-step reproduction instructions. Also: Is the issue only with Dell's GPUs and not with PNY's? Do you have both Dell and PNY GPUs installed in the system, and how many of each? How many displays are connected to each GPU? Are you using any display dongle or cable such as mDP-DP? Are you running any test on the system after login? Does simply logging in and out multiple times hit this issue with 5 displays configured in SLI? Does the issue reproduce with no application running on the desktop, just repeated login/logout? Can you please share links for the Red Hat and HP tickets?
Yes, if possible I’d like to focus on why the system’s GPUs fail after a period of login/logout attempts when booting via UEFI.
I was planning to update this post on Thursday, but I'll note now the additional testing we have performed over the past week. The GPU failure appears to be fixed in the short-lived 396 driver. We've tested 396.18 and 396.24, and neither causes the system to fail or freeze with the steps below. We still want to test this on RHEL 7.5; that should be done by the end of the day.
However, https://devtalk.nvidia.com/default/topic/1032725/linux/396-18-02-neon-sddm-crash-on-boot-xid-62-nvrm-rm_init_adapter-failed-for-device-bearing-min-/post/5262427/#5262427 recommends not using the short-lived driver, and I would agree. Either a new long-lived driver needs to be released, or 390 needs to be patched with whatever fixed this in 396.18.
So until we have an updated, tested long-lived driver, we still consider this an issue.
I’ll check to see if we can ssh to the machine after the freeze. Stay tuned for any updates tomorrow.
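For what it's worth, the check could be scripted so a bug report gets captured the moment the issue hits. A minimal sketch, assuming the "Lost display notification" warning from our dmesg output is a reliable trigger (the hostname and polling interval below are placeholders, not our actual setup):

```shell
#!/usr/bin/env bash
# has_gpu_failure: succeed if the given kernel-log text contains the
# warning that, in our tests, precedes the display blackout.
has_gpu_failure() {
  grep -q "Lost display notification" "$1"
}

# Intended usage over ssh (hostname "testbox" is a placeholder), polling
# until the warning appears and then collecting a bug report right away:
#   while ! ssh testbox 'dmesg | grep -q "Lost display notification"'; do
#     sleep 30
#   done
#   ssh testbox 'nvidia-bug-report.sh'
```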
- Perform a simple GUI installation of RHEL 7.4 or RHEL 7.5
- If you're using RHEL 7.4 be sure to upgrade gdm to at least gdm-3.22.3-12.el7
- Install nvidia driver. We've tested 384.111 and all 390 versions
- Configure an xorg.conf with nvidia-settings that uses BaseMosaic with 2 GPUs
- Create a user to login with
- Configure this user to be used with gdm's TimedLogin
- In /etc/gdm/custom.conf, add to the [daemon] section:
    TimedLoginEnable=true
    TimedLogin=USERYOUCREATED
    TimedLoginDelay=5
- In the USERYOUCREATED home directory, create a simple shell script ~/logout:
    #!/usr/bin/env bash
    # Clicking the xmessage kills this script, which stops the test loop
    bash -c "xmessage 'Click to stop'; kill $$" &
    sleep 15
    gnome-session-quit --no-prompt --logout
- chmod +x ~/logout
- Run gnome-session-properties and add that script as a startup script
- Reboot
- gdm should now log in after 5 seconds; the user session automatically runs ~/logout and logs out after 15 seconds. This repeats.
- Wait for failure. This could be as soon as a couple of minutes; however, it's best to let it run until gdm can't start.
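As a side note, the number of completed login/logout cycles can be estimated from the kernel log, since every login reallocates the GPUs. A hedged sketch (the match string is taken verbatim from our dmesg output; the log file path is whatever holds your kernel messages, e.g. /var/log/messages on RHEL):

```shell
# count_cycles: estimate login cycles recorded in a kernel log file by
# counting "Allocated GPU:0" lines (one appears per gdm session start).
count_cycles() {
  grep -c "nvidia-modeset: Allocated GPU:0" "$1"
}
```

Typical use: `count_cycles /var/log/messages`.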
Due to a bug in gdm, if your GUI session reaches a black background with only an X cursor, you have passed the test. This normally takes ~4 hours, or ~400-500 login/logout attempts.
So far this doesn't appear to be related to any one vendor. We have reproduced it on multiple Dell and HP configurations, as well as with Dell-, HP-, and PNY-branded cards.
As far as I can remember, no. We have never had a combination of PNY, Dell, or HP cards in a single system.
We have always used all-PNY, all-Dell, or all-HP cards when testing.
We have tested multiple configurations with different combinations of displays connected to each GPU; this was already covered in comment #6: 3x2, 2x3, 4x1, 1x4.
In some cases we are using mDP-DP dongles. If I remember correctly, we have also tested single DP-to-mDP cables without the need for dongles.
No.
We have reproduced this with 5 and 6 displays connected.
As far as we can tell, the issue occurs if you're using BaseMosaic, so in theory this could be triggered with just two displays, one connected to each GPU.
FYI, we're not using SLI, unless you want to categorize BaseMosaic as SLI.
More or less, yes. See the steps above. No additional applications are run after login, just a simple click-to-stop xmessage, a sleep, and a GNOME logout via a command-line utility.
You can find the Red Hat ticket in comment #21, 02067698.
I don’t have a ticket number for Dell or HP support. This has been simple emails between us and the vendor reps. The reps are working with their own internal teams.
See the attached nvidia-bug-report, collected ~5 minutes after the GPUs failed and the displays went black. This test took ~3 hours until the GPUs failed; before that, it was typically ~15 minutes of the above test.
I was still able to ssh to the machine. However, as pointed out in comment #16 [url]https://devtalk.nvidia.com/default/topic/1028914/linux/dual-nvidia-p600-basemosaic-freezes-system-if-booting-from-uefi-with-390-42-during-x-restart/post/5237040/#5237040[/url], any reboot or runlevel change would freeze the machine to the point where the network would go down and a hard reset (holding the power button) was required to reboot.
nvidia-bug-report.log.gz (179 KB)
Testing RHEL 7.5, we see the same results with 396.26: the GPUs do not fail and our tests pass.
Is there any way to identify what fixed this issue in 396, and if/when it will be applied to a long-lived driver?
That's good news. We have been consulting with HP on your issue, and HP indicated they were waiting on the results of some test requests to help further investigate on their side. Can you please update HP and Red Hat that the issue is resolved for you?
HP, Dell, and Red Hat have been informed. However, what we're looking for now is which fix in the 396 release notes addresses this issue, and a possible time frame for when it will be included in a long-lived release.
Hi dereksybau8,
Can I get logs collected as soon as the issue hits? In the log you provided, I see the below error in dmesg:
[10443.191618] nvidia-modeset: WARNING: GPU:1: Lost display notification (1:0x00000000); continuing.
[10680.050252] INFO: task nvidia-modeset:369 blocked for more than 120 seconds.
[10680.050258] “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
[10680.050261] nvidia-modeset D ffff97695eaf3f40 0 369 2 0x00000000
[10680.050268] Call Trace:
[10680.050290] [] schedule+0x29/0x70
[10680.050295] [] schedule_timeout+0x239/0x2c0
[10680.050300] [] ? schedule_timeout+0x239/0x2c0
[10680.050337] [] ? _nv000813kms+0xd/0x90 [nvidia_modeset]
[10680.050369] [] ? _nv000793kms+0xd/0x30 [nvidia_modeset]
[10680.050374] [] ? __slab_free+0x81/0x2f0
[10680.050381] [] __down_common+0xaa/0x104
[10680.050391] [] ? cascade+0x30/0xc0
[10680.050398] [] __down+0x1d/0x1f
[10680.050404] [] down+0x41/0x50
[10680.050419] [] nvkms_kthread_q_callback+0x4b/0xe0 [nvidia_modeset]
[10680.050567] [] _main_loop+0x91/0x190 [nvidia]
[10680.050681] [] ? nv_kthread_q_init+0x120/0x120 [nvidia]
[10680.050688] [] kthread+0xd1/0xe0
[10680.050697] [] ? insert_kthread_work+0x40/0x40
[10680.050704] [] ret_from_fork_nospec_begin+0x21/0x21
[10680.050710] [] ? insert_kthread_work+0x40/0x40
In your log I have not seen the "NVRM: RmInitAdapter failed" or "NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:3:0:0" errors. When did you see these? Also, /bin/nvidia-smi --query ran okay on your setup after the issue hit, so I think the GPU is not lost. What did you see on the graphical desktop when the issue hit? Can you share a video or photo?
The RmInitAdapter, "Failed to initialize", and "Xid error 56" errors didn't always appear. However, Xid error 56 seems closely related to the current giveaway, "Lost display notification". During some of the initial testing in early 2018 (see comments 1 to 15) with 384, 387, and maybe early 390, one GPU would fail during the login/logout test, then return during the next login/logout. I believe that was when we would see the RmInitAdapter and "Failed to initialize" messages.
For the latest drivers, the tell seems to be getting a "WARNING: GPU:N: Lost display notification" message during the "Freed GPU:" and "Allocated GPU:" phase (this is us logging in and out with the login/logout test from comment #24).
You see this in the last set of logs:
...
Jun 14 13:01:12 mtacws005 kernel: nvidia-modeset: Allocated GPU:0 (GPU-c3e52699-2412-a293-31f0-3b0f7d07896d) @ PCI:0000:02:00.0
Jun 14 13:01:12 mtacws005 kernel: nvidia-modeset: Allocated GPU:1 (GPU-841588c4-0274-3ddc-65ea-9b06d75b4848) @ PCI:0000:03:00.0
Jun 14 13:01:16 mtacws005 kernel: nvidia-modeset: WARNING: GPU:1: Lost display notification (1:0x00000000); continuing.
Jun 14 13:05:13 mtacws005 kernel: INFO: task nvidia-modeset:369 blocked for more than 120 seconds.
Jun 14 13:05:13 mtacws005 kernel: nvidia-modeset D ffff97695eaf3f40 0 369 2 0x00000000
Jun 14 13:07:13 mtacws005 kernel: INFO: task nvidia-modeset:369 blocked for more than 120 seconds.
Jun 14 13:07:13 mtacws005 kernel: nvidia-modeset D ffff97695eaf3f40 0 369 2 0x00000000
...
After that message is when the displays would go black, enter sleep mode, and stop working. Nothing was displayed, not even a text console. As mentioned before, this normally takes ~15-20 minutes from the start of our test; however, the last test took ~3 hours.
We can still ssh to the machine, but any runlevel change or restart of the machine fails, and a hard restart is required. This seems related to this post: https://devtalk.nvidia.com/default/topic/1032564/linux/-strange-partial-workaround-nvidia-modeset-crash-on-changing-virtual-terminal/post/5253441/#5253441
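One untested idea for avoiding the walk to the power button: since only the orderly shutdown path hangs, a SysRq emergency reboot over ssh might serve as a remote hard reset. This is just a sketch; like the power button, it skips filesystem sync and unmount, so use it with care:

```shell
# force_reboot: trigger an immediate kernel-level reboot via magic SysRq.
# Equivalent to a hard reset -- no services stopped, no filesystems synced.
# Must be run as root on the hung machine (e.g. over ssh).
force_reboot() {
  echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq triggers are enabled
  echo b > /proc/sysrq-trigger       # 'b' = reboot the machine immediately
}
```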
No photo or video. The testing hardware has been decommissioned, and it might take a couple of weeks to repurpose existing hardware to test with. Are you having any issues reproducing this locally?
Also, keep in mind that some of our nvidia bug reports have come from multiple pieces of hardware: some Dell towers, some HP towers, and some PNY-, HP-, and Dell-installed P600 cards. This might explain some of the small differences between the logs.
However, all have exhibited approximately the same behavior: either one GPU failing during the logout/login test, or total GPU failure with the displays going black and entering sleep mode.