If NVIDIA developers are reading this forum: we have an internal Red Hat support case, 02067698, for this issue. Dell and HP are also tracking this problem internally under the same Red Hat support case number.
See the attached bug report from RHEL 7.5 and driver 390.48.
I'm also including the backtraces from /var/log/messages that nvidia-bug-report excludes.
nvidia-bug-report.log.gz (146 KB)
messages.txt (19.6 KB)
Hi dereksybau8,
Can you please ssh to the system remotely and collect an nvidia bug report as soon as the issue hits? Also, what errors do you see in dmesg when the issue hits?
I think you are facing multiple issues, so let's discuss them one at a time. We want to reproduce this issue internally, so we need step-by-step reproduction instructions. Also: Is the issue only with Dell's GPUs and not with PNY's? Do you have both Dell and PNY GPUs installed in the system, and how many of each? How many displays are connected to each GPU? Are you using any display dongle or cable such as mDP-DP? Are you running any test on the system after login? Does simply logging in and out multiple times hit this issue with 5 displays configured in SLI? Does the issue reproduce with no application running on the desktop, just repeated login/logout? Can you please share links for the Red Hat and HP tickets?
Yes, if possible I’d like to focus on why the system’s GPUs fail after a period of login/logout attempts when booting via UEFI.
I was planning to update this post on Thursday, but I'll note now the additional testing we have performed over the past week. The GPU failure appears to be fixed in the short-lived 396 driver. We've tested 396.18 and 396.24, and neither causes the system to fail or freeze with the steps below. We still want to test this on RHEL 7.5; that should be done by the end of the day.
However, https://devtalk.nvidia.com/default/topic/1032725/linux/396-18-02-neon-sddm-crash-on-boot-xid-62-nvrm-rm_init_adapter-failed-for-device-bearing-min-/post/5262427/#5262427 recommends not using the short-lived driver, and I would agree. Either a new long-lived driver needs to be released, or 390 needs to be patched with whatever fixed this in 396.18.
So until we have an updated, tested long-lived driver, we still consider this an issue.
I’ll check to see if we can ssh to the machine after the freeze. Stay tuned for any updates tomorrow.
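For what it's worth, the check could be scripted so a bug report gets captured the moment the issue hits. A minimal sketch, assuming the "Lost display notification" warning from our dmesg output is a reliable trigger (the hostname and polling interval below are placeholders, not our actual setup):

```shell
#!/usr/bin/env bash
# has_gpu_failure: succeed if the given kernel-log text contains the
# warning that, in our tests, precedes the display blackout.
has_gpu_failure() {
  grep -q "Lost display notification" "$1"
}

# Intended usage over ssh (hostname "testbox" is a placeholder), polling
# until the warning appears and then collecting a bug report right away:
#   while ! ssh testbox 'dmesg | grep -q "Lost display notification"'; do
#     sleep 30
#   done
#   ssh testbox 'nvidia-bug-report.sh'
```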
- Perform a simple GUI installation of RHEL 7.4 or RHEL 7.5
- If you're using RHEL 7.4 be sure to upgrade gdm to at least gdm-3.22.3-12.el7
- Install nvidia driver. We've tested 384.111 and all 390 versions
- Configure an xorg.conf with nvidia-settings that uses BaseMosaic with 2 GPUs
- Create a user to login with
- Configure this user to be used with gdm's TimedLogin
- In /etc/gdm/custom.conf, add to the [daemon] section:
    TimedLoginEnable=true
    TimedLogin=USERYOUCREATED
    TimedLoginDelay=5
- In the USERYOUCREATED home directory, create a simple shell script ~/logout:
    #!/usr/bin/env bash
    # Clicking the xmessage kills this script, which stops the test loop
    bash -c "xmessage 'Click to stop'; kill $$" &
    sleep 15
    gnome-session-quit --no-prompt --logout
- chmod +x ~/logout
- Run gnome-session-properties and add that script as a startup script
- Reboot
- gdm should now log in after 5 seconds; the user session automatically runs ~/logout and logs out after 15 seconds. This repeats.
- Wait for failure. This could be as soon as a couple of minutes; however, it's best to let it run until gdm can't start.
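As a side note, the number of completed login/logout cycles can be estimated from the kernel log, since every login reallocates the GPUs. A hedged sketch (the match string is taken verbatim from our dmesg output; the log file path is whatever holds your kernel messages, e.g. /var/log/messages on RHEL):

```shell
# count_cycles: estimate login cycles recorded in a kernel log file by
# counting "Allocated GPU:0" lines (one appears per gdm session start).
count_cycles() {
  grep -c "nvidia-modeset: Allocated GPU:0" "$1"
}
```

Typical use: `count_cycles /var/log/messages`.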
Due to a bug in gdm, if your GUI session reaches a black background with only an X cursor, you have passed the test. This normally takes ~4 hours, or ~400-500 login/logout attempts.
So far this doesn't appear to be related to any one vendor. We have reproduced it on multiple Dell and HP configurations, as well as with Dell-, HP-, and PNY-branded cards.
As far as I can remember, no. We have never had a combination of PNY, Dell, or HP cards in a single system.
We have always used all-PNY, all-Dell, or all-HP cards when testing.
We have tested multiple configurations with different combinations of displays connected to each GPU; this was already covered in comment #6: 3x2, 2x3, 4x1, 1x4.
In some cases we are using mDP-DP dongles. If I remember correctly, we have also tested single DP-to-mDP cables without the need for dongles.
No.
We have reproduced this with 5 and 6 displays connected.
As far as we can tell, the issue occurs if you're using BaseMosaic, so in theory this could be triggered with just two displays, one connected to each GPU.
FYI, we're not using SLI, unless you want to categorize BaseMosaic as SLI.
More or less, yes. See the steps above. No additional applications are run after login, just a simple click-to-stop xmessage, a sleep, and a GNOME logout via a command-line utility.
You can find the Red Hat ticket in comment #21, 02067698.
I don’t have a ticket number for Dell or HP support. This has been simple emails between us and the vendor reps. The reps are working with their own internal teams.
See the attached nvidia-bug-report, collected ~5 minutes after the GPUs failed and the displays went black. This test took ~3 hours until the GPUs failed; before that, it was typically ~15 minutes of the above test.
I was still able to ssh to the machine. However, as pointed out in comment #16 [url]https://devtalk.nvidia.com/default/topic/1028914/linux/dual-nvidia-p600-basemosaic-freezes-system-if-booting-from-uefi-with-390-42-during-x-restart/post/5237040/#5237040[/url], any reboot or runlevel change would freeze the machine to the point where the network would go down and a hard reset (holding the power button) was required to reboot.
nvidia-bug-report.log.gz (179 KB)
Testing RHEL 7.5, we see the same results with 396.26: the GPUs do not fail and our tests pass.
Is there any way to identify what fixed this issue in 396, and if/when it will be applied to a long-lived driver?
That's good news. We have been consulting with HP on your issue, and HP indicated they were waiting on the results of some test requests to help further investigate on their side. Can you please update HP and Red Hat that the issue is resolved for you?
HP, Dell, and Red Hat have been informed. However, what we're looking for now is which fix in the 396 release notes addresses this issue, and a possible time frame for when it will be included in a long-lived release.
Hi dereksybau8,
Can I get logs collected as soon as the issue hits? In the log you provided, I see the below error in dmesg:
[10443.191618] nvidia-modeset: WARNING: GPU:1: Lost display notification (1:0x00000000); continuing.
[10680.050252] INFO: task nvidia-modeset:369 blocked for more than 120 seconds.
[10680.050258] “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
[10680.050261] nvidia-modeset D ffff97695eaf3f40 0 369 2 0x00000000
[10680.050268] Call Trace:
[10680.050290] [] schedule+0x29/0x70
[10680.050295] [] schedule_timeout+0x239/0x2c0
[10680.050300] [] ? schedule_timeout+0x239/0x2c0
[10680.050337] [] ? _nv000813kms+0xd/0x90 [nvidia_modeset]
[10680.050369] [] ? _nv000793kms+0xd/0x30 [nvidia_modeset]
[10680.050374] [] ? __slab_free+0x81/0x2f0
[10680.050381] [] __down_common+0xaa/0x104
[10680.050391] [] ? cascade+0x30/0xc0
[10680.050398] [] __down+0x1d/0x1f
[10680.050404] [] down+0x41/0x50
[10680.050419] [] nvkms_kthread_q_callback+0x4b/0xe0 [nvidia_modeset]
[10680.050567] [] _main_loop+0x91/0x190 [nvidia]
[10680.050681] [] ? nv_kthread_q_init+0x120/0x120 [nvidia]
[10680.050688] [] kthread+0xd1/0xe0
[10680.050697] [] ? insert_kthread_work+0x40/0x40
[10680.050704] [] ret_from_fork_nospec_begin+0x21/0x21
[10680.050710] [] ? insert_kthread_work+0x40/0x40
In your log I have not seen the "NVRM: RmInitAdapter failed" or "NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:3:0:0" errors. When did you see these? Also, /bin/nvidia-smi --query ran okay on your setup after the issue hit, so I think the GPU is not lost. What did you see on the graphical desktop when the issue hit? Can you share a video or photo?
The RmInitAdapter, "Failed to initialize", and "Xid error 56" errors didn't always appear. However, Xid error 56 seems closely related to the current giveaway, "Lost display notification". During some of the initial testing in early 2018 (see comments 1 to 15) with 384, 387, and maybe early 390, one GPU would fail during the login/logout test, then return during the next login/logout. I believe that was when we would see the RmInitAdapter and "Failed to initialize" messages.
For the latest drivers, the tell seems to be getting a "WARNING: GPU:N: Lost display notification" message during the "Freed GPU:" and "Allocated GPU:" phase (this is us logging in and out with the login/logout test from comment #24).
You see this in the last set of logs:
...
Jun 14 13:01:12 mtacws005 kernel: nvidia-modeset: Allocated GPU:0 (GPU-c3e52699-2412-a293-31f0-3b0f7d07896d) @ PCI:0000:02:00.0
Jun 14 13:01:12 mtacws005 kernel: nvidia-modeset: Allocated GPU:1 (GPU-841588c4-0274-3ddc-65ea-9b06d75b4848) @ PCI:0000:03:00.0
Jun 14 13:01:16 mtacws005 kernel: nvidia-modeset: WARNING: GPU:1: Lost display notification (1:0x00000000); continuing.
Jun 14 13:05:13 mtacws005 kernel: INFO: task nvidia-modeset:369 blocked for more than 120 seconds.
Jun 14 13:05:13 mtacws005 kernel: nvidia-modeset D ffff97695eaf3f40 0 369 2 0x00000000
Jun 14 13:07:13 mtacws005 kernel: INFO: task nvidia-modeset:369 blocked for more than 120 seconds.
Jun 14 13:07:13 mtacws005 kernel: nvidia-modeset D ffff97695eaf3f40 0 369 2 0x00000000
...
After that message is when the displays would go black, enter sleep mode, and stop working. Nothing was displayed, not even a text console. As mentioned before, this normally takes ~15-20 minutes from the start of our test; however, the last test took ~3 hours.
We can still ssh to the machine, but any runlevel change or restart of the machine fails, and a hard restart is required. This seems related to this post: https://devtalk.nvidia.com/default/topic/1032564/linux/-strange-partial-workaround-nvidia-modeset-crash-on-changing-virtual-terminal/post/5253441/#5253441
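One untested idea for avoiding the walk to the power button: since only the orderly shutdown path hangs, a SysRq emergency reboot over ssh might serve as a remote hard reset. This is just a sketch; like the power button, it skips filesystem sync and unmount, so use it with care:

```shell
# force_reboot: trigger an immediate kernel-level reboot via magic SysRq.
# Equivalent to a hard reset -- no services stopped, no filesystems synced.
# Must be run as root on the hung machine (e.g. over ssh).
force_reboot() {
  echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq triggers are enabled
  echo b > /proc/sysrq-trigger       # 'b' = reboot the machine immediately
}
```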
No photo or video. The testing hardware has been decommissioned, and it might take a couple of weeks to repurpose existing hardware to test with. Are you having any issues reproducing this locally?
Also, keep in mind that some of our nvidia bug reports have come from multiple pieces of hardware: some Dell towers, some HP towers, and some PNY-, HP-, and Dell-installed P600 cards. This might explain some of the small differences between the logs.
However, all have exhibited approximately the same behavior: either one GPU failing during the logout/login test, or total GPU failure with the displays going black and entering sleep mode.