"Switch User" causes X server crash and input device lockup

Hello,

On Red Hat Enterprise Linux 7 (Workstation), we have observed an intermittent problem (but with consistent symptoms) affecting usage of the GNOME Display Manager’s “Switch User” functionality.

Consider the following procedure:

  1. Boot into RHEL7
  2. Log in using a smart card (US Department of Defense Common Access Card (CAC), specifically)
  3. Wait for the desktop environment to finish loading -- typically, we have seen this when using GNOME Classic, but we also have reports of it occurring for KDE users
  4. Remove the smart card, which (thanks to the smart card support provided by Centrify Infrastructure Services) causes the screen to lock
  5. Choose the "Log in as another user" option in the GDM login window
  6. Log back in as the same user, or as a different user, again using CAC

Sporadically, this procedure causes the X server to crash (as identified by Xorg.N.log) and the screen to go black. GDM does not restart, the system does not drop to a tty, and the keyboard and mouse stop responding at the local console. This is not a full system crash, however, because it is still possible to SSH into the machine.

This has happened numerous times, and I have included two nvidia-bug-report.log.gz files from two different occurrences in subsequent replies to this topic.

Perhaps it is helpful to consider the “timeline” for this problem:

  • Problem first begins to be observed on machines with the 384.98 driver
  • 2017-Nov: After a few occurrences, the 2017-11-23 bug report is captured (along with a core dump of the gsettings-data daemon)
  • 2017-Dec: A few system changes are made, including the installation of the acpid software (for reasons unrelated to the X server crash)
  • 2017-Dec: Sometime shortly before or after the above, a concerted effort is made to reproduce the bug: the procedure is repeated 50 times in a row with no crash (previous efforts to reproduce the bug required fewer than 10 attempts)
  • 2018-Jan: NVidia driver updated to 384.111, and the bug continues to be hard to reproduce
  • 2018-Feb: NVidia driver updated to 390.25
  • 2018-Feb (late): Once again, the bug is reproduced and the 2018-02-22 bug report is captured

As you can see, there appears to be little correlation between the driver version and the bug’s reproducibility. Even the nvidia-bug-report.log.gz files seem to be somewhat different, with the first (from 2017-Nov) implicating the nvidia driver in the stack trace but the second (from 2018-Feb) lacking this connection. Can anybody help point us in the direction we should look next?
2017-11-23_nvidia-bug-report.log.gz (222 KB)
2018-02-22_nvidia-bug-report.log.gz (149 KB)

“This is not a full system crash, however, because it is still possible to SSH into the machine”

I see. Similar behavior:

https://bugs.gentoo.org/649298

All the logs are attached there to their Bugzilla. I think it is the same bug.

Thank you for your response, Mike_Z. There are definite similarities between the Gentoo bug (which is cross-posted on this forum) and the issue which I have posted. However, I believe that a few significant aspects distinguish my symptoms from the Gentoo bug, which in turn leads me to believe that the underlying cause (and fix!) could be unrelated:

#1: Timing of the system hang

In the Gentoo bug, LightDM either did not start or failed to display output immediately after a system boot, whereas the bad behavior which I have described on RHEL7 occurs during a “Switch User” operation. It is possible that the RHEL7 hang can also occur during initial boot, but I have not witnessed it, given that the problem only happens intermittently to begin with.

#2: Locations and types of messages

In the Gentoo bug, the following errors/warnings related to the NVidia driver/device were all observed:

  • kernel: NVRM: Your system is not currently configured to drive a VGA console...
  • kernel: caller _nv001170rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
  • kernel: nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
  • kernel: nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000917e:0:0
  • lightdm: X Error of failed request: BadValue (integer parameter out of range for operation)
  • kernel: BUG: unable to handle kernel paging request at ffff922fd7595880 with nvidia location afterwards
  • kernel: Oops: 0000 [#1] SMP PTI
  • kernel: Fixing recursive fault but reboot is needed!

With the exception of the “not currently configured to drive a VGA console” message, none of these messages were produced on my RHEL7 system. In their place, consider the following warnings/errors extracted from the first nvidia-bug-report:

  • kernel: NVRM: Your system is not currently configured to drive a VGA console...
  • Xorg.0.log: Backtrace starting in /usr/bin/X and ending in /usr/lib64/xorg/modules/drivers/nvidia_drv.so
  • Xorg.0.log: Segmentation fault at address 0x55f2bc8ee6d0
  • Xorg.0.log: Caught signal 11 (Segmentation fault). Server aborting
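
For anyone comparing their own logs against the markers listed above, here is a minimal Python 3 sketch that scans Xorg log text for them. The patterns are taken from the messages quoted in this post, and the exact wording may differ between X server versions. The function names (`scan_xorg_log`, `scan_log_dir`) and the default `/var/log` location are assumptions for illustration; logs for rootless X sessions may live elsewhere.

```python
import re
from pathlib import Path

# Markers based on the messages quoted above; wording may vary by X server version.
CRASH_PATTERNS = {
    "backtrace": re.compile(r"Backtrace"),
    "segfault": re.compile(r"Segmentation fault at address (0x[0-9a-fA-F]+)"),
    "abort": re.compile(r"Caught signal 11 \(Segmentation fault\)"),
    "nvidia_frame": re.compile(r"nvidia_drv\.so"),
}


def scan_xorg_log(text):
    """Return {marker: [(line_number, line), ...]} for each marker found in text."""
    hits = {name: [] for name in CRASH_PATTERNS}
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in CRASH_PATTERNS.items():
            if pattern.search(line):
                hits[name].append((number, line.strip()))
    return hits


def scan_log_dir(log_dir="/var/log"):
    """Scan every Xorg.N.log (and rotated copy) under log_dir (assumed location)."""
    results = {}
    for path in sorted(Path(log_dir).glob("Xorg.*.log*")):
        results[str(path)] = scan_xorg_log(path.read_text(errors="replace"))
    return results
```

A log with hits for "segfault" and "abort" but none for "nvidia_frame" would suggest the crash did not pass through the NVidia X driver, which is the kind of distinction drawn between the two bug reports in this thread.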

#3: Miscellaneous differences in system configuration

There are a few obvious (but not necessarily meaningful!) differences: distro (Gentoo vs. RHEL7), kernel version (4.15.7 vs. 3.10.0), display manager (LightDM vs. GDM), and particular NVidia card (GeForce GTX 960 vs. Quadro K620). In addition, the Gentoo and linked bugs all seem to involve an Intel & Optimus setup, whereas mine is on a desktop without Optimus technology.

Your logs are not very consistent: the November log has an X server crash, the February one does not.
My interpretation would be that gdm fails to spawn a new X session, but I can’t see why, since those errors would be in the journal, which is not included in the nvidia logs.
The first case looks like the gdm X server crashed as soon as you clicked ‘log in as another user’, thus switching VT. If that bug still occurs in current drivers, it should be easier to trigger by switching from VT2 to VT7 and back.
The second case looks like both the user and gdm X sessions are still alive, but gdm failed to spawn a new session or switch to the existing one.
It is hard to tell what exactly is happening; better to also check the journal on the next occurrence.
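
The VT-switching test and the journal check suggested above can be combined in a rough Python 3 sketch like the following. It assumes `chvt` (from the kbd package) and `journalctl` are installed, that it is run as root (ideally over SSH, since the local console may lock up), and that VT2 and VT7 are the relevant terminals on the machine in question. The function names, cycle count, and pause length are all illustrative choices.

```python
import subprocess
import time
from datetime import datetime


def vt_switch_commands(first_vt=2, second_vt=7, cycles=10):
    """Build the chvt invocations for `cycles` round trips between two VTs."""
    commands = []
    for _ in range(cycles):
        commands.append(["chvt", str(second_vt)])
        commands.append(["chvt", str(first_vt)])
    return commands


def journal_command(since):
    """Build a journalctl call for current-boot messages logged after `since`."""
    stamp = since.strftime("%Y-%m-%d %H:%M:%S")
    return ["journalctl", "-b", "--no-pager", "--since", stamp]


def stress_vt_switch(first_vt=2, second_vt=7, cycles=10, pause=3.0,
                     run=subprocess.run, sleep=time.sleep):
    """Switch VTs repeatedly, then return the journal text from that period."""
    started = datetime.now()
    for command in vt_switch_commands(first_vt, second_vt, cycles):
        run(command, check=True)
        sleep(pause)  # give the X server time to settle before the next switch
    result = run(journal_command(started), check=True,
                 stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout


# Example (as root): print(stress_vt_switch(cycles=20))
```

The `run` and `sleep` parameters exist only so the loop can be exercised without touching a real console. In normal use the defaults apply, and the returned journal text can be saved alongside the next nvidia-bug-report.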

“Switch user” hangs the system even without a smart card on Ubuntu 18.04 (GNOME 3.28.1-0ubuntu1, nvidia-driver 390.48-0ubuntu2). I have 2 users and can never switch between them: the system always locks up completely and only the reset button helps. So it seems to be a more common issue in the Nvidia and Linux friendship.
Xorg.1.log (10.1 KB)