No EDID found with driver 390.xx and RHEL 7.9

I have a multiseat X setup using two Nvidia GTX 730 cards with RHEL 7.9 and the Nvidia 390.144 driver. Occasionally, one seat gets no response from its monitor and the screen fails to display my X session (KDE). Which screen fails varies, but most of the time it’s the second seat’s. When the failure occurs I can still get to a virtual terminal with CTRL+ALT+F2 and log in, so I know the monitor is alive. Looking at the running processes, I can see that X is running twice, once for each seat. The lightdm logs confirm that, but the screen is blank. In the Xorg.X.log I see the following:

[ 50.515] (II) NVIDIA(0): Validated MetaModes:
[ 50.515] (II) NVIDIA(0): "NULL"
[ 50.515] (II) NVIDIA(0): Virtual screen size determined to be 640 x 480
[ 50.515] (WW) NVIDIA(0): Unable to get display device for DPI computation.
[ 50.515] (==) NVIDIA(0): DPI set to (75, 75); computed from built-in default
[ 50.525] (II) NVIDIA: Using 6144.00 MB of virtual memory for indirect memory
[ 50.525] (II) NVIDIA: access.
[ 50.584] (II) NVIDIA(0): Setting mode "NULL"

When the monitor comes up properly I see the following instead:

[ 50.566] (--) NVIDIA(GPU-0): CRT-0: disconnected
[ 50.566] (--) NVIDIA(GPU-0): CRT-0: 400.0 MHz maximum pixel clock
[ 50.566] (--) NVIDIA(GPU-0):
[ 50.573] (--) NVIDIA(GPU-0): CRT-1: disconnected
[ 50.573] (--) NVIDIA(GPU-0): CRT-1: 400.0 MHz maximum pixel clock
[ 50.573] (--) NVIDIA(GPU-0):
[ 50.576] (--) NVIDIA(GPU-0): DFP-0: disconnected
[ 50.576] (--) NVIDIA(GPU-0): DFP-0: Internal TMDS
[ 50.576] (--) NVIDIA(GPU-0): DFP-0: 330.0 MHz maximum pixel clock
[ 50.576] (--) NVIDIA(GPU-0):
[ 50.603] (--) NVIDIA(GPU-0): DELL U2713H (DFP-1): connected
[ 50.603] (--) NVIDIA(GPU-0): DELL U2713H (DFP-1): Internal TMDS
[ 50.603] (--) NVIDIA(GPU-0): DELL U2713H (DFP-1): 330.0 MHz maximum pixel clock
[ 50.603] (--) NVIDIA(GPU-0):
[ 50.603] (--) NVIDIA(GPU-0): DFP-2: disconnected
[ 50.603] (--) NVIDIA(GPU-0): DFP-2: Internal TMDS
[ 50.603] (--) NVIDIA(GPU-0): DFP-2: 165.0 MHz maximum pixel clock
[ 50.603] (--) NVIDIA(GPU-0):
[ 50.609] (WW) NVIDIA(0): No valid modes for "2048x1536_60+0+0"; removing.
[ 50.609] (WW) NVIDIA(0):
[ 50.609] (WW) NVIDIA(0): Unable to validate any modes; falling back to the default mode
[ 50.609] (WW) NVIDIA(0): "nvidia-auto-select".
[ 50.609] (WW) NVIDIA(0):
[ 50.610] (II) NVIDIA(0): Validated MetaModes:
[ 50.610] (II) NVIDIA(0): "DFP-1:nvidia-auto-select"
[ 50.610] (II) NVIDIA(0): Virtual screen size determined to be 2560 x 1440
[ 50.633] (--) NVIDIA(0): DPI set to (108, 107); computed from "UseEdidDpi" X config
[ 50.633] (--) NVIDIA(0): option
[ 50.634] (II) NVIDIA: Using 6144.00 MB of virtual memory for indirect memory
[ 50.634] (II) NVIDIA: access.
[ 50.663] (II) NVIDIA(0): Setting mode "DFP-1:nvidia-auto-select"

When it fails, lspci shows the card is present, but xrandr doesn’t return any outputs on the video card. I haven’t been able to find any debug logging that helps. I did configure lightdm to launch X with "-verbose 6 -logverbose 6", but that didn’t provide anything useful except for the following few lines in the Xorg.X.log:

Found 0 head on board

NoScanout X screen configured with resolution 640 x 480
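For anyone trying to count how often this happens across boots, the failure signatures quoted above can be searched for automatically. This is only a rough sketch: the three patterns are copied from the excerpts in this post and may be worded differently in other driver versions, and the function name `find_no_head_lines` is just an illustrative choice.

```python
import re

# Signatures copied from the log excerpts in this post (assumption: other
# driver versions may print these lines with different wording).
FAILURE_PATTERNS = [
    re.compile(r"Found 0 head on board"),
    re.compile(r"NoScanout X screen configured"),
    re.compile(r'Setting mode "NULL"'),
]


def find_no_head_lines(log_text):
    """Return the log lines that match any of the failure signatures."""
    hits = []
    for line in log_text.splitlines():
        # Normalise curly quotes in case the log was copied through a web page.
        plain = line.replace("\u201c", '"').replace("\u201d", '"')
        if any(pattern.search(plain) for pattern in FAILURE_PATTERNS):
            hits.append(line)
    return hits


# Example use (the path is a placeholder; point it at the affected seat's log):
#   with open("/var/log/Xorg.0.log", errors="replace") as fh:
#       print(find_no_head_lines(fh.read()))
```

An empty result means none of the three signatures appeared, which is what the log from a good boot should give.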

This setup works just fine on RHEL 7.7 but occasionally fails on RHEL 7.9 (at least 1 in every 10-12 boots, and at times as often as 1 in 5). The failure can occur on a fresh boot or on a reboot. Here’s a handful of things I have tried, none of which stopped the failure:

  1. Using lightdm
  2. Using gdm
  3. With or without auto login
  4. Used Nvidia driver 390.116 from my RHEL 7.7 install
  5. Used Nvidia driver 390.144 (current)
  6. Downgraded RHEL 7.9 to use xorg-x11-server-Xorg.1.20.4-7 from my RHEL 7.7 install
  7. Downgraded RHEL 7.9 to use kernel-3.10.0-1062.4.1 from my RHEL 7.7 install
  8. Downgraded RHEL 7.9 to use mesa-dri-drivers-18.3.4-5 from my RHEL 7.7 install (Since I’m using the Nvidia driver I don’t think mesa matters much here.)
  9. Used an EDID Emulator

There appears to be some sort of timing issue, because it only fails occasionally, and when it does fail I can manually run ‘systemctl restart lightdm’ (or ‘systemctl restart gdm’) and that fixes it. So restarting X somehow triggers a re-query of the monitors and they come up. I did try the nouveau driver and could not reproduce the issue with it, which could point to the Nvidia driver. However, the same Nvidia driver from my RHEL 7.7 install (see #4 above) still failed on RHEL 7.9, which suggests to me that something inside RHEL changed and is causing this issue.
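Until the root cause is found, the manual restart could be automated as a stopgap. The following is a minimal sketch, not a tested fix: it assumes each seat's X server can be queried with xrandr through a known DISPLAY value (and, where needed, an XAUTHORITY file), both of which depend on the local setup, and the function names are illustrative. Note that restarting the display manager ends the sessions on both seats, so this is only sensible right after boot.

```python
import subprocess


def connected_outputs(xrandr_text):
    """Return the names of outputs that `xrandr --query` lists as connected."""
    names = []
    for line in xrandr_text.splitlines():
        fields = line.split()
        # Output lines look like "DP-1 connected primary 2560x1440+0+0 ...";
        # "disconnected" outputs and the "Screen 0:" header are skipped.
        if len(fields) >= 2 and fields[1] == "connected":
            names.append(fields[0])
    return names


def seat_has_display(display, xauthority=None):
    """Query one X display; True if at least one output is connected."""
    env = {"DISPLAY": display, "PATH": "/usr/bin:/bin"}
    if xauthority:
        env["XAUTHORITY"] = xauthority
    result = subprocess.run(
        ["xrandr", "--query"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=10,
    )
    return result.returncode == 0 and bool(connected_outputs(result.stdout))


def restart_display_manager(unit="lightdm"):
    """Restart the display manager, i.e. the manual workaround from this post."""
    subprocess.run(["systemctl", "restart", unit], check=True)


# Example use, with placeholder display numbers for the two seats:
#   if not all(seat_has_display(d) for d in (":0", ":1")):
#       restart_display_manager("lightdm")
```

The check relies on the observation above that xrandr returns no outputs when the failure occurs.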

I asked Red Hat for help and they suggested setting the following options in the xorg.conf file. I set these options and the failure still occurred. The last two also forced the resolution down to 1024x768.

    Option "ModeValidation" "AllowNonEdidModes"
    Option "UseEDIDFreqs" "FALSE"
    Option "UseEDIDDpi" "FALSE"

Since the Nvidia driver does the EDID handling, Red Hat suggested I ask here for help.

nvidia-bug-report.log.gz (568.5 KB)

Thanks for any help.

That’s a recurring bug where the nvidia driver thinks the GPU has no heads, i.e. no monitor connectors. The origin is unknown, unfortunately. You might check whether booting with the kernel parameter nvidia-drm.modeset=1 makes it more reliable.
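After adding the parameter to the boot loader configuration, it is worth confirming that it actually took effect before judging whether it helps. A small sketch of such a check follows; the function names are illustrative, and it assumes the nvidia-drm module exposes its `modeset` parameter under /sys/module/nvidia_drm/parameters/, which normally needs root to read.

```python
def modeset_on_cmdline(cmdline):
    """True if the kernel command line requests nvidia-drm.modeset=1."""
    requested = False
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # The kernel treats '-' and '_' in module parameter names alike.
        if key.replace("-", "_") == "nvidia_drm.modeset":
            requested = value in ("1", "y", "Y")  # a later occurrence overrides
    return requested


def modeset_active(path="/sys/module/nvidia_drm/parameters/modeset"):
    """Read the live module parameter; None if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.read().strip() == "Y"
    except OSError:
        # Module not loaded, or the file is not readable without root.
        return None


# Example use:
#   with open("/proc/cmdline") as fh:
#       print(modeset_on_cmdline(fh.read()), modeset_active())
```

If the first value is True but the second is not, the parameter reached the kernel command line but the module did not pick it up.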

Thanks, I’ll give it a try. I’ve used this setup for several years and never had a problem. As I mentioned, the nvidia 390.116 driver works fine on RHEL 7.7 but has this issue on RHEL 7.9, which to me would indicate there’s some sort of handoff/timing issue on the RHEL side that is affecting the nvidia driver.