Nvidia driver may not work with X server

"We have an issue of implementing NVIDIA driver 450 on RHEL 7.9 servers.

The nvidia card information:

[root@ai-hpcirfprd1 ~]# lspci | grep VGA

05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)

An NVIDIA driver was installed:

[root@ai-hpcirfprd1 ~]# lsmod | grep nvidia

nvidia_drm 48606 0

nvidia_modeset 1176938 2 nvidia_drm

nvidia 19658222 30 nvidia_modeset

drm_kms_helper 186531 1 nvidia_drm

drm 456166 3 drm_kms_helper,nvidia_drm

[root@ai-hpcirfprd1 ~]# nvidia-smi

Mon Apr 18 13:04:13 2022

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  On   | 00000000:05:00.0 Off |                  N/A |
| 22%   43C    P8    29W / 250W |      1MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          On   | 00000000:22:00.0 Off |                    0 |
| 23%   32C    P8    22W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, xrandr does not show any provider associated with the NVIDIA card:

[root@ai-hpcirfprd1 ~]# xrandr --listproviders

Providers: number : 0

And dmesg shows that the nvidia module tainted the kernel:

[root@ai-hpcirfprd1 ~]# dmesg | grep taint

[ 15.196362] nvidia: loading out-of-tree module taints kernel.

[ 15.196374] nvidia: module license 'NVIDIA' taints kernel.

[ 15.196376] Disabling lock debugging due to kernel taint

[ 15.300715] nvidia: module verification failed: signature and/or required key missing - tainting kernel

Could you please let me know what is wrong here, or what I am missing? Let me know if you need more information."

There doesn't seem to be any X server started.

I tried to start X server with startx and there are errors in Xorg.0.log:

grep EE /var/log/Xorg.0.log

(WW) warning, (EE) error, (NI) not implemented, (??) unknown.

[ 142.751] (EE) NVIDIA(G0): GPU screens are disabled

[ 142.751] (EE) NVIDIA(G0): Failing initialization of X screen
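
A plain `grep EE` also matches the marker legend line at the top of the log, as seen above. A minimal sketch that keeps only timestamped (EE) entries follows; the function name `xorg_errors` is hypothetical, and the pattern assumes the usual `[ seconds.millis] (EE)` line prefix:

```shell
# Print only timestamped (EE) entries from an Xorg log, skipping the
# marker legend, which also contains the string "(EE)".
# $1: path to the log; defaults to the path used in this thread.
xorg_errors() {
    grep -E '^\[ *[0-9]+\.[0-9]+\] \(EE\)' "${1:-/var/log/Xorg.0.log}"
}
```

Calling `xorg_errors` with no argument reads /var/log/Xorg.0.log; grep exits non-zero when the log contains no such entries.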

I used nvidia-xconfig to create the xorg.conf:

Section "Screen"
    Identifier "Screen0"
    Device "Device0"
    Monitor "Monitor0"
    DefaultDepth 24
    Option "AllowEmptyInitialConfiguration" "True"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

There is only one VGA card:

lspci | grep VGA

05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)

Any idea why I get these errors?
(EE) NVIDIA(G0): GPU screens are disabled
(EE) NVIDIA(G0): Failing initialization of X screen

Thanks,

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

I have uploaded the nvidia-bug-report.log.gz file:
nvidia-bug-report.log.gz (1.7 MB)
Xorg.0.log contains messages like the following:

(II) NVIDIA(0): NVIDIA GPU GeForce GTX TITAN X (GM200-A) at PCI:5:0:0 (GPU-0)

(--) NVIDIA(0): No enabled display devices found; starting anyway because
(--) NVIDIA(0): AllowEmptyInitialConfiguration is enabled

(EE) NVIDIA(G0): GPU screens are disabled

(EE) NVIDIA(G0): Failing initialization of X screen

Running glxinfo also reports an error:

-bash-4.2$ glxinfo
name of display: localhost:11.0
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 149 (GLX)
Minor opcode of failed request: 24 (X_GLXCreateNewContext)
Value in failed request: 0x0
Serial number of failed request: 19
Current serial number in output stream: 20

The logs show that the Xserver is running fine on the nvidia gpus. No errors.

BTW, you ran glxinfo over ssh with X forwarding (the display is localhost:11.0), so you queried the X server on your client, not the one on this machine.
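
To tell whether a shell is talking to a forwarded display before running glxinfo, a rough heuristic sketch is shown here. It assumes sshd's default X11DisplayOffset of 10, so forwarded displays look like localhost:10.0 and up; the function name `is_forwarded_display` is hypothetical, and a server configured differently would not be detected:

```shell
# Heuristic only: sshd X11 forwarding normally sets DISPLAY to
# localhost:N.S with N >= 10 (default X11DisplayOffset is 10).
# $1: display string to check; defaults to the current $DISPLAY.
is_forwarded_display() {
    case "${1:-${DISPLAY:-}}" in
        localhost:[1-9][0-9]*) return 0 ;;  # two or more leading digits
        *) return 1 ;;
    esac
}
```

For example, `is_forwarded_display && echo "glxinfo would query the ssh client"` prints the message for the localhost:11.0 display above, while a local display such as :0 does not match.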

Do I need to start X if I run glxinfo on the console? What is the appropriate way to start the X server, and what is normally used to start it?

Installing and enabling a display manager like lightdm, gdm… and a DE is the usual way. This depends on your use-case though.
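
As a rough sketch of that approach on a systemd-based system such as RHEL 7, assuming the display manager package and a desktop environment are already installed: the helper names `run` and `enable_display_manager` and the `DRY_RUN` variable are hypothetical, only the `systemctl` subcommands are real.

```shell
# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# Boot into the graphical target and start a display manager.
# $1: display manager unit name (assumed already installed); default gdm.
enable_display_manager() {
    dm="${1:-gdm}"
    run systemctl set-default graphical.target
    run systemctl enable "$dm"
    run systemctl start "$dm"
}
```

Running `DRY_RUN=1 enable_display_manager lightdm` only prints the three commands, which is a safe way to review them before running the function as root.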

Do you know which versions of the NVIDIA driver support this NVIDIA card?

lspci -vnn | grep VGA

05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [ VGA controller])

It’s supported up to the latest 510 driver.

I started the gdm service. Does gdm use the NVIDIA driver? If not, which X server uses the NVIDIA driver? I saw these messages:

[ 58567.825] (II) No input driver specified, ignoring this device.

[ 58567.825] (II) This device may have been added with another device file.

You configured your xorg.conf to use the nvidia driver, so yes.

Although startx does not produce any (EE) errors in Xorg.0.log, there is still no provider:

xrandr --listproviders

Providers: number : 0

Why is that? Maybe this setting suppressed the (EE) errors:
Section "ServerLayout"
    Identifier "layout"
    Option "AllowNVIDIAGPUScreens"
EndSection

Please see this:
https://forums.developer.nvidia.com/t/headless-server-blank-screen-before-login/211605/3?u=generix

Got it. Thanks.