390.154 with Xorg 1.20.11 on Rocky Linux 9 "No Devices Found" and GetCaptureBufferSize failed

Installation of 390.154 on CentOS 7 works without any problems at all. X.org starts fine and detects my GPU.

However, with a fresh installation of Rocky Linux 9 things are not looking as good. The driver installation completes without any errors, but when switching to graphical.target, /var/log/Xorg.0.log reports:

[  2577.261] (II) NVIDIA dlloader X Driver  390.154  Wed Jun 22 04:48:53 UTC 2022
[  2577.261] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[  2577.263] (EE) No devices detected.
[  2577.264] (EE) 
Fatal server error:
[  2577.264] (EE) no screens found(EE)

I compared /proc/driver/nvidia/gpus/0000:02:00.0/information between Rocky 9 and CentOS 7. On CentOS 7:

Model: 		 Quadro 2000
IRQ:   		 32
GPU UUID: 	 GPU-b2085c8f-4dd7-a830-546c-d48d3a2e2e2e
Video BIOS: 	 70.06.4b.00.05
Bus Type: 	 PCIe
DMA Size: 	 40 bits
DMA Mask: 	 0xffffffffff
Bus Location: 	 0000:02:00.0
Device Minor: 	 0

And on Rocky 9:

Model: 		 Quadro 2000
IRQ:   		 38
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 40 bits
DMA Mask: 	 0xffffffffff
Bus Location: 	 0000:02:00.0
Device Minor: 	 0

Those are about the only differences I’ve found so far. I’m not sure what to make of this information, though.
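A quick way to produce such a comparison in one step, with a placeholder hostname for the CentOS 7 machine:

# diff the two information files directly; "centos7-box" is hypothetical
diff <(ssh centos7-box cat /proc/driver/nvidia/gpus/0000:02:00.0/information) \
     /proc/driver/nvidia/gpus/0000:02:00.0/information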

I executed nvidia-bug-report.sh (attached) and found that, right after /bin/nvidia-debugdump -D, some errors are reported:

ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
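To see whether the failure is reproducible outside the script, the same dump step can be re-run directly (path and -D flag as invoked by nvidia-bug-report.sh per the log above):

# re-run the dump step that fails inside nvidia-bug-report.sh
/bin/nvidia-debugdump -D
# then check the kernel ring buffer for related driver messages
dmesg | grep -i nvidia | tail -n 20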

I am uncertain whether this is an issue with the driver package itself or a kernel issue.
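One thing worth ruling out is X autodetection itself, by pinning the card explicitly. A minimal sketch (untested here; the file name is arbitrary, and the BusID is derived from the bus location 0000:02:00.0 above), e.g. in /etc/X11/xorg.conf.d/10-nvidia.conf:

Section "Device"
    Identifier "NvidiaCard"
    Driver     "nvidia"
    BusID      "PCI:2:0:0"
EndSection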

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

nvidia-bug-report.log.gz (546.4 KB)
Hello @generix

Oh sorry, I thought I had done that in my original post; something must have gone wrong.
I’m trying to attach it here again.

Kind regards,
Martin

The driver seems to be fine; Xorg just seems to be blocked from using it. Have you tried disabling SELinux?

I’ve tried setenforce 0, and also disabled it more permanently by updating the kernel parameters and rebooting
(grubby --update-kernel ALL --args selinux=0).

To no avail…
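To double-check that SELinux was really off and that nothing was being denied, something along these lines should work (a sketch; requires the auditd tools):

# confirm the current SELinux mode
getenforce
# list recent AVC denials mentioning Xorg, if any
ausearch -m avc -ts recent 2>/dev/null | grep -i xorg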

Please post the output of
ls -l /dev/nvidi*

Oddly, if I check after a fresh reboot of the server, the device nodes are not created:

[root@compute1 martin]# ls -l /dev/nvidi*
ls: cannot access ‘/dev/nvidi*’: No such file or directory

If I execute, for example, nvidia-smi once first

[root@compute1 cendio]# nvidia-smi 
Tue Sep 27 14:08:52 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.154                Driver Version: 390.154                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 2000         Off  | 00000000:02:00.0 Off |                  N/A |
| 30%   46C    P0    N/A /  N/A |      0MiB /   964MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

and then they are created:

[root@compute1 cendio]# ls /dev/nvi*
/dev/nvidia0  /dev/nvidiactl

dmesg output after nvidia-smi:

[  269.721744] resource sanity check: requesting [mem 0x000e0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000effff window]
[  269.721839] caller _nv027866rm+0x58/0x90 [nvidia] mapping multiple BARs
[  269.820828] resource sanity check: requesting [mem 0x000a0000-0x000bffff], which spans more than PCI Bus 0000:20 [mem 0x000a0000-0x000b0000 window]
[  269.820912] caller _nv001015rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[  269.821521] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000effff window]
[  269.821601] caller _nv001015rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs

Still the same errors in Xorg.0.log, though.
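For reference, when nothing creates the device nodes at boot, the driver README describes creating them by hand. A sketch (run as root; 195 is the NVIDIA character-device major, as the ls -l output further down confirms):

#!/bin/bash
# load the kernel module, then create the device nodes it needs
/sbin/modprobe nvidia || exit 1
# one /dev/nvidiaN node per NVIDIA VGA/3D controller on the PCI bus
N=$(lspci | grep -i nvidia | grep -cE '3D controller|VGA compatible controller')
for i in $(seq 0 $((N - 1))); do
    mknod -m 666 /dev/nvidia$i c 195 $i
done
mknod -m 666 /dev/nvidiactl c 195 255

Enabling the nvidia-persistenced service is another way to keep the driver initialized without having to run nvidia-smi first.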

Please check whether the modules are loaded after a boot:
lsmod |grep nvidia
Please also post the output of
ls -l /dev/nvidi*
The permissions are important.

[root@compute1 ~]# uptime
 14:47:59 up 1 min,  1 user,  load average: 1.13, 0.48, 0.18


[root@compute1 ~]# lsmod |grep nvidia
nvidia_drm             57344  0
nvidia_modeset       1060864  1 nvidia_drm
nvidia              15892480  1 nvidia_modeset
drm_kms_helper        311296  1 nvidia_drm
drm                   634880  3 drm_kms_helper,nvidia_drm
ipmi_msghandler       126976  2 ipmi_devintf,nvidia


[root@compute1 ~]# ls -l /dev/nvidi*
ls: cannot access '/dev/nvidi*': No such file or directory


[root@compute1 ~]# nvidia-smi 
Tue Sep 27 14:48:09 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.154                Driver Version: 390.154                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 2000         Off  | 00000000:02:00.0 Off |                  N/A |
| 30%   45C    P0    N/A /  N/A |      0MiB /   964MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


[root@compute1 ~]# lsmod |grep nvidia
nvidia_drm             57344  0
nvidia_modeset       1060864  1 nvidia_drm
nvidia              15892480  1 nvidia_modeset
drm_kms_helper        311296  1 nvidia_drm
drm                   634880  3 drm_kms_helper,nvidia_drm
ipmi_msghandler       126976  2 ipmi_devintf,nvidia


[root@compute1 ~]# ls -l /dev/nvidi*
crw-rw-rw-. 1 root root 195,   0 Sep 27 14:48 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Sep 27 14:48 /dev/nvidiactl

Please try creating /etc/modprobe.d/nvidia.conf containing:

options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0666 NVreg_ModifyDeviceFiles=1
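If the nvidia module ends up in the initramfs, the new options may not be picked up until it is rebuilt; on a dracut-based system (the RHEL-family default) that would be:

# rebuild the initramfs for the running kernel, then reboot
dracut -f
reboot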

I’m not able to test that right now.
I had to install RHEL 8 to verify some things, and there the same driver package works just fine.

Are you interested in any logs collected from this installation for reference?

No, that won’t help.

Hello again @generix

The server has been reinstalled with Rocky 9 again, and the same issue persists after applying your recommended /etc/modprobe.d/nvidia.conf.