390.154 with Xorg 1.20.11 on Rocky Linux 9 "No Devices Found" and GetCaptureBufferSize failed

Installation of 390.154 on CentOS 7 works without any problems at all. X.org starts fine and detects my GPU.

However, with a fresh installation of Rocky Linux 9 things are not looking as good. The driver installation completes without any errors, but when switching to graphical.target, /var/log/Xorg.0.log reports:

[  2577.261] (II) NVIDIA dlloader X Driver  390.154  Wed Jun 22 04:48:53 UTC 2022
[  2577.261] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[  2577.263] (EE) No devices detected.
[  2577.264] (EE) 
Fatal server error:
[  2577.264] (EE) no screens found(EE)

I compared /proc/driver/nvidia/gpus/0000:02:00.0/information between Rocky 9 and CentOS 7. On CentOS 7:

Model: 		 Quadro 2000
IRQ:   		 32
GPU UUID: 	 GPU-b2085c8f-4dd7-a830-546c-d48d3a2e2e2e
Video BIOS: 	 70.06.4b.00.05
Bus Type: 	 PCIe
DMA Size: 	 40 bits
DMA Mask: 	 0xffffffffff
Bus Location: 	 0000:02:00.0
Device Minor: 	 0

And on Rocky 9:

Model: 		 Quadro 2000
IRQ:   		 38
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 40 bits
DMA Mask: 	 0xffffffffff
Bus Location: 	 0000:02:00.0
Device Minor: 	 0

Those are about the only differences I’ve found so far. I’m not sure what to make of this information, though.
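A quick way to produce such a comparison in one step, with a placeholder hostname for the CentOS 7 machine:

# diff the two information files directly; "centos7-box" is hypothetical
diff <(ssh centos7-box cat /proc/driver/nvidia/gpus/0000:02:00.0/information) \
     /proc/driver/nvidia/gpus/0000:02:00.0/information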

I executed nvidia-bug-report.sh (attached) and found that, right after /bin/nvidia-debugdump -D, some errors are reported:

ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
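To see whether the failure is reproducible outside the script, the same dump step can be re-run directly (path and -D flag as invoked by nvidia-bug-report.sh per the log above):

# re-run the dump step that fails inside nvidia-bug-report.sh
/bin/nvidia-debugdump -D
# then check the kernel ring buffer for related driver messages
dmesg | grep -i nvidia | tail -n 20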

I am uncertain whether this is an issue with the driver package itself or a kernel issue.
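One thing worth ruling out is X autodetection itself, by pinning the card explicitly. A minimal sketch (untested here; the file name is arbitrary, and the BusID is derived from the bus location 0000:02:00.0 above), e.g. in /etc/X11/xorg.conf.d/10-nvidia.conf:

Section "Device"
    Identifier "NvidiaCard"
    Driver     "nvidia"
    BusID      "PCI:2:0:0"
EndSection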

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

nvidia-bug-report.log.gz (546.4 KB)
Hello @generix

Oh sorry, I thought I had done that in my original post; something must have gone wrong.
I’m trying to attach it here again.

Kind regards,
Martin

The driver seems to be fine; Xorg just seems to be blocked from using it. Have you tried disabling SELinux?

I’ve tried setenforce 0, and also disabled it more permanently by updating the kernel parameters and rebooting
(grubby --update-kernel ALL --args selinux=0).

To no avail…
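To double-check that SELinux was really off and that nothing was being denied, something along these lines should work (a sketch; requires the auditd tools):

# confirm the current SELinux mode
getenforce
# list recent AVC denials mentioning Xorg, if any
ausearch -m avc -ts recent 2>/dev/null | grep -i xorg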

Please post the output of
ls -l /dev/nvidi*

Oddly, if I check after a fresh reboot of the server, the device nodes are not created:

[root@compute1 martin]# ls -l /dev/nvidi*
ls: cannot access ‘/dev/nvidi*’: No such file or directory

If I execute, for example, nvidia-smi once first

[root@compute1 cendio]# nvidia-smi 
Tue Sep 27 14:08:52 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.154                Driver Version: 390.154                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 2000         Off  | 00000000:02:00.0 Off |                  N/A |
| 30%   46C    P0    N/A /  N/A |      0MiB /   964MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

and then they are created:

[root@compute1 cendio]# ls /dev/nvi*
/dev/nvidia0  /dev/nvidiactl

dmesg output after nvidia-smi:

[  269.721744] resource sanity check: requesting [mem 0x000e0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000effff window]
[  269.721839] caller _nv027866rm+0x58/0x90 [nvidia] mapping multiple BARs
[  269.820828] resource sanity check: requesting [mem 0x000a0000-0x000bffff], which spans more than PCI Bus 0000:20 [mem 0x000a0000-0x000b0000 window]
[  269.820912] caller _nv001015rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[  269.821521] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000effff window]
[  269.821601] caller _nv001015rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs

Still the same errors in Xorg.0.log, though.
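For reference, when nothing creates the device nodes at boot, the driver README describes creating them by hand. A sketch (run as root; 195 is the NVIDIA character-device major, as the ls -l output further down confirms):

#!/bin/bash
# load the kernel module, then create the device nodes it needs
/sbin/modprobe nvidia || exit 1
# one /dev/nvidiaN node per NVIDIA VGA/3D controller on the PCI bus
N=$(lspci | grep -i nvidia | grep -cE '3D controller|VGA compatible controller')
for i in $(seq 0 $((N - 1))); do
    mknod -m 666 /dev/nvidia$i c 195 $i
done
mknod -m 666 /dev/nvidiactl c 195 255

Enabling the nvidia-persistenced service is another way to keep the driver initialized without having to run nvidia-smi first.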

Please check whether the modules are loaded after a boot:
lsmod |grep nvidia
Please also post the output of
ls -l /dev/nvidi*
The permissions are important.

[root@compute1 ~]# uptime
 14:47:59 up 1 min,  1 user,  load average: 1.13, 0.48, 0.18


[root@compute1 ~]# lsmod |grep nvidia
nvidia_drm             57344  0
nvidia_modeset       1060864  1 nvidia_drm
nvidia              15892480  1 nvidia_modeset
drm_kms_helper        311296  1 nvidia_drm
drm                   634880  3 drm_kms_helper,nvidia_drm
ipmi_msghandler       126976  2 ipmi_devintf,nvidia


[root@compute1 ~]# ls -l /dev/nvidi*
ls: cannot access '/dev/nvidi*': No such file or directory


[root@compute1 ~]# nvidia-smi 
Tue Sep 27 14:48:09 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.154                Driver Version: 390.154                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 2000         Off  | 00000000:02:00.0 Off |                  N/A |
| 30%   45C    P0    N/A /  N/A |      0MiB /   964MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


[root@compute1 ~]# lsmod |grep nvidia
nvidia_drm             57344  0
nvidia_modeset       1060864  1 nvidia_drm
nvidia              15892480  1 nvidia_modeset
drm_kms_helper        311296  1 nvidia_drm
drm                   634880  3 drm_kms_helper,nvidia_drm
ipmi_msghandler       126976  2 ipmi_devintf,nvidia


[root@compute1 ~]# ls -l /dev/nvidi*
crw-rw-rw-. 1 root root 195,   0 Sep 27 14:48 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Sep 27 14:48 /dev/nvidiactl

Please try creating /etc/modprobe.d/nvidia.conf containing:

options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0666 NVreg_ModifyDeviceFiles=1
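If the nvidia module ends up in the initramfs, the new options may not be picked up until it is rebuilt; on a dracut-based system (the RHEL-family default) that would be:

# rebuild the initramfs for the running kernel, then reboot
dracut -f
reboot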

I’m not able to test that right now.
I had to install RHEL 8 to verify some things, and there the same driver package works just fine.

Are you interested in any logs collected from this installation for reference?

No, that won’t help.

Hello again @generix

The server has been reinstalled with Rocky 9 again, and the same issue persists after applying your recommended /etc/modprobe.d/nvidia.conf.