Installation of 390.154 on CentOS7 works without any problems at all. X.org starts fine and detects my GPU.
However, with a fresh installation of Rocky Linux 9, things are not looking as good. The driver installation completes without any errors, but when switching to graphical.target, /var/log/Xorg.0.log reports:
[ 2577.261] (II) NVIDIA dlloader X Driver 390.154 Wed Jun 22 04:48:53 UTC 2022
[ 2577.261] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[ 2577.263] (EE) No devices detected.
[ 2577.264] (EE)
Fatal server error:
[ 2577.264] (EE) no screens found(EE)
I compared some stats from /proc/driver/nvidia/gpus/0000:02:00.0/information on the two systems. On CentOS 7:
Model: Quadro 2000
IRQ: 32
GPU UUID: GPU-b2085c8f-4dd7-a830-546c-d48d3a2e2e2e
Video BIOS: 70.06.4b.00.05
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:02:00.0
Device Minor: 0
And Rocky 9:
Model: Quadro 2000
IRQ: 38
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 40 bits
DMA Mask: 0xffffffffff
Bus Location: 0000:02:00.0
Device Minor: 0
That’s about the only difference I’ve found so far. I’m not sure what to make of this information, though.
I executed nvidia-bug-report.sh (attached) and found that, right after /bin/nvidia-debugdump -D, some errors are reported:
ERROR: GetCaptureBufferSize failed, Unknown Error, bufSize: 0x0
ERROR: internal_getDumpBuffer failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
I am uncertain if this is an issue with the driver package itself, or a kernel issue.
generix
September 27, 2022, 9:19am
2
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
nvidia-bug-report.log.gz (546.4 KB)
Hello @generix
Oh sorry, I thought I did that in my original post, something must have gone wrong.
I’m trying to attach it here.
Kind regards,
Martin
generix
September 27, 2022, 9:46am
4
The driver seems to be fine, just xorg seems to be blocked from using it. Tried disabling SELinux?
I’ve tried setenforce 0, and also disabling it more permanently by updating the kernel parameters and rebooting (grubby --update-kernel ALL --args selinux=0).
To no avail…
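For completeness, the SELinux state after such a change can be double-checked like this (a sketch; getenforce/sestatus come from the standard SELinux tools, and the audit log path is the RHEL/Rocky default):

```shell
# Verify SELinux really is off after the reboot:
getenforce              # expect "Disabled" (or at least "Permissive")
sestatus | head -n 3    # overall status summary

# If it is still enforcing, look for recent AVC denials involving Xorg:
grep -i 'avc.*Xorg' /var/log/audit/audit.log | tail
```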
generix
September 27, 2022, 11:06am
6
Please post the output of
ls -l /dev/nvidi*
generix:
ls -l /dev/nvidi*
Oddly, if I check right after a fresh reboot of the server, the device nodes are not created:
[root@compute1 martin]# ls -l /dev/nvidi*
ls: cannot access ‘/dev/nvidi*’: No such file or directory
If I first run nvidia-smi once, for example:
[root@compute1 cendio]# nvidia-smi
Tue Sep 27 14:08:52 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.154 Driver Version: 390.154 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro 2000 Off | 00000000:02:00.0 Off | N/A |
| 30% 46C P0 N/A / N/A | 0MiB / 964MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
and then they are created:
[root@compute1 cendio]# ls /dev/nvi*
/dev/nvidia0 /dev/nvidiactl
dmesg output after nvidia-smi:
[ 269.721744] resource sanity check: requesting [mem 0x000e0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000effff window]
[ 269.721839] caller _nv027866rm+0x58/0x90 [nvidia] mapping multiple BARs
[ 269.820828] resource sanity check: requesting [mem 0x000a0000-0x000bffff], which spans more than PCI Bus 0000:20 [mem 0x000a0000-0x000b0000 window]
[ 269.820912] caller _nv001015rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[ 269.821521] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000effff window]
[ 269.821601] caller _nv001015rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
Still same errors in Xorg.0.log though.
generix
September 27, 2022, 12:37pm
8
Please check whether the modules are loaded after a boot
lsmod |grep nvidia
Please post the output of
ls -l /dev/nvidi*
The permissions are important.
[root@compute1 ~]# uptime
14:47:59 up 1 min, 1 user, load average: 1.13, 0.48, 0.18
[root@compute1 ~]# lsmod |grep nvidia
nvidia_drm 57344 0
nvidia_modeset 1060864 1 nvidia_drm
nvidia 15892480 1 nvidia_modeset
drm_kms_helper 311296 1 nvidia_drm
drm 634880 3 drm_kms_helper,nvidia_drm
ipmi_msghandler 126976 2 ipmi_devintf,nvidia
[root@compute1 ~]# ls -l /dev/nvidi*
ls: cannot access '/dev/nvidi*': No such file or directory
[root@compute1 ~]# nvidia-smi
Tue Sep 27 14:48:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.154 Driver Version: 390.154 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro 2000 Off | 00000000:02:00.0 Off | N/A |
| 30% 45C P0 N/A / N/A | 0MiB / 964MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@compute1 ~]# lsmod |grep nvidia
nvidia_drm 57344 0
nvidia_modeset 1060864 1 nvidia_drm
nvidia 15892480 1 nvidia_modeset
drm_kms_helper 311296 1 nvidia_drm
drm 634880 3 drm_kms_helper,nvidia_drm
ipmi_msghandler 126976 2 ipmi_devintf,nvidia
[root@compute1 ~]# ls -l /dev/nvidi*
crw-rw-rw-. 1 root root 195, 0 Sep 27 14:48 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Sep 27 14:48 /dev/nvidiactl
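Since the device nodes only appear once something opens the GPU, a common stop-gap (an assumption on my part, not something suggested in the thread) is a oneshot unit that runs nvidia-modprobe before the display manager starts; the unit name below is hypothetical:

```ini
# /etc/systemd/system/nvidia-dev-nodes.service (hypothetical name)
[Unit]
Description=Create /dev/nvidia* device nodes before X starts
Before=display-manager.service

[Service]
Type=oneshot
# nvidia-modprobe ships with the driver; -c 0 creates /dev/nvidiactl and /dev/nvidia0
ExecStart=/usr/bin/nvidia-modprobe -c 0
RemainAfterExit=yes

[Install]
WantedBy=graphical.target
```

Enable it with systemctl enable nvidia-dev-nodes.service. This only papers over the symptom, though; fixing device-file creation at module load time is the cleaner route.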
generix
September 27, 2022, 2:58pm
10
Please try creating /etc/modprobe.d/nvidia.conf
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0666 NVreg_ModifyDeviceFiles=1
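A sketch of how this could be applied (assumptions: a root shell, and the dracut-built initramfs that Rocky 9 uses, so the options are also seen if the module loads during early boot):

```shell
# Create the modprobe config with the suggested options (run as root):
cat > /etc/modprobe.d/nvidia.conf <<'EOF'
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0666 NVreg_ModifyDeviceFiles=1
EOF

# Rebuild the initramfs so the options apply on early module load as well:
dracut -f

reboot
```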
I’m not able to test that right now.
I had to install RHEL8 to verify some things, and there the same driver package works just fine.
Are you interested in any logs collected from this installation for reference?
Hello again @generix
Server reinstalled with Rocky 9 again, and the same issue persists after applying your recommended modprobe.d/nvidia.conf.
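In case it helps further debugging: whether the options were actually picked up by the loaded module can be checked via /proc (a sketch; these are the standard driver and dracut interfaces):

```shell
# Show the parameters the loaded nvidia module is actually using:
grep -E 'DeviceFile|ModifyDeviceFiles' /proc/driver/nvidia/params

# Confirm the conf file made it into the current initramfs:
lsinitrd | grep -i 'modprobe.d/nvidia'
```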