Hello, we are facing the following problem.
Our previous server configuration was Rocky 8 with kernel 5.14, or possibly something closer to 6.8.
One of the servers has 6 Tesla T4 cards.
nvidia-smi worked without problems and showed all of the graphics cards.
We urgently needed to upgrade the system to Rocky 9 and kernel 6.8+ (in our case, yum updated it to kernel-ml 6.12.9).
I don’t remember what the nvidia-smi and CUDA versions were :(
After the upgrade, we started having problems:
nvidia-smi does not show all of the graphics cards as it used to.
Running nvidia-smi can cause the server to restart.
After installing nvidia-driver and restarting the server, the server can crash and restart again.
The key errors in the kernel log are:
[ 6.072708] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 570.86.15 Thu Jan 23 23:23:10 UTC 2025
[ 6.126896] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 6.169033] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 6.209091] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 570.86.15 Thu Jan 23 22:30:06 UTC 2025
[ 6.215808] [drm] [nvidia-drm] [GPU ID 0x00001b00] Loading driver
[ 7.597873] [drm] Initialized nvidia-drm 0.0.0 for 0000:1b:00.0 on minor 1
[ 7.597890] nvidia 0000:1b:00.0: [drm] No compatible format found
[ 7.597893] nvidia 0000:1b:00.0: [drm] Cannot find any crtc or sizes
[ 7.598099] [drm] [nvidia-drm] [GPU ID 0x00001c00] Loading driver
[ 8.638439] [drm] Initialized nvidia-drm 0.0.0 for 0000:1c:00.0 on minor 2
[ 8.638459] nvidia 0000:1c:00.0: [drm] No compatible format found
[ 8.638462] nvidia 0000:1c:00.0: [drm] Cannot find any crtc or sizes
[ 8.638658] [drm] [nvidia-drm] [GPU ID 0x00001e00] Loading driver
[ 9.684298] [drm] Initialized nvidia-drm 0.0.0 for 0000:1e:00.0 on minor 3
[ 9.684314] nvidia 0000:1e:00.0: [drm] No compatible format found
[ 9.684317] nvidia 0000:1e:00.0: [drm] Cannot find any crtc or sizes
[ 9.684548] [drm] [nvidia-drm] [GPU ID 0x00003f00] Loading driver
[ 10.591535] resource: resource sanity check: requesting [mem 0x00000000b7700000-0x00000000b86fffff], which spans more than PCI Bus 0000:3b [mem 0xb5000000-0xb84fffff]
[ 10.591541] caller _nv046819rm+0x3a/0xb0 [nvidia] mapping multiple BARs
[ 10.600495] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00003f00] Failed to allocate NvKmsKapiDevice
[ 10.600679] [drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00003f00] Failed to register device
[ 10.600836] [drm] [nvidia-drm] [GPU ID 0x00004000] Loading driver
[ 11.819075] resource: resource sanity check: requesting [mem 0x00000000b5700000-0x00000000b66fffff], which spans more than PCI Bus 0000:40 [mem 0xb5000000-0xb63fffff]
[ 11.819083] caller _nv046819rm+0x3a/0xb0 [nvidia] mapping multiple BARs
[ 11.827706] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004000] Failed to allocate NvKmsKapiDevice
[ 11.827827] [drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004000] Failed to register device
[ 11.827974] [drm] [nvidia-drm] [GPU ID 0x00005e00] Loading driver
[ 13.137452] [drm] Initialized nvidia-drm 0.0.0 for 0000:5e:00.0 on minor 4
[ 13.137470] nvidia 0000:5e:00.0: [drm] No compatible format found
[ 13.137473] nvidia 0000:5e:00.0: [drm] Cannot find any crtc or sizes
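The two "resource sanity check" messages in the log above can be checked by hand: in each case the driver requests a memory range whose end lies beyond the window of the PCI bus the GPU sits on. Below is a minimal sketch of that arithmetic in bash, using only the addresses from the log. The function name `range_overflow` is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print how many bytes a requested range
# [req_start, req_end] extends outside a window [win_start, win_end].
# Prints 0 if the request fits entirely inside the window.
range_overflow() {
    local req_start=$(( $1 )) req_end=$(( $2 ))
    local win_start=$(( $3 )) win_end=$(( $4 ))
    local over=0
    if (( req_start < win_start )); then
        over=$(( over + win_start - req_start ))
    fi
    if (( req_end > win_end )); then
        over=$(( over + req_end - win_end ))
    fi
    echo "$over"
}

# GPU 0x00003f00: request b7700000-b86fffff, bus 0000:3b window b5000000-b84fffff
range_overflow 0xb7700000 0xb86fffff 0xb5000000 0xb84fffff   # prints 2097152 (2 MiB)
# GPU 0x00004000: request b5700000-b66fffff, bus 0000:40 window b5000000-b63fffff
range_overflow 0xb5700000 0xb66fffff 0xb5000000 0xb63fffff   # prints 3145728 (3 MiB)
```

Both requests overshoot their bus window by a few MiB, which would be consistent with the bridge windows being too small for the BARs behind them, although the log alone does not prove that.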
We tried changing /etc/default/grub.
We tried installing different driver versions listed for Tesla T4 | Linux 64-bit RHEL 9: 570.86.15, 550.144.03, 535.230.02, 550.127.08.
We tried installing via the .run file.
Nothing helped :(
How can we solve this problem?
And is it possible for nvidia-driver to work with Rocky 9.5 and kernel 6.12.9?
As for the driver version, we waited for newer releases, but they did not help either.
We tried swapping the video cards around, but it did not help; the only effect was that the number of cards in the nvidia-smi table dropped to 3.
In the BIOS, Above 4G Decoding was already enabled; we did not find any other relevant settings.
Can anyone help? It’s been so long, and we don’t have any results…
Even when 3 of the 6 video cards stopped working in nvidia-smi,
the errors were still the same:
[root@scanh2-4 ~]# dmesg | grep -i 41:00.0
[ 0.569634] pci 0000:41:00.0: [10de:1eb8] type 00 class 0x030200 PCIe Endpoint
[ 0.569654] pci 0000:41:00.0: BAR 0 [mem 0xb5000000-0xb5ffffff]
[ 0.569670] pci 0000:41:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]
[ 0.569687] pci 0000:41:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]
[ 0.569711] pci 0000:41:00.0: enabling Extended Tags
[ 0.569738] pci 0000:41:00.0: Enabling HDA controller
[ 0.569800] pci 0000:41:00.0: PME# supported from D0 D3hot D3cold
[ 0.569842] pci 0000:41:00.0: VF BAR 0 [mem 0xb6000000-0xb603ffff]
[ 0.569843] pci 0000:41:00.0: VF BAR 0 [mem 0xb6000000-0xb63fffff]: contains BAR 0 for 16 VFs
[ 0.569852] pci 0000:41:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afd8fffffff 64bit pref]
[ 0.569854] pci 0000:41:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: contains BAR 1 for 16 VFs
[ 0.569863] pci 0000:41:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afe91ffffff 64bit pref]
[ 0.569865] pci 0000:41:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: contains BAR 3 for 16 VFs
[ 0.594141] pci 0000:41:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.594143] pci 0000:41:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.594144] pci 0000:41:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.594145] pci 0000:41:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.628696] pci 0000:41:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[ 0.628698] pci 0000:41:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: failed to assign
[ 0.628699] pci 0000:41:00.0: VF BAR 1 [mem size 0x100000000 64bit pref]: can't assign; no space
[ 0.628700] pci 0000:41:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: failed to assign
[ 0.628702] pci 0000:41:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[ 0.628703] pci 0000:41:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: failed to assign
[ 0.628705] pci 0000:41:00.0: VF BAR 3 [mem size 0x20000000 64bit pref]: can't assign; no space
[ 0.628706] pci 0000:41:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: failed to assign
[ 0.628708] pci 0000:41:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[ 0.628709] pci 0000:41:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: failed to assign
[ 0.628710] pci 0000:41:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[ 0.628712] pci 0000:41:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: failed to assign
[ 0.628713] pci 0000:41:00.0: VF BAR 3 [mem size 0x20000000 64bit pref]: can't assign; no space
[ 0.628714] pci 0000:41:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: failed to assign
[ 0.628716] pci 0000:41:00.0: VF BAR 1 [mem size 0x100000000 64bit pref]: can't assign; no space
[ 0.628717] pci 0000:41:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: failed to assign
[ 4.183433] nvidia 0000:41:00.0: enabling device (0100 -> 0102)
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:41:00.0)
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:41:00.0)
NVRM: BAR3 is 0M @ 0x0 (PCI:0000:41:00.0)
NVRM: BAR4 is 0M @ 0x0 (PCI:0000:41:00.0)
NVRM: BAR5 is 0M @ 0x0 (PCI:0000:41:00.0)
[ 4.308053] [drm] Initialized nvidia-drm 0.0.0 for 0000:41:00.0 on minor 5
[ 29.003897] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x24:0x72:1513)
[ 29.004003] NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 4
[ 1386.646514] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[ 1386.647158] NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 4
[ 2292.508048] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[ 2292.508175] NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 4
[ 2867.208641] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[ 2867.208744] NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 4
[ 3404.059292] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[ 3404.059426] NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 4
[ 3421.906454] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[ 3421.907209] NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 4
[ 5989.456130] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x62:0x40:2521)
[ 5989.456829] NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 4
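With six cards, it helps to see at a glance which devices the kernel could not place. The sketch below counts the "can't assign; no space" lines per PCI address. It assumes the log lines have the same shape as the ones pasted above, and the function name `bar_failures` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and print
# "<PCI address> <count of "can't assign; no space" lines>" per device.
bar_failures() {
    # can.{1,3}t accepts both the ASCII apostrophe and a typographic one
    # (the latter shows up when logs are copied out of a web page).
    sed -nE 's/.*pci ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): .*can.{1,3}t assign; no space.*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on a live system (dmesg usually needs root):
#   dmesg | bar_failures
```

For the 0000:41:00.0 excerpt above, this would report 8 such lines for that single device. A device with all of its 64-bit BARs unassigned is the one to expect "BAR1 is 0M @ 0x0" and RmInitAdapter failures from.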
Even when we built kernel 6.8.5 ourselves and installed it, it didn’t work for us.
Finally, we realized what the problem was.
We had changed /etc/default/grub dozens of times, but as it turned out, none of the changes took effect.
cat /proc/cmdline shows the parameters the running kernel was actually booted with, and our new parameters were not there.
The cause is the GRUB_ENABLE_BLSCFG parameter in /etc/default/grub: if it is true, GRUB takes the kernel command line from the boot entries in
/boot/loader/entries/…- yourKernelVersion
After we set it to false, the problem on the 6.8.5 kernel was resolved.
(However, this did not solve it on 6.13.5.)
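To see what the boot entries really contain, the `options` line of each entry file can be printed and compared with /proc/cmdline. This is only a sketch: the function name `bls_options` is made up, and it assumes the entries are `*.conf` files with a line starting with `options`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the "options" line of every boot entry in a
# directory (default: /boot/loader/entries), prefixed with the file name.
bls_options() {
    local dir=${1:-/boot/loader/entries}
    local f
    for f in "$dir"/*.conf; do
        [ -e "$f" ] || continue   # no entries found, or directory not readable
        printf '%s: %s\n' "$(basename "$f")" "$(grep -m1 '^options' "$f")"
    done
}

# Usage (reading /boot/loader/entries usually needs root):
#   bls_options
#   cat /proc/cmdline    # what the running kernel was actually booted with
```

If BLS stays enabled, `grubby --update-kernel=ALL --args="..."` is the tool commonly used on RHEL-family systems to edit these entries. That is an alternative we did not try here.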
Our kernel parameters in GRUB are:
pci=use_crs pci=realloc=on pci=assign-busses
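Since the whole problem was that these parameters never reached the kernel, it is worth checking after every reboot. A minimal sketch, where the function name `missing_params` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every expected parameter that is absent from a
# kernel command line. Usage: missing_params "<cmdline>" param1 param2 ...
missing_params() {
    local cmdline=" $1 "
    shift
    local p
    for p in "$@"; do
        # Match whole space-separated words only.
        case "$cmdline" in
            *" $p "*) ;;
            *) echo "$p" ;;
        esac
    done
}

# Prints nothing if all three parameters are active in the running kernel.
missing_params "$(cat /proc/cmdline)" pci=use_crs pci=realloc=on pci=assign-busses
```

Any parameter it prints was set in the GRUB configuration but is not in effect for the running kernel.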