Failed to allocate NvKmsKapiDevice and Failed to register device (Rocky 9.5 and kernel 6.12.9)

Hello, we have run into the following problem.
Our servers previously ran Rocky 8 with kernel 5.14 (or possibly something around 6.8; we are not sure).
One of the servers has 6 Tesla T4 graphics cards.
nvidia-smi worked without problems and showed all of them.

We urgently needed to upgrade the system to Rocky 9 and kernel 6.8+ (in our case, yum updated it to kernel-ml 6.12.9).
I don't remember which nvidia-smi and CUDA versions were installed :(

After that, we started having problems:

  • nvidia-smi no longer shows all of the graphics cards
  • Running nvidia-smi can cause the server to restart
  • After installing nvidia-driver and rebooting, the server can crash and restart again

The key errors in dmesg are:

[ 6.072708] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 570.86.15 Thu Jan 23 23:23:10 UTC 2025
[ 6.126896] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 6.169033] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 6.209091] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 570.86.15 Thu Jan 23 22:30:06 UTC 2025
[ 6.215808] [drm] [nvidia-drm] [GPU ID 0x00001b00] Loading driver
[ 7.597873] [drm] Initialized nvidia-drm 0.0.0 for 0000:1b:00.0 on minor 1
[ 7.597890] nvidia 0000:1b:00.0: [drm] No compatible format found
[ 7.597893] nvidia 0000:1b:00.0: [drm] Cannot find any crtc or sizes
[ 7.598099] [drm] [nvidia-drm] [GPU ID 0x00001c00] Loading driver
[ 8.638439] [drm] Initialized nvidia-drm 0.0.0 for 0000:1c:00.0 on minor 2
[ 8.638459] nvidia 0000:1c:00.0: [drm] No compatible format found
[ 8.638462] nvidia 0000:1c:00.0: [drm] Cannot find any crtc or sizes
[ 8.638658] [drm] [nvidia-drm] [GPU ID 0x00001e00] Loading driver
[ 9.684298] [drm] Initialized nvidia-drm 0.0.0 for 0000:1e:00.0 on minor 3
[ 9.684314] nvidia 0000:1e:00.0: [drm] No compatible format found
[ 9.684317] nvidia 0000:1e:00.0: [drm] Cannot find any crtc or sizes
[ 9.684548] [drm] [nvidia-drm] [GPU ID 0x00003f00] Loading driver
[ 10.591535] resource: resource sanity check: requesting [mem 0x00000000b7700000-0x00000000b86fffff], which spans more than PCI Bus 0000:3b [mem 0xb5000000-0xb84fffff]
[ 10.591541] caller _nv046819rm+0x3a/0xb0 [nvidia] mapping multiple BARs
[ 10.600495] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00003f00] Failed to allocate NvKmsKapiDevice
[ 10.600679] [drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00003f00] Failed to register device
[ 10.600836] [drm] [nvidia-drm] [GPU ID 0x00004000] Loading driver
[ 11.819075] resource: resource sanity check: requesting [mem 0x00000000b5700000-0x00000000b66fffff], which spans more than PCI Bus 0000:40 [mem 0xb5000000-0xb63fffff]
[ 11.819083] caller _nv046819rm+0x3a/0xb0 [nvidia] mapping multiple BARs
[ 11.827706] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004000] Failed to allocate NvKmsKapiDevice
[ 11.827827] [drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004000] Failed to register device
[ 11.827974] [drm] [nvidia-drm] [GPU ID 0x00005e00] Loading driver
[ 13.137452] [drm] Initialized nvidia-drm 0.0.0 for 0000:5e:00.0 on minor 4
[ 13.137470] nvidia 0000:5e:00.0: [drm] No compatible format found
[ 13.137473] nvidia 0000:5e:00.0: [drm] Cannot find any crtc or sizes
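
To put numbers on the two "resource sanity check" messages above, here is a minimal Python sketch. It only does arithmetic on the addresses printed in the log; the function name overrun and the regular expression are illustrative assumptions, not part of any NVIDIA or kernel tool.

```python
import re

# Pattern for the kernel's "resource sanity check" message, written to match
# the two lines in the log above (an assumption about the message layout).
SANITY_RE = re.compile(
    r"requesting \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\], "
    r"which spans more than (.+?) \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]",
    re.IGNORECASE,
)


def overrun(line):
    """Return (window_name, bytes_below_window, bytes_above_window) or None."""
    m = SANITY_RE.search(line)
    if m is None:
        return None
    req_start, req_end = int(m.group(1), 16), int(m.group(2), 16)
    win_start, win_end = int(m.group(4), 16), int(m.group(5), 16)
    below = max(0, win_start - req_start)  # bytes requested before the window starts
    above = max(0, req_end - win_end)      # bytes requested after the window ends
    return m.group(3), below, above


# The two messages from the log, with the timestamps removed.
LINES = [
    "resource: resource sanity check: requesting "
    "[mem 0x00000000b7700000-0x00000000b86fffff], which spans more than "
    "PCI Bus 0000:3b [mem 0xb5000000-0xb84fffff]",
    "resource: resource sanity check: requesting "
    "[mem 0x00000000b5700000-0x00000000b66fffff], which spans more than "
    "PCI Bus 0000:40 [mem 0xb5000000-0xb63fffff]",
]

for line in LINES:
    name, below, above = overrun(line)
    print(f"{name}: request ends {above // 2**20} MiB past the window")
```

For the two lines above this reports that the requested ranges end 2 MiB and 3 MiB past the end of the bridge windows of PCI Bus 0000:3b and PCI Bus 0000:40.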

We tried changing settings in /etc/default/grub.
We tried installing different driver versions listed as suitable for Tesla T4 | Linux 64-bit RHEL 9: 570.86.15, 550.144.03, 535.230.02, 550.127.08.
We tried installing via the .run file.
None of this helped :(

How can we solve this problem?
And can the NVIDIA driver work with Rocky 9.5 and kernel 6.12.9 at all?

nvidia-smi lists only four of the six cards, while lspci shows all six:

nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-7e72779e-00d4-6c68-ba84-7726909da764)
GPU 1: Tesla T4 (UUID: GPU-79405ede-bba6-5c34-d48d-1ab4d1d48a8e)
GPU 2: Tesla T4 (UUID: GPU-e6c0ca41-b425-75d8-68c3-6751367eb5b7)
GPU 3: Tesla T4 (UUID: GPU-a1165eb6-a4e4-3a89-0e91-d7e8e20d717f)
[root@scanh2-4 ~]# lspci | grep -i nvidia
1b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
1c:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
1e:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
3f:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
40:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
5e:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
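
The failing GPU IDs from dmesg can be matched against the lspci list with a small Python sketch. It assumes that the second-lowest byte of the GPU ID is the PCI bus number and the lowest byte is the device number; the log suggests this (GPU ID 0x00001b00 is followed by 0000:1b:00.0), but it is an assumption, and the helper names are hypothetical.

```python
import re

# lspci output from above, shortened to the address and device name.
LSPCI = """
1b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
1c:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
1e:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
3f:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
40:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
5e:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
"""

# The two "Failed to register device" lines from dmesg, timestamps removed.
DMESG_ERRORS = """
[drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00003f00] Failed to register device
[drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004000] Failed to register device
"""


def lspci_addresses(text):
    """Return the bus:device.function address at the start of each lspci line."""
    return re.findall(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ", text, re.MULTILINE)


def failed_addresses(text):
    """Decode GPU IDs of failed devices into bus:device.0 (assumed encoding)."""
    ids = re.findall(r"GPU ID (0x[0-9a-f]{8})\] Failed to register device", text)
    result = []
    for raw in ids:
        value = int(raw, 16)
        bus, dev = (value >> 8) & 0xFF, value & 0xFF  # assumption, see lead-in
        result.append(f"{bus:02x}:{dev:02x}.0")
    return result


present = lspci_addresses(LSPCI)
failed = failed_addresses(DMESG_ERRORS)
working = [addr for addr in present if addr not in failed]

print("failed: ", failed)
print("working:", working)
```

Under that assumption the two failed devices are 3f:00.0 and 40:00.0, which leaves four cards, the same number that nvidia-smi -L lists.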

dmesg | grep -i 40:00.0
[ 0.691219] pci 0000:40:00.0: [10de:1eb8] type 00 class 0x030200 PCIe Endpoint
[ 0.691238] pci 0000:40:00.0: BAR 0 [mem 0xb5000000-0xb5ffffff]
[ 0.691255] pci 0000:40:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]
[ 0.691271] pci 0000:40:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]
[ 0.691294] pci 0000:40:00.0: enabling Extended Tags
[ 0.691321] pci 0000:40:00.0: Enabling HDA controller
[ 0.691379] pci 0000:40:00.0: PME# supported from D0 D3hot D3cold
[ 0.691418] pci 0000:40:00.0: VF BAR 0 [mem 0xb6000000-0xb603ffff]
[ 0.691420] pci 0000:40:00.0: VF BAR 0 [mem 0xb6000000-0xb63fffff]: contains BAR 0 for 16 VFs
[ 0.691429] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afd8fffffff 64bit pref]
[ 0.691430] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: contains BAR 1 for 16 VFs
[ 0.691439] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afe91ffffff 64bit pref]
[ 0.691440] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: contains BAR 3 for 16 VFs
[ 0.717385] pci 0000:40:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.717387] pci 0000:40:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.717388] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.717390] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: can't claim; no compatible bridge window
[ 0.752175] pci 0000:40:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[ 0.752176] pci 0000:40:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: failed to assign
[ 0.752177] pci 0000:40:00.0: VF BAR 1 [mem size 0x100000000 64bit pref]: can't assign; no space
[ 0.752179] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: failed to assign
[ 0.752180] pci 0000:40:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[ 0.752182] pci 0000:40:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: failed to assign
[ 0.752183] pci 0000:40:00.0: VF BAR 3 [mem size 0x20000000 64bit pref]: can't assign; no space
[ 0.752185] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: failed to assign
[ 0.752186] pci 0000:40:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
[ 0.752188] pci 0000:40:00.0: BAR 1 [mem 0x3afe80000000-0x3afe8fffffff 64bit pref]: failed to assign
[ 0.752189] pci 0000:40:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
[ 0.752190] pci 0000:40:00.0: BAR 3 [mem 0x3afeb0000000-0x3afeb1ffffff 64bit pref]: failed to assign
[ 0.752192] pci 0000:40:00.0: VF BAR 3 [mem size 0x20000000 64bit pref]: can't assign; no space
[ 0.752193] pci 0000:40:00.0: VF BAR 3 [mem 0x3afe90000000-0x3afeafffffff 64bit pref]: failed to assign
[ 0.752195] pci 0000:40:00.0: VF BAR 1 [mem size 0x100000000 64bit pref]: can't assign; no space
[ 0.752196] pci 0000:40:00.0: VF BAR 1 [mem 0x3afd80000000-0x3afe7fffffff 64bit pref]: failed to assign
[ 2.534998] nvidia 0000:40:00.0: enabling device (0100 -> 0102)
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:40:00.0)
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:40:00.0)
NVRM: BAR3 is 0M @ 0x0 (PCI:0000:40:00.0)
NVRM: BAR4 is 0M @ 0x0 (PCI:0000:40:00.0)
NVRM: BAR5 is 0M @ 0x0 (PCI:0000:40:00.0)
[ 4.780729] [drm] Initialized nvidia-drm 0.0.0 for 0000:40:00.0 on minor 5
[ 135.112806] NVRM: GPU 0000:40:00.0: RmInitAdapter failed! (0x24:0x72:1568)
[ 135.112962] NVRM: GPU 0000:40:00.0: rm_init_adapter failed, device minor number 4
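
The "no space" lines for 0000:40:00.0 can be added up to see how much 64-bit prefetchable address space this one card is asking for. Below is a minimal Python sketch with a hypothetical helper unassigned_bars and four of the log lines pasted in as sample input; it only sums the sizes the kernel printed and makes no claim about the cause.

```python
import re

# Matches the kernel's "can't assign; no space" message for a BAR or VF BAR.
# The apostrophe class accepts both the ASCII and the typographic form.
NO_SPACE_RE = re.compile(
    r"pci (\S+): ((?:VF )?BAR \d) \[mem size (0x[0-9a-f]+) 64bit pref\]: "
    r"can['’]t assign; no space"
)

# One "no space" line per BAR from the dmesg output above, timestamps removed.
SAMPLE = """
pci 0000:40:00.0: BAR 1 [mem size 0x10000000 64bit pref]: can't assign; no space
pci 0000:40:00.0: VF BAR 1 [mem size 0x100000000 64bit pref]: can't assign; no space
pci 0000:40:00.0: BAR 3 [mem size 0x02000000 64bit pref]: can't assign; no space
pci 0000:40:00.0: VF BAR 3 [mem size 0x20000000 64bit pref]: can't assign; no space
"""


def unassigned_bars(dmesg_text):
    """Map (device, BAR name) to size in bytes, counting each BAR only once."""
    sizes = {}
    for dev, bar, size in NO_SPACE_RE.findall(dmesg_text):
        sizes[(dev, bar)] = int(size, 16)  # repeated attempts overwrite, not add
    return sizes


sizes = unassigned_bars(SAMPLE)
for (dev, bar), size in sizes.items():
    print(f"{dev} {bar}: {size // 2**20} MiB")

total = sum(sizes.values())
print(f"total: {total // 2**20} MiB")
```

On these lines the total is 4896 MiB (256 MiB for BAR 1, 4096 MiB for VF BAR 1, 32 MiB for BAR 3, 512 MiB for VF BAR 3), none of which the kernel was able to place.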