8x Tesla M40 on Ubuntu Server 24.04 "Failed to allocate NvKmsKapiDevice"

Hi there,

Attempting to get this workhorse up and running. I'm getting NVRM RmInitAdapter failed warnings when nvidia-drm reports loading the driver. After some initial work (finding an appropriate-era driver, I'm using 470-server; blacklisting nouveau; working through some addressing issues), I have arrived at the current roadblock: inability to initialize the adapter(s).
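For anyone following along, the nouveau blacklist was done with the standard modprobe.d snippet (a sketch of what I have; the filename is arbitrary, any .conf under /etc/modprobe.d works):

```shell
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```

Followed by `sudo update-initramfs -u` so the blacklist also applies in the initramfs at boot.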

Dmesg:

nvidia 0000:09:00.0: enabling device (0140 → 0142)
[ 7.221771] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.256.02 Thu May 2 14:37:44 UTC 2024
[ 7.420133] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 470.256.02 Thu May 2 14:50:40 UTC 2024
[ 7.434686] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 7.439371] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[ 7.439434] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 7.442522] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 7.442833] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to register device

nvidia-smi:

jason@gronk:/etc/modprobe.d$ nvidia-smi
No devices were found

lspci -vvv:

08:00.0 3D controller: NVIDIA Corporation GM200GL [Tesla M40] (rev a1)
Subsystem: NVIDIA Corporation GM200GL [Tesla M40]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 18
Region 0: Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Capabilities:
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

inxi -G:

Graphics:
Device-1: Intel Xeon E3-1200 v2/3rd Gen Core processor Graphics driver: i915 v: kernel
Device-2: NVIDIA GM200GL [Tesla M40] driver: nvidia v: 470.256.02
Device-3: NVIDIA GM200GL [Tesla M40] driver: nvidia v: 470.256.02
Device-4: NVIDIA GM200GL [Tesla M40] driver: nvidia v: 470.256.02
Device-5: NVIDIA GM200GL [Tesla M40] driver: nvidia v: 470.256.02
Device-6: NVIDIA GM200GL [Tesla M40] driver: nvidia v: 470.256.02
Device-7: NVIDIA GM200GL [Tesla M40] driver: nvidia v: 470.256.02
Device-8: NVIDIA GM200GL [Tesla M40] driver: nvidia v: 470.256.02
Device-9: NVIDIA GM200GL [Tesla M40] driver: nvidia v: 470.256.02
Display: server: X.org v: 1.21.1.11 driver: gpu: i915 tty: 120x30 resolution: 1600x900
API: EGL v: 1.5 drivers: swrast platforms: surfaceless,device
API: OpenGL v: 4.5 vendor: mesa v: 24.0.9-0ubuntu0.3 note: console (EGL sourced)
renderer: llvmpipe (LLVM 17.0.6 128 bits)

If I set the nvidia-drm modeset option to 0 in nvidia-graphics-drivers-kms.conf, the dmesg errors go away, but I still get no devices from clinfo/nvidia-smi.
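For clarity, the file in question is just a single options line; this is a sketch of what I have, assuming the stock Ubuntu file layout (setting it back to 1 re-enables modesetting):

```shell
# /etc/modprobe.d/nvidia-graphics-drivers-kms.conf
options nvidia-drm modeset=0
```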
nvidia-bug-report.log.gz (143.2 KB)

Hi,

Looking at your bug report, one sees a not uncommon situation: the system is unable to map the required memory BAR regions:

2024-12-31T22:37:14.259302+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259302+00:00 gronk kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259303+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259303+00:00 gronk kernel: NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259309+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259310+00:00 gronk kernel: NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259311+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259311+00:00 gronk kernel: NVRM: BAR4 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259312+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259312+00:00 gronk kernel: NVRM: BAR5 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259336+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:

There was a post here a while ago, unresolved unfortunately:

Note, in the reference I make to the Microway link in the above post, I said:
“BAR1 requirement is shown as 16MB”. It should be 16GB. Unfortunately I can no longer edit that post.
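To see why this matters at the platform level: with eight cards, the aggregate BAR1 mapping alone is far beyond anything a 32-bit PCI window can hold. A quick sanity check, assuming the 16GB-per-card figure above:

```shell
# Assumes the 16 GB BAR1-per-card figure from the Microway reference above.
cards=8
bar1_gb=16
echo "$(( cards * bar1_gb )) GB of 64-bit MMIO space needed"
```

Hence "Above 4G Decoding" (sometimes labelled "Memory Mapped I/O above 4GB") must be enabled in the BIOS, and the platform firmware must actually be able to assign windows that large.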

Here’s a post where Robert covers some things to check, but often vendor specific BIOS tweaks are required, hence Tesla cards normally being supplied in qualified servers.

If you do have a qualified system, then perhaps the BIOS settings need checking.

Thank you for these comments. I certainly do not have a qualified system, so I will have to keep that as a possibility once I've eliminated any other known issues. I had read that a command-line switch was necessary to allow the kernel to reallocate the BARs if the BIOS-assigned allocation was invalid. After updating GRUB and restarting, I thought this had fixed those messages.
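For the record, the switch I had read about is the `pci=realloc` kernel parameter, added via the usual GRUB edit (a sketch; the exact existing contents of the line will vary per system):

```shell
# /etc/default/grub  (append to the existing parameter list)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"
```

Then `sudo update-grub` and a reboot for it to take effect.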

I'm not certain how many install histories are included in the bug reporting tool; is it possible that these messages are from a timeframe relevant to an earlier install?

Those lines I quoted are timestamped 31st December and came from /var/log/kern.log, so should be easy to check on a reboot.

Edit: Just checked and there are multiple entries through 2025-01-01 the same.

Thank you again for the careful look. I did a thorough purge of all CUDA and NVIDIA packages and verified the drivers were no longer listed in lsmod.

apt search nvidia revealed an nvidia-driver-assistant package, which, once run, pointed me to the cuda-driver metapackage. This installed properly, but I still see the errors about mapping multiple BARs, along with a sanity-check warning about the request vs. the PCI bus window:

2025-01-02T05:04:01.196250+00:00 gronk kernel: resource: resource sanity check: requesting [mem 0x00000000f0700000-0x00000000f16fffff], which spans more than PCI Bus 0000:01 [mem 0xf0000000-0xf0ffffff]
2025-01-02T05:04:01.196268+00:00 gronk kernel: caller os_map_kernel_space+0x120/0x130 [nvidia] mapping multiple BARs
2025-01-02T05:04:01.223027+00:00 gronk kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1516)
2025-01-02T05:04:01.223042+00:00 gronk kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

So I believe the avenue you are suggesting is the right one. I need to look at the platform more closely and/or test on a more likely qualified platform.
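For anyone hitting the same wall, the failure is easy to spot in the logs. A minimal grep over the kind of NVRM lines quoted earlier in this thread (sample data inlined here so the check is self-contained):

```shell
# Sample lines, in the format quoted earlier in the thread.
cat > /tmp/kern-sample.log <<'EOF'
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
EOF

# Count zero-sized BARs; anything above 0 means the BIOS never assigned the windows.
grep -c 'BAR[0-9] is 0M' /tmp/kern-sample.log
```

On a live system, the same grep over `/var/log/kern.log` (or `sudo dmesg`) does the check.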

Best regards.

Good luck. For what it’s worth, this document lists qualified hardware from the M40 era, but does not include that particular card.

I would not purchase one on the off chance it works, but having a few model names may let you test, if you already have access to any.