Attempting to get this workhorse up. I’m getting NVRM RmInitAdapter failed warnings when nvidia-drm reports loading the driver. After some initial work finding an appropriate-era driver (I’m using 470-server), blacklisting nouveau, and working through some addressing issues, I have arrived at the current roadblock: inability to init the adapter(s).
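For reference, the nouveau blacklist is the usual modprobe.d snippet (the filename is arbitrary), followed by an initramfs rebuild so it takes effect at boot:
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# rebuild the initramfs so the blacklist applies at boot (Debian/Ubuntu)
sudo update-initramfs -u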
Dmesg:
nvidia 0000:09:00.0: enabling device (0140 → 0142)
[ 7.221771] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.256.02 Thu May 2 14:37:44 UTC 2024
[ 7.420133] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 470.256.02 Thu May 2 14:50:40 UTC 2024
[ 7.434686] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 7.439371] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0xffff:667)
[ 7.439434] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 7.442522] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 7.442833] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to register device
nvidia-smi:
jason@gronk:/etc/modprobe.d$ nvidia-smi
No devices were found
If I set the nvidia-drm modeset option to 0 in nvidia-graphics-drivers-kms.conf, the dmesg errors go away, but I still get no devices in clinfo/nvidia-smi. nvidia-bug-report.log.gz (143.2 KB)
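For clarity, the line being toggled is the standard modprobe option for the nvidia-drm module:
# /etc/modprobe.d/nvidia-graphics-drivers-kms.conf
# 0 disables kernel mode setting for nvidia-drm; 1 enables it
options nvidia-drm modeset=0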
Looking at your bug report, one sees a not uncommon situation where the system is unable to map the required memory BAR regions:
2024-12-31T22:37:14.259302+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259302+00:00 gronk kernel: NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259303+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259303+00:00 gronk kernel: NVRM: BAR2 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259309+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259310+00:00 gronk kernel: NVRM: BAR3 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259311+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259311+00:00 gronk kernel: NVRM: BAR4 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259312+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
2024-12-31T22:37:14.259312+00:00 gronk kernel: NVRM: BAR5 is 0M @ 0x0 (PCI:0000:01:00.0)
2024-12-31T22:37:14.259336+00:00 gronk kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
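You can check what the firmware/kernel actually assigned with lspci; on a working system each Region line (one per BAR) shows a non-zero address and the expected size, e.g. a large prefetchable region for BAR1:
# substitute your card’s bus address as needed
sudo lspci -vv -s 0000:01:00.0 | grep -i region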
There was a post here a while ago, unresolved unfortunately:
Note: in the reference I made to the Microway link in the above post, I said “BAR1 requirement is shown as 16MB”. It should be 16GB. Unfortunately I can no longer edit that post.
Here’s a post where Robert covers some things to check, but often vendor-specific BIOS tweaks are required, hence Tesla cards normally being supplied in qualified servers.
If you do have a qualified system, then perhaps the BIOS settings need checking.
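A quick way to confirm whether the firmware left the BARs unassigned is to read the device’s resource file; each line gives start, end, and flags for one BAR, and all zeros means nothing was assigned. With Above 4G Decoding enabled (which a 16GB BAR1 effectively requires), the large BAR should land above the 4GB boundary:
# one line per BAR: start end flags; "0x0 0x0 0x0" = unassigned
cat /sys/bus/pci/devices/0000:01:00.0/resource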
Thank you for these comments. I certainly do not have a qualified system, so I will have to keep that as a possibility once I’ve eliminated any other known issues. I had read that a command-line switch was necessary to allow the kernel to reallocate the BARs assigned by the BIOS if they were invalid; I thought this had fixed these messages after updating GRUB and restarting.
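For reference, the switch I had read about was (I believe) pci=realloc, added via GRUB roughly as follows:
# /etc/default/grub -- keep whatever options were already present
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc"

# regenerate the grub config, then reboot
sudo update-grub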
I’m not certain how many install histories are included in the bug reporting tool. Is it possible that these messages are from a timeframe relevant to an earlier install?
Thank you again for the careful look. I did a thorough purge of all CUDA and NVIDIA packages and verified the drivers were no longer listed in lsmod.
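The purge and verification amounted to roughly this (the package patterns may need adjusting for a given install):
sudo apt-get purge '^nvidia-.*' '^libnvidia-.*' '^cuda.*'
sudo apt-get autoremove
# confirm nothing is still loaded
lsmod | grep -i nvidia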
apt search nvidia revealed an nvidia-driver-assistant package which, once run, pointed me to the cuda-driver metapackage. This installed properly, but I still see the errors about mapping multiple BARs, along with a sanity check of the request against the PCI bus window:
2025-01-02T05:04:01.196250+00:00 gronk kernel: resource: resource sanity check: requesting [mem 0x00000000f0700000-0x00000000f16fffff], which spans more than PCI Bus 0000:01 [mem 0xf0000000-0xf0ffffff]
2025-01-02T05:04:01.196268+00:00 gronk kernel: caller os_map_kernel_space+0x120/0x130 [nvidia] mapping multiple BARs
2025-01-02T05:04:01.223027+00:00 gronk kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x72:1516)
2025-01-02T05:04:01.223042+00:00 gronk kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
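If I’m reading that sanity check right, the driver’s mapping request crosses the window the BIOS programmed on the bridge for bus 01; /proc/iomem shows the window and what sits inside it (root is needed to see real addresses):
sudo grep -A 4 'PCI Bus 0000:01' /proc/iomem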
So I believe the avenue you are suggesting is the right one. I need to look at the platform more closely and test on a platform that’s more likely to be qualified.