On Ubuntu 18.04, I can’t get the NVIDIA DKMS driver to load. I’ve tried various driver versions from 460 to 515. The problem seems to be that the kernel is unable to allocate PCI memory regions for the GPUs. I think the first related error message in /var/log/syslog is:
device has non-compliant BARs; disabling IO/MEM decoding
While I’m not certain this refers to one of the four A100s in this system, it seems likely. If it does, it would explain why the driver doesn’t load, since the Linux kernel won’t attempt to allocate memory for those devices. Subsequent error messages:
can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window
pci 0000:31:00.0: BAR 6: failed to assign [mem size 0x00100000 pref]
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:17:00.0)
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:ca:00.0)
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: None of the NVIDIA devices were initialized.
nvidia-nvlink: Unregistered Nvlink Core, major device number 236
These messages aren’t consecutive, exhaustive, or in any particular order. I’m worried that the NVIDIA A100 card may not be compatible with the Dell R750xa system board (even though we purchased the system from Dell with the GPUs).
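Since the messages above are scattered through the log, grouping them by PCI address may make it easier to see which devices are affected. The following is only a rough sketch: the function name `group_bar_messages` and both regular expressions are my own assumptions based on the lines quoted above, not anything provided by the kernel or the NVIDIA driver.

```python
import re
from collections import defaultdict

# PCI address as it appears in both kernel lines ("pci 0000:31:00.0:")
# and NVRM lines ("(PCI:0000:17:00.0)").
PCI_ADDR = re.compile(r"\b([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\b")
# Matches "BARs", "BAR0", "BAR 6" and similar.
BAR_HINT = re.compile(r"\bBAR(?:s| ?\d+)\b")


def group_bar_messages(lines):
    """Group BAR-related log lines by the PCI address they mention.

    Lines that mention a BAR but no PCI address are filed under "unknown".
    """
    groups = defaultdict(list)
    for line in lines:
        if not BAR_HINT.search(line):
            continue
        match = PCI_ADDR.search(line)
        key = match.group(1).lower() if match else "unknown"
        groups[key].append(line.strip())
    return dict(groups)


if __name__ == "__main__":
    try:
        with open("/var/log/syslog", errors="replace") as log:
            grouped = group_bar_messages(log)
    except OSError as exc:
        grouped = {}
        print(f"could not read syslog: {exc}")
    for address, messages in sorted(grouped.items()):
        print(address)
        for message in messages:
            print("   ", message)
```

Run as root (or as a user who can read /var/log/syslog), this prints each PCI address followed by the BAR-related lines that mention it.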
Having looked through some similar forum posts for suggestions …
- The system doesn’t appear to have a configurable Secure Boot option, so I am assuming it is off
- The BIOS is fully updated
- nvidia-persistenced was failing to run, but now appears to be OK:
# systemctl status nvidia-persistenced
nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2022-08-05 14:53:37 CDT; 2 days ago
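Another check that might help is reading the BAR assignments straight from sysfs rather than inferring them from log messages. The sketch below assumes the standard Linux layout, where /sys/bus/pci/devices/<address>/vendor holds the vendor ID (0x10de for NVIDIA) and the `resource` file holds one "start end flags" line in hex per region, with the first six lines corresponding to BARs 0 to 5. The helper names `parse_resource` and `bar_sizes` are hypothetical, and reading an all-zero line as "nothing assigned" is a heuristic, because a BAR the device doesn’t implement looks the same.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"
PCI_DEVICES = Path("/sys/bus/pci/devices")


def parse_resource(text):
    """Parse the contents of a sysfs `resource` file into (start, end, flags) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not "start end flags"
        entries.append(tuple(int(field, 16) for field in fields))
    return entries


def bar_sizes(entries, bar_count=6):
    """Return {BAR index: size in bytes} for the first `bar_count` entries.

    An all-zero entry is reported as size 0 (unimplemented or not assigned).
    """
    sizes = {}
    for index, (start, end, flags) in enumerate(entries[:bar_count]):
        empty = start == 0 and end == 0 and flags == 0
        sizes[index] = 0 if empty else end - start + 1
    return sizes


if __name__ == "__main__":
    devices = sorted(PCI_DEVICES.iterdir()) if PCI_DEVICES.is_dir() else []
    for device in devices:
        try:
            vendor = (device / "vendor").read_text().strip()
            resource = (device / "resource").read_text()
        except OSError:
            continue
        if vendor != NVIDIA_VENDOR_ID:
            continue
        print(device.name)
        for index, size in bar_sizes(parse_resource(resource)).items():
            print(f"    BAR{index}: {size // (1024 * 1024)} MiB")
```

A GPU whose BAR0 and BAR1 both show 0 MiB here would be consistent with the "BAR0 is 0M @ 0x0" and "BAR1 is 0M @ 0x0" messages from NVRM.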
Any thoughts on what I could try next?