SOLVED: Can't get the video driver to load for A100s installed in Dell R750xa

Using Ubuntu 18.04, I can’t get the NVIDIA DKMS device driver to load. I’ve tried various versions of the driver ranging from 460 to 515. The problem seems to be that the system is unable to allocate memory for the driver. I think the first related error message I see in /var/log/syslog is:

device has non-compliant BARs; disabling IO/MEM decoding

While I’m not certain this refers to one of the 4 A100s in this system, it seems likely. If so, that would fully explain why the driver doesn’t load, since the Linux kernel won’t attempt to allocate memory for these devices. Subsequent error messages:

can't claim BAR 6 [mem 0xfffc0000-0xffffffff pref]: no compatible bridge window

pci 0000:31:00.0: BAR 6: failed to assign [mem size 0x00100000 pref]

NVRM: BAR0 is 0M @ 0x0 (PCI:0000:17:00.0)
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:ca:00.0)
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:

NVRM: None of the NVIDIA devices were initialized.
nvidia-nvlink: Unregistered Nvlink Core, major device number 236

These aren’t consecutive, exhaustive, or in any particular order. I’m worried that perhaps the Nvidia A100 card isn’t compatible with the Dell R750xa system board (even though we purchased the system from Dell with GPUs).
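For anyone trying to check whether they are hitting the same symptoms, here is a minimal sketch that pulls the BAR-related lines out of a log file. The patterns are taken only from the messages quoted above, so they are not an exhaustive list; the function name find_bar_errors is made up for this example, and the default path assumes the log lives in /var/log/syslog as it does on this system.

```shell
# Hypothetical helper: print log lines that look like the PCI BAR
# assignment failures quoted in this post.
# $1 = log file to search (defaults to /var/log/syslog).
find_bar_errors() {
    log_file="${1:-/var/log/syslog}"
    # 'can.t' matches "can't" without needing to escape the apostrophe.
    grep -E 'non-compliant BARs|can.t claim BAR|BAR [0-9]+: failed to assign|NVRM: BAR[0-9]+ is 0M' "$log_file"
}
```

Running find_bar_errors with no argument searches /var/log/syslog; pass another path to search a rotated or copied log.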

Having looked through some similar forum posts for suggestions …

  • The system doesn’t appear to have a configurable Secure Boot setting, so I am assuming it is off
  • The BIOS is fully updated
  • nvidia-persistenced was failing to run, but now appears to be OK
# systemctl status nvidia-persistenced
   nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2022-08-05 14:53:37 CDT; 2 days ago

Any thoughts on what I could try next?

Hi @pgoetz1 and welcome to the NVIDIA developer forums!

One suggestion I would have is to check your BIOS for settings regarding “Above 4G Decoding” and also search for that topic here in the forums. This is a known configuration issue that would cause exactly the error messages you see.

If that does not help, I would highly recommend contacting Dell support to clarify any compatibility doubts you might have.

As a side question out of interest, how was nvidia-persistenced failing before?

Hopefully you get your issues resolved!

Hi Markus – Thanks for the tips! I’ve already contacted Dell, updated the firmware, and checked to make sure that reallocating PCIe device memory to high memory is enabled (without this, the PERC controller device driver won’t load and you pretty quickly notice that you don’t have any disks).

One of my colleagues stumbled across a solution posted to a Dell forum. I will write it up and mark this thread as solved.


The solution was to set the following kernel command line parameter:

pci=realloc=off

This works with both the stock kernel on Ubuntu 18.04 and the HWE kernel. I only tested it with the nvidia-driver-515-server package supplied by Ubuntu, but I strongly suspect it will work with other driver versions as well.

To save a potential reader time, here are the steps:

Edit /etc/default/grub and append the parameter above to GRUB_CMDLINE_LINUX_DEFAULT, for example:

GRUB_CMDLINE_LINUX_DEFAULT="systemd.log_default=debug  pci=realloc=off"
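If you would rather script the edit than do it by hand, something along these lines should work. This is only a sketch: the function name add_kernel_param is invented for this example, it assumes GRUB_CMDLINE_LINUX_DEFAULT sits on a single double-quoted line, and it assumes the parameter contains no characters that are special to sed or grep (true for pci=realloc=off). It leaves a .bak copy of the file next to the original.

```shell
# Hypothetical helper: append a kernel parameter to
# GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# $1 = parameter, e.g. pci=realloc=off
# $2 = grub defaults file (defaults to /etc/default/grub)
add_kernel_param() {
    param="$1"
    grub_file="${2:-/etc/default/grub}"

    # Do nothing if the parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*${param}" "$grub_file"; then
        return 0
    fi

    # Re-write the quoted value with the parameter appended;
    # -i.bak keeps a backup copy as <file>.bak.
    sed -i.bak "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"|" "$grub_file"
}
```

Run as root, add_kernel_param pci=realloc=off would modify /etc/default/grub in place. Check the resulting line before regenerating the GRUB configuration.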

Update grub:

# update-grub

Reboot. After this the drivers load as expected, nvidia-smi produces results, the nvidia-persistenced service loads properly, etc.
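After the reboot, it is worth confirming that the parameter actually reached the running kernel before looking at the driver. A minimal sketch, assuming the usual /proc/cmdline interface; the function name has_kernel_param is made up for this example:

```shell
# Hypothetical helper: succeed if the given parameter appears as a
# whole token on the kernel command line.
# $1 = parameter, e.g. pci=realloc=off
# $2 = command-line file (defaults to /proc/cmdline)
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put each token on its own line, then require an exact fixed-string match.
    tr ' ' '\n' < "$cmdline_file" | grep -Fxq -- "$param"
}
```

For example, has_kernel_param pci=realloc=off && nvidia-smi only runs nvidia-smi if the parameter is active. If the check fails, the GRUB configuration was probably not regenerated or a different boot entry was used.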


That is great news! Thank you for sharing the solution!

And have fun with your new GPU workstation!
