NVIDIA A100 won't bind to drivers on SuperMicro M11SDV-8C+-LN4F

Hello,

I am having trouble getting an NVIDIA A100 to work on a SuperMicro M11SDV-8C+-LN4F server board. I get the “NVRM: This PCI I/O region assigned to your NVIDIA device is invalid: NVRM: BAR0 is 0M @ 0x0 (PCI:0000:05:00.0)” error whenever I boot or attempt to use NVIDIA-SMI, which reports that the driver is not active because it cannot bind to any device. The card does show up in lspci:

05:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev ff)

After many hours of searching, I have tried the following remedies without success:

  1. Ensuring that Above 4G Decoding is enabled (it’s enabled by default)
  2. pci=realloc
  3. pci=realloc=off
  4. pci=nocrs
  5. Trying the above kernel parameters with a PCIe rescan
  6. Ensuring that the OS is installed in EFI mode
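
For anyone repeating step 5, here is a rough sketch of one way to trigger a PCIe rescan through sysfs from Python. The original post does not say how the rescan was done, so this is only an assumption about the approach: it writes to the standard Linux `remove` and `rescan` nodes under `/sys/bus/pci`, needs root, and the function name `remove_and_rescan` and its `sysfs_root` parameter are made up for illustration.

```python
from pathlib import Path


def remove_and_rescan(bdf: str, sysfs_root: Path = Path("/sys/bus/pci")) -> bool:
    """Drop one PCI device from the kernel's view, then rescan the bus.

    bdf is the full address as shown by `lspci -D`, e.g. "0000:05:00.0".
    Returns True if the device directory exists after the rescan.
    Must be run as root on a real system.
    """
    device_dir = sysfs_root / "devices" / bdf
    remove_node = device_dir / "remove"

    # Writing "1" to <device>/remove detaches the device from the kernel.
    if remove_node.exists():
        remove_node.write_text("1")

    # Writing "1" to /sys/bus/pci/rescan re-enumerates all PCI buses.
    (sysfs_root / "rescan").write_text("1")

    return device_dir.exists()
```

Calling `remove_and_rescan("0000:05:00.0")` as root and then checking `dmesg` for the NVRM message shows whether the rescan changed anything.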

I have also tried every solution above with RHEL 9, Rocky 9, Rocky 8, Alma 9, and Ubuntu Server 22.04.2.

Here’s the bug report dump: nvidia-bug-report.log.gz (51.8 KB)

Any guidance would be much appreciated. Alternatively, any recommendations for mini-itx form factor server boards that are known to work with the A100 would be very welcome.

Thanks in advance,
James

I’ve updated the BIOS to the most recent version and the problem persists.

It has nothing to do with resource mapping. The NVIDIA GPU is simply turned off, so its PCI config space reads back as all 0xff (note the “rev ff” in your lspci output), which triggers the error message. Please check that it is properly seated in its PCIe slot and that all power connectors are connected.
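
If you want to confirm the all-0xff state yourself, a minimal sketch along these lines should do it. It assumes the standard Linux sysfs file `/sys/bus/pci/devices/<address>/config`; the helper names `looks_powered_off` and `read_config` are hypothetical, not part of any NVIDIA tool.

```python
from pathlib import Path

SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def looks_powered_off(config: bytes) -> bool:
    """True if the standard 64-byte config header is all 0xff.

    A device that is not responding on the bus reads back 0xff for
    every byte, including vendor ID, device ID and revision.
    """
    header = config[:64]
    return len(header) > 0 and all(byte == 0xFF for byte in header)


def read_config(bdf: str) -> bytes:
    """Read raw config space for a device such as "0000:05:00.0"."""
    return (SYSFS_PCI_DEVICES / bdf / "config").read_bytes()
```

On the affected machine, `looks_powered_off(read_config("0000:05:00.0"))` would be expected to return True while the card is unresponsive. A healthy NVIDIA device starts its config space with the bytes `de 10` (vendor ID 0x10de, little-endian).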

Thank you for the swift response. I’ve verified that the card is properly seated in the PCIe slot and that the power connectors are attached. Are there any other issues known to cause this behavior? I will continue to do some more troubleshooting in the meantime.

Thanks for your support,
James

The A100 doesn’t have its own fans, so it needs the server to provide airflow through it; otherwise it will overheat and shut down.

After some more testing we’ve determined that this is the issue.

Thanks for the support,
James
