GPU failing to initialize

Hi there,

I’m trying to run an A100 under Ubuntu 20.04 in a Dell server, but it currently does not show up in nvidia-smi. The setup looks like this:

  • NVIDIA A100 PCIe 40 GB
  • Ubuntu 20.04 Desktop
  • Linux kernel 5.4.0-65-lowlatency
  • NVIDIA driver: 510.108.03

When I run nvidia-smi, it reports:

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

dmesg shows some errors:

NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:65:00.0)
nvidia: probe of 0000:65:00.0 failed with error -1
NVRM: The NVIDIA probe routine failed for 1 device(s).
NVRM: None of the NVIDIA devices were initialized.
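For anyone who wants to scan a longer dmesg dump for the same symptom, here is a minimal Python sketch. It assumes every BAR line follows the same format as the one quoted above ("NVRM: BAR<n> is <size>M @ <address> (PCI:<device>)"); the function name find_unassigned_bars is only illustrative and not part of any NVIDIA tool.

```python
import re

# Assumed line format, taken from the dmesg excerpt above:
#   NVRM: BAR1 is 0M @ 0x0 (PCI:0000:65:00.0)
BAR_RE = re.compile(
    r"NVRM: BAR(?P<bar>\d+) is (?P<size>\d+)M @ (?P<addr>0x[0-9a-fA-F]+) "
    r"\(PCI:(?P<dev>[0-9a-fA-F:.]+)\)"
)


def find_unassigned_bars(dmesg_text):
    """Return (device, bar_index) pairs whose reported size or address is zero."""
    bad = []
    for m in BAR_RE.finditer(dmesg_text):
        size_mb = int(m.group("size"))
        addr = int(m.group("addr"), 16)
        if size_mb == 0 or addr == 0:
            bad.append((m.group("dev"), int(m.group("bar"))))
    return bad


sample = "NVRM: BAR1 is 0M @ 0x0 (PCI:0000:65:00.0)"
print(find_unassigned_bars(sample))
```

A BAR reported with zero size at address 0x0 means no address window was assigned to that region of the device, which is what the probe failure above is about.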

This seems to be quite a common error, but none of the solutions suggested on the forum fixed it for me.

I tried to:

  • switch to the latest Linux kernel
  • switch to the latest NVIDIA driver
  • set kernel parameters: pci=realloc and pci=realloc=off
  • enable addressing above 4 GB in the BIOS
  • change PCIe slots
  • disable secure boot
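Since a boot parameter only matters if the running kernel actually received it, a quick sanity check is to look at the kernel command line. Below is a minimal Python sketch, assuming the command line is available as a string (on a live Linux system it can be read from /proc/cmdline); the helper name pci_options is hypothetical.

```python
def pci_options(cmdline):
    """Collect the comma-separated values of every pci= parameter."""
    opts = []
    for token in cmdline.split():
        # Only tokens spelled exactly "pci=..." are PCI options;
        # a misspelled prefix does not match here.
        if token.startswith("pci="):
            opts.extend(token[len("pci="):].split(","))
    return opts


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Illustrative command line, not taken from the affected machine.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro pci=realloc=off quiet"
print(pci_options(example))
```

On the affected machine, pci_options(read_cmdline()) would list whatever pci= options the booted kernel was started with.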

I have attached the nvidia-bug-report log. Any hints would be much appreciated.

Thanks so much.
nvidia-bug-report.log.gz (382.3 KB)