Hey!
I’m trying to install the CUDA 11.1.1 driver from the runfile on an HPE Apollo 6500 Gen10 with an A100 GPU running Ubuntu 18.04 (I also tried 20.04 and CUDA 11.3, with the same results). I followed the CUDA installation guide, but the installation fails with the following error (from nvidia-installer.log):
[ 4531.187983] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:88:00.0)
[ 4531.187985] NVRM: The system BIOS may have misconfigured your GPU.
[ 4531.187996] nvidia: probe of 0000:88:00.0 failed with error -1
I first checked that 64-bit BAR was enabled in the BIOS, and it was. I also confirmed this by looking at dmesg (dmesg.log) and verifying that some of the memory regions assigned on the GPU's PCI bus (0000:88) lie above the 32-bit address range. However, I also noticed some errors in the kernel messages:
[ 1.064330] pci 0000:88:00.0: BAR 1: assigned [mem 0xd4000000000-0xd4fffffffff 64bit pref]
[ 1.064339] pci 0000:88:00.0: BAR 8: assigned [mem 0xd5000000000-0xd5fffffffff 64bit pref]
[ 1.064343] pci 0000:88:00.0: BAR 3: assigned [mem 0xd6000000000-0xd6001ffffff 64bit pref]
[ 1.064352] pci 0000:88:00.0: BAR 10: assigned [mem 0xd6002000000-0xd6021ffffff 64bit pref]
[ 1.064356] pci 0000:88:00.0: BAR 0: no space for [mem size 0x01000000]
[ 1.064357] pci 0000:88:00.0: BAR 0: failed to assign [mem size 0x01000000]
[ 1.064359] pci 0000:88:00.0: BAR 7: no space for [mem size 0x00400000]
[ 1.064361] pci 0000:88:00.0: BAR 7: failed to assign [mem size 0x00400000]
[ 1.064363] pci 0000:88:00.0: BAR 0: no space for [mem size 0x01000000]
[ 1.064365] pci 0000:88:00.0: BAR 0: failed to assign [mem size 0x01000000]
[ 1.064367] pci 0000:88:00.0: BAR 7: no space for [mem size 0x00400000]
[ 1.064369] pci 0000:88:00.0: BAR 7: failed to assign [mem size 0x00400000]
I don’t know too much about memory assignments for PCI devices, but I have a feeling those messages shouldn’t be there.
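To pull the per-BAR outcome out of a longer dmesg dump, a small Python sketch along these lines might help. The regular expression is an assumption based only on the line format quoted above, and `summarize_bars` is a hypothetical helper name, not part of any tool.

```python
import re

# Pattern assumed from the kernel messages quoted above, e.g.
#   pci 0000:88:00.0: BAR 0: failed to assign [mem size 0x01000000]
BAR_LINE = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): (?P<status>assigned|failed to assign|no space for) "
)

def summarize_bars(dmesg_text, device):
    """Return {bar_number: "assigned" | "failed"} using the last message seen per BAR."""
    result = {}
    for line in dmesg_text.splitlines():
        m = BAR_LINE.search(line)
        if m and m.group("dev") == device:
            ok = m.group("status") == "assigned"
            result[int(m.group("bar"))] = "assigned" if ok else "failed"
    return result

# A few of the lines from dmesg.log, copied from the post.
SAMPLE = """\
[ 1.064330] pci 0000:88:00.0: BAR 1: assigned [mem 0xd4000000000-0xd4fffffffff 64bit pref]
[ 1.064343] pci 0000:88:00.0: BAR 3: assigned [mem 0xd6000000000-0xd6001ffffff 64bit pref]
[ 1.064356] pci 0000:88:00.0: BAR 0: no space for [mem size 0x01000000]
[ 1.064357] pci 0000:88:00.0: BAR 0: failed to assign [mem size 0x01000000]
"""

print(summarize_bars(SAMPLE, "0000:88:00.0"))
```

Because the kernel may retry an assignment several times, the sketch keeps only the last message per BAR, which is what matters for whether the driver can map the region.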
After looking around on this forum and elsewhere, I tried updating GRUB to boot the kernel with pci=nocrs and pcie_aspm=off, since others seemed to have had success with those options. However, after rebooting with those options enabled, I got the following error (from nvidia-installer_2.log):
[ 425.546947] NVRM: The NVIDIA GPU 0000:88:00.0
NVRM: (PCI ID: 10de:20f1) installed in this system has
NVRM: fallen off the bus and is not responding to commands.
[ 425.547051] nvidia: probe of 0000:88:00.0 failed with error -1
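To rule out the possibility that the two kernel options were not actually applied after the reboot, the running kernel's command line can be checked. Below is a minimal Python sketch, assuming a Linux system where `/proc/cmdline` is readable; `missing_kernel_params` and `read_cmdline` are hypothetical helper names, and the example command line is made up for illustration.

```python
from pathlib import Path

def missing_kernel_params(cmdline, wanted):
    """Return the parameters from `wanted` that do not appear in `cmdline`."""
    present = set(cmdline.split())
    return [p for p in wanted if p not in present]

def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text().strip()

# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro pci=nocrs pcie_aspm=off"
print(missing_kernel_params(example, ["pci=nocrs", "pcie_aspm=off"]))
```

On the actual machine, `missing_kernel_params(read_cmdline(), ["pci=nocrs", "pcie_aspm=off"])` should return an empty list if both options took effect.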
As far as I can tell from other posts on this forum, this indicates some sort of hardware problem (e.g. insufficient power or a faulty PCI port). The dmesg output looks better, though. It still contains a lot of failed to assign messages, but in the end the assignments seem to succeed (dmesg_2.log):
[ 0.992353] pci 0000:88:00.0: BAR 1: assigned [mem 0xd000000000-0xdfffffffff 64bit pref]
[ 0.992362] pci 0000:88:00.0: BAR 8: assigned [mem 0xe000000000-0xefffffffff 64bit pref]
[ 0.992367] pci 0000:88:00.0: BAR 3: assigned [mem 0xc800000000-0xc801ffffff 64bit pref]
[ 0.992375] pci 0000:88:00.0: BAR 10: assigned [mem 0xc802000000-0xc821ffffff 64bit pref]
[ 0.992379] pci 0000:88:00.0: BAR 0: assigned [mem 0xe1000000-0xe1ffffff]
[ 0.992383] pci 0000:88:00.0: BAR 7: assigned [mem 0xe2000000-0xe23fffff]
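As a sanity check on these ranges, the following Python sketch computes the size of each assigned region and whether it starts above the 4 GiB boundary. The regular expression is an assumption based on the log format above, and `bar_ranges` is a hypothetical helper name.

```python
import re

FOUR_GIB = 1 << 32

# Matches lines such as:
#   pci 0000:88:00.0: BAR 0: assigned [mem 0xe1000000-0xe1ffffff]
RANGE = re.compile(r"BAR (\d+): assigned \[mem 0x([0-9a-f]+)-0x([0-9a-f]+)")

def bar_ranges(dmesg_text):
    """Return {bar: {"size": bytes, "above_4g": bool}} for assigned BARs."""
    out = {}
    for bar, start, end in RANGE.findall(dmesg_text):
        start, end = int(start, 16), int(end, 16)
        out[int(bar)] = {"size": end - start + 1, "above_4g": start >= FOUR_GIB}
    return out

# Lines from dmesg_2.log, copied from the post.
SAMPLE = """\
[ 0.992353] pci 0000:88:00.0: BAR 1: assigned [mem 0xd000000000-0xdfffffffff 64bit pref]
[ 0.992379] pci 0000:88:00.0: BAR 0: assigned [mem 0xe1000000-0xe1ffffff]
[ 0.992383] pci 0000:88:00.0: BAR 7: assigned [mem 0xe2000000-0xe23fffff]
"""

for bar, info in sorted(bar_ranges(SAMPLE).items()):
    print(bar, info["size"] // (1 << 20), "MiB", "above 4 GiB:", info["above_4g"])
```

In this output, BAR 0 and BAR 7 land below 4 GiB, which is consistent with the log not marking them as 64bit, while the sizes match the 0x01000000 and 0x00400000 requests that failed in the first boot.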
My troubleshooting knowledge more or less stops here, and I don’t know what else to try. Does anyone have any ideas about what could be wrong? Or should I assume there is some sort of hardware issue?
Thanks in advance!