CUDA driver installation issues with A100 on HPE Apollo 6500

boye.borg · June 2, 2021, 9:12am

Hey!
I’m trying to install the CUDA 11.1.1 driver from the runfile on an HPE Apollo 6500 Gen10 with an A100 GPU, running Ubuntu 18.04 (I also tried 20.04 and CUDA 11.3, with the same result). I have followed the CUDA installation guide, but the installation fails with the following error (from nvidia-installer.log):

[ 4531.187983] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR0 is 0M @ 0x0 (PCI:0000:88:00.0)
[ 4531.187985] NVRM: The system BIOS may have misconfigured your GPU.
[ 4531.187996] nvidia: probe of 0000:88:00.0 failed with error -1

I first checked that 64-bit BAR was enabled in the BIOS, and it was. I also confirmed this in dmesg (dmesg.log) by verifying that some of the memory regions assigned to the GPU on PCI bus 0000:88 lie above the 32-bit address range. However, I also noticed some errors in the kernel messages:

[    1.064330] pci 0000:88:00.0: BAR 1: assigned [mem 0xd4000000000-0xd4fffffffff 64bit pref]
[    1.064339] pci 0000:88:00.0: BAR 8: assigned [mem 0xd5000000000-0xd5fffffffff 64bit pref]
[    1.064343] pci 0000:88:00.0: BAR 3: assigned [mem 0xd6000000000-0xd6001ffffff 64bit pref]
[    1.064352] pci 0000:88:00.0: BAR 10: assigned [mem 0xd6002000000-0xd6021ffffff 64bit pref]
[    1.064356] pci 0000:88:00.0: BAR 0: no space for [mem size 0x01000000]
[    1.064357] pci 0000:88:00.0: BAR 0: failed to assign [mem size 0x01000000]
[    1.064359] pci 0000:88:00.0: BAR 7: no space for [mem size 0x00400000]
[    1.064361] pci 0000:88:00.0: BAR 7: failed to assign [mem size 0x00400000]
[    1.064363] pci 0000:88:00.0: BAR 0: no space for [mem size 0x01000000]
[    1.064365] pci 0000:88:00.0: BAR 0: failed to assign [mem size 0x01000000]
[    1.064367] pci 0000:88:00.0: BAR 7: no space for [mem size 0x00400000]
[    1.064369] pci 0000:88:00.0: BAR 7: failed to assign [mem size 0x00400000]

I don’t know too much about memory assignments for PCI devices, but I have a feeling those messages shouldn’t be there.
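
For anyone comparing their own logs, here is a minimal Python sketch that pulls these messages out of a saved dmesg dump. It assumes the message format shown above; the function name unassigned_bars and the regular expression are illustrative, not part of any NVIDIA or kernel tool. It keeps only the last outcome per BAR, because the kernel may retry and succeed later in the same boot.

```python
import re

# Matches kernel lines such as
# "pci 0000:88:00.0: BAR 0: failed to assign [mem size 0x01000000]" and
# "pci 0000:88:00.0: BAR 1: assigned [mem 0xd4000000000-0xd4fffffffff 64bit pref]".
BAR_EVENT = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): (?P<what>failed to assign|assigned) "
    r"\[mem (?P<detail>[^\]]*)\]"
)


def unassigned_bars(dmesg_text):
    """Return {(device, bar): size_in_bytes} for BARs whose last event was a failure."""
    last_event = {}
    for line in dmesg_text.splitlines():
        m = BAR_EVENT.search(line)
        if m:
            # Later lines overwrite earlier ones, so retries are handled.
            last_event[(m["dev"], int(m["bar"]))] = (m["what"], m["detail"])

    failed = {}
    for key, (what, detail) in last_event.items():
        if what == "failed to assign":
            # For failures the detail looks like "size 0x01000000".
            failed[key] = int(detail.split()[-1], 16)
    return failed


SAMPLE = """\
[    1.064330] pci 0000:88:00.0: BAR 1: assigned [mem 0xd4000000000-0xd4fffffffff 64bit pref]
[    1.064356] pci 0000:88:00.0: BAR 0: no space for [mem size 0x01000000]
[    1.064357] pci 0000:88:00.0: BAR 0: failed to assign [mem size 0x01000000]
[    1.064359] pci 0000:88:00.0: BAR 7: no space for [mem size 0x00400000]
[    1.064361] pci 0000:88:00.0: BAR 7: failed to assign [mem size 0x00400000]
"""

if __name__ == "__main__":
    for (dev, bar), size in sorted(unassigned_bars(SAMPLE).items()):
        print(f"{dev} BAR {bar}: {size // 2**20} MiB could not be placed")
```

On the excerpt above this reports BAR 0 (16 MiB) and BAR 7 (4 MiB) as unplaced, which lines up with the driver's "BAR0 is 0M @ 0x0" complaint.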

After looking around on this forum and other places on the great internet, I tried to update grub to boot the kernel with pci=nocrs and pcie_aspm=off, since it seemed others had success with those options. However, after rebooting with those options enabled, I got the following error (from nvidia-installer_2.log):

[  425.546947] NVRM: The NVIDIA GPU 0000:88:00.0
               NVRM: (PCI ID: 10de:20f1) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[  425.547051] nvidia: probe of 0000:88:00.0 failed with error -1
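
A side note on the boot options: on Ubuntu they normally go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. To confirm that they actually reached the running kernel, /proc/cmdline can be checked. The Python sketch below is one way to do that; the helper name missing_kernel_params is made up for illustration, and the handling of comma-separated values (such as pci=nocrs,realloc) is an assumption about how the options might have been combined.

```python
def missing_kernel_params(cmdline, wanted):
    """Return the entries of `wanted` that do not appear on the kernel command line."""
    present = set()
    for token in cmdline.split():
        present.add(token)
        if "=" in token:
            # Expand "pci=nocrs,realloc" into "pci=nocrs" and "pci=realloc".
            key, value = token.split("=", 1)
            for part in value.split(","):
                present.add(f"{key}={part}")
    return [param for param in wanted if param not in present]


WANTED = ["pci=nocrs", "pcie_aspm=off"]

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            running = handle.read()
    except OSError:
        running = ""  # not on Linux, or /proc is unavailable
    missing = missing_kernel_params(running, WANTED)
    print("missing options:", ", ".join(missing) if missing else "none")
```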

As far as I could understand from other posts on this forum, this indicates some sort of hardware problem (e.g. insufficient power or a faulty PCIe slot). The dmesg output looks better, though. It still contains a lot of "failed to assign" messages, but in the end the kernel does seem to manage all of the assignments (dmesg_2.log):

[    0.992353] pci 0000:88:00.0: BAR 1: assigned [mem 0xd000000000-0xdfffffffff 64bit pref]
[    0.992362] pci 0000:88:00.0: BAR 8: assigned [mem 0xe000000000-0xefffffffff 64bit pref]
[    0.992367] pci 0000:88:00.0: BAR 3: assigned [mem 0xc800000000-0xc801ffffff 64bit pref]
[    0.992375] pci 0000:88:00.0: BAR 10: assigned [mem 0xc802000000-0xc821ffffff 64bit pref]
[    0.992379] pci 0000:88:00.0: BAR 0: assigned [mem 0xe1000000-0xe1ffffff]
[    0.992383] pci 0000:88:00.0: BAR 7: assigned [mem 0xe2000000-0xe23fffff]
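
To make these ranges easier to read, here is a rough Python sketch that converts an "assigned" line into a size and says whether the region sits above the 4 GiB (32-bit) boundary. It assumes the log format shown above; describe_bar is a hypothetical helper name.

```python
import re

# Matches e.g. "BAR 0: assigned [mem 0xe1000000-0xe1ffffff]".
BAR_ASSIGNED = re.compile(
    r"BAR (?P<bar>\d+): assigned "
    r"\[mem 0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
)


def describe_bar(line):
    """Return BAR index, size in bytes and 4 GiB placement, or None if no match."""
    m = BAR_ASSIGNED.search(line)
    if m is None:
        return None
    start = int(m["start"], 16)
    end = int(m["end"], 16)
    return {
        "bar": int(m["bar"]),
        "size": end - start + 1,      # the end address is inclusive
        "above_4g": start >= 2**32,   # True if it needs 64-bit addressing
    }


LINES = [
    "pci 0000:88:00.0: BAR 1: assigned [mem 0xd000000000-0xdfffffffff 64bit pref]",
    "pci 0000:88:00.0: BAR 3: assigned [mem 0xc800000000-0xc801ffffff 64bit pref]",
    "pci 0000:88:00.0: BAR 0: assigned [mem 0xe1000000-0xe1ffffff]",
    "pci 0000:88:00.0: BAR 7: assigned [mem 0xe2000000-0xe23fffff]",
]

if __name__ == "__main__":
    for line in LINES:
        info = describe_bar(line)
        where = "above" if info["above_4g"] else "below"
        print(f"BAR {info['bar']}: {info['size'] // 2**20} MiB, {where} 4 GiB")
```

With the lines above, BAR 0 comes out as 16 MiB and BAR 7 as 4 MiB, both below 4 GiB. Those are the same sizes (0x01000000 and 0x00400000) that the kernel could not place during the first boot.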

My troubleshooting knowledge more or less stops here, and I don’t know what else to try. Does anyone have any idea what could be wrong? Or should I assume there is some sort of hardware issue?

Thanks in advance!

boye.borg · June 16, 2021, 7:53am

We just tried using RHEL 8 instead of Ubuntu, and that seems to work fine. The only “quirk” was that the runfile had to be installed with the --no-drm flag. So I guess there might be some kernel support problems on Ubuntu for the HPE Apollo 6500?

carlos.perez2 · September 14, 2021, 2:14pm

@boye.borg did you find a fix? We are running into the same issue with the same hardware.