I can’t get A5000 cards to work with a Supermicro X12DPG-OA6 board. dmesg shows:
[ 37.010271] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:4f:00.0)
[ 37.010350] nvidia: probe of 0000:4f:00.0 failed with error -1
[ 37.010449] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:52:00.0)
[ 37.010538] nvidia: probe of 0000:52:00.0 failed with error -1
[ 37.010658] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:56:00.0)
[ 37.010736] nvidia: probe of 0000:56:00.0 failed with error -1
[ 37.010853] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:57:00.0)
[ 37.010936] nvidia: probe of 0000:57:00.0 failed with error -1
[ 37.011009] NVRM: The NVIDIA probe routine failed for 4 device(s).
The fresh boot dmesg log is in the attachment.
I configured the BIOS as suggested in several topics:
Disable Secure Boot
Disable CSM
Enable “Above 4G Decoding”
I’ve tried reinstalling the OS (both Ubuntu Server 18.04 and Ubuntu Desktop 18.04), but the problem persists.
It’s a very common problem with PCI resource allocation, i.e. the sizes and address ranges of the memory windows (BARs) that a PCI device requests. The BIOS assigns them initially, but sometimes incorrectly or in a way the kernel can’t use, so booting with pci=realloc lets the kernel reassign the regions.
The pci=realloc parameter is needed for the kernel to work properly with your mainboard, so it has to stay in place permanently, unless a BIOS update is released that fixes the allocation.
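To make the parameter permanent on Ubuntu, it is usually added to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub. Below is a minimal sketch of that edit as a Python function operating on the file's text. The function name add_kernel_param is hypothetical, and it assumes the line uses double quotes, as in a stock Ubuntu install.

```python
import re


def add_kernel_param(grub_text, param="pci=realloc"):
    """Return grub_text with `param` present in GRUB_CMDLINE_LINUX_DEFAULT.

    Hypothetical helper; assumes the value is wrapped in double quotes.
    """
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def _append(match):
        # Split the existing options and add the parameter only if missing.
        existing = match.group(2).split()
        if param not in existing:
            existing.append(param)
        return match.group(1) + " ".join(existing) + match.group(3)

    new_text, count = pattern.subn(_append, grub_text)
    if count == 0:
        # No such line in the file: append one at the end.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += 'GRUB_CMDLINE_LINUX_DEFAULT="{}"\n'.format(param)
    return new_text


if __name__ == "__main__":
    sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(sample))
```

After writing the result back to /etc/default/grub (as root), run sudo update-grub and reboot. cat /proc/cmdline should then show pci=realloc. Editing the file by hand works just as well; the sketch only shows what the line should end up looking like.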
The ERR! state can be triggered either by overheating memory (which I doubt, looking at the temperature of the working GPU) or by the nvidia-persistenced daemon not being configured to start on boot. Please check for that.
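A quick way to check the daemon is to ask systemd whether the unit is enabled and running. The sketch below assumes the unit is named nvidia-persistenced, which may differ depending on how the driver was packaged; service_state is a hypothetical helper name.

```python
import subprocess


def service_state(unit="nvidia-persistenced"):
    """Return (enabled, active) as reported by systemctl.

    The unit name is an assumption; adjust it to match your driver package.
    Returns "unknown" when systemctl is missing or prints nothing.
    """
    def query(verb):
        try:
            result = subprocess.run(
                ["systemctl", verb, unit],
                capture_output=True,
                text=True,
                check=False,  # non-zero exit just means "not enabled/active"
            )
        except FileNotFoundError:
            return "unknown"
        return result.stdout.strip() or "unknown"

    return query("is-enabled"), query("is-active")


if __name__ == "__main__":
    enabled, active = service_state()
    print("enabled:", enabled)
    print("active: ", active)
```

If the first value comes back as disabled, sudo systemctl enable --now nvidia-persistenced should start the daemon and keep it starting on boot, assuming the unit exists under that name.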