I am experimenting with a couple of recently obtained A100s on an Asus B250 motherboard running CentOS 8. After extensive troubleshooting (disabling CSM in UEFI; Above 4G Decoding was already enabled, etc.), I still got BAR errors in dmesg, such as:
BAR 1: no space for [mem size 0x1000000000 64bit pref]
BAR 1: failed to assign [mem size 0x1000000000 64bit pref]
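For scale, the size in those messages can be decoded with a few lines of Python. This is only a sketch: the helper name `bar_size_gib` and the regular expression are my own, and it assumes the lines look exactly like the two above.

```python
import re

# Hypothetical helper (not part of any tool): pull the hex size out of a
# dmesg-style "BAR n: ... [mem size 0x...]" line and convert it to GiB.
SIZE_RE = re.compile(r"\[mem size (0x[0-9a-fA-F]+)")


def bar_size_gib(line: str) -> float:
    match = SIZE_RE.search(line)
    if match is None:
        raise ValueError(f"no BAR size found in: {line!r}")
    # int(..., 16) accepts the "0x" prefix; 2**30 bytes is one GiB.
    return int(match.group(1), 16) / 2**30


if __name__ == "__main__":
    sample = "BAR 1: no space for [mem size 0x1000000000 64bit pref]"
    print(f"{bar_size_gib(sample):.0f} GiB")
```

So each card is asking for a 64 GiB prefetchable 64-bit window for BAR 1.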
I eventually got it working with the kernel parameter pci=nocrs. With it, nvidia-smi and our CUDA test program happily recognize the cards. I read online that pci=nocrs was the default behavior in Linux kernels before version 3. As I haven't loaded our real training tasks yet (still many days away), are there any known drawbacks to using the pci=nocrs option?
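To confirm the option is actually in effect after a reboot, something like the following could be used. It is a rough sketch: the function name `has_pci_option` is made up for illustration, and it assumes the option shows up in /proc/cmdline either as pci=nocrs or as part of a comma-separated pci= list.

```python
from pathlib import Path


def has_pci_option(cmdline: str, option: str) -> bool:
    # Hypothetical helper: the kernel command line is whitespace-separated,
    # and "pci=" may carry several comma-separated options.
    for token in cmdline.split():
        if token.startswith("pci="):
            if option in token[len("pci="):].split(","):
                return True
    return False


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        active = has_pci_option(proc_cmdline.read_text(), "nocrs")
        print("pci=nocrs active:", active)
```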
Thanks in advance.
Ben