Hello Everyone,
I am looking for some help. I ended up getting 4 x A100 SXM4 (PG506) GPUs for my XE8545 and did the install myself. The install seems fine and the cards are detected by Ubuntu Server, but after installing the driver, nvidia-smi does not show any devices. Here are the full install log, debug log, and configurations. I have tried multiple driver versions and even a fresh Ubuntu install. Please let me know if you know how to solve this problem.
Here is the bug report log. I can only attach one media item per post, so I will post the rest of the material down at the bottom.
nvidia-bug-report.log.gz (252.4 KB)
Here is my DMESG
lspci

nvidia-smi

/etc/modprobe.d/blacklist-nvidia-nouveau.conf

/etc/modprobe.d/nvidia.conf

I found this in the journalctl output:
Mar 23 05:05:36 hpcserver kernel: [ 175.299660] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x30:0x56:1113)
Mar 23 05:05:36 hpcserver kernel: [ 175.299928] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Mar 23 05:05:36 hpcserver kernel: [ 175.529527] NVRM: GPU 0000:81:00.0: RmInitAdapter failed! (0x30:0x56:1113)
Mar 23 05:05:36 hpcserver kernel: [ 175.529702] NVRM: GPU 0000:81:00.0: rm_init_adapter failed, device minor number 2
Mar 23 05:05:36 hpcserver kernel: [ 175.775610] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x30:0x56:1113)
Mar 23 05:05:36 hpcserver kernel: [ 175.775848] NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 1
Mar 23 05:05:36 hpcserver kernel: [ 176.005267] NVRM: GPU 0000:c1:00.0: RmInitAdapter failed! (0x30:0x56:1113)
Mar 23 05:05:36 hpcserver kernel: [ 176.005448] NVRM: GPU 0000:c1:00.0: rm_init_adapter failed, device minor number 3
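For anyone comparing against their own logs, here is a rough sketch for listing which GPUs hit the RmInitAdapter failure. The function name failed_gpus is just something I made up for illustration; it assumes the log lines look like the ones above.

```shell
# Hypothetical helper: list the PCI addresses of GPUs whose RmInitAdapter
# call failed. Reads kernel log text on stdin, one message per line.
failed_gpus() {
    grep 'RmInitAdapter failed' \
        | sed -n 's/.*NVRM: GPU \([0-9a-fA-F:.]*\): RmInitAdapter failed.*/\1/p' \
        | sort -u
}

# Usage (journalctl -k prints kernel messages from the current boot):
#   journalctl -k --no-pager | failed_gpus
```

With the lines above, this should list all four GPUs, which at least suggests the failure is not specific to one card or slot.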
---- Semi-fix I found
I had to downgrade my OS to Ubuntu Server 20.04 LTS. After that I installed the kernel headers:
sudo apt-get install build-essential gcc-multilib dkms linux-headers-$(uname -r)
I blacklisted the nouveau driver
and then downloaded the NVIDIA driver from the NVIDIA website, using the generic Linux x64 .run file instead of the Ubuntu 20.04-specific .deb one:
/NVIDIA-Linux-x86_64-450.51.05.run
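For the nouveau blacklisting step mentioned above, a minimal sketch of what such a file typically contains. The two config lines are the commonly used ones and are an assumption here, not necessarily a copy of my exact file; write_nouveau_blacklist is a made-up helper name.

```shell
# Hypothetical helper: write a nouveau blacklist file.
# The default target is the file path named earlier in this post; pass a
# different path as the first argument to try it out without root.
write_nouveau_blacklist() {
    target="${1:-/etc/modprobe.d/blacklist-nvidia-nouveau.conf}"
    # Assumed (typical) contents: stop nouveau from loading and disable
    # its kernel mode setting.
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$target"
}

# Dry run into a scratch file:
#   write_nouveau_blacklist /tmp/blacklist-test.conf
# On the real system, run it as root, then rebuild the initramfs and reboot:
#   update-initramfs -u
```

The initramfs rebuild matters because nouveau can otherwise still be loaded from the old image early in boot.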
Now the NVLink and the GPUs seem to show up, but the memory reported on the A100s looks wrong! I am trying to fix this issue now and will hopefully update with a solution.
It turns out the most important factor here is that the A100s we ended up buying have a very old VBIOS version: 92.00.19.00.01
Only some driver versions even support this VBIOS, for example 450.51.06 (Linux) / 451.82 (Windows) and the previous version. When I install this one, it still gets the memory wrong. Still a WIP.
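Once the driver loads far enough for nvidia-smi to list the cards, the VBIOS, driver version, and reported memory can be put side by side per GPU. This is only a sketch: summarize_gpus is a made-up name, and it assumes the comma-plus-space separated output of nvidia-smi's csv,noheader format.

```shell
# Hypothetical helper: turn "index, vbios, driver, memory" CSV rows (read
# from stdin) into one readable line per GPU.
summarize_gpus() {
    awk -F', ' '{ printf "GPU %s: vbios=%s driver=%s memory=%s\n", $1, $2, $3, $4 }'
}

# Usage on the server:
#   nvidia-smi --query-gpu=index,vbios_version,driver_version,memory.total \
#              --format=csv,noheader | summarize_gpus
```

That makes it easy to see whether all four cards report the same VBIOS and whether the wrong memory size shows up on every card or only some.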
After a lot of back and forth, that is the only driver I can get to show anything, and even then it only shows partial results. For some reason it doesn't recognize
I can't seem to get Fabric Manager to run at all, even though I have an HGX Redstone board.