I have the same problem. I have a new A100 80GB PCIe in my server.
I have tried the NVIDIA-Linux-x86_64-470.57.02.run and NVIDIA-Linux-x86_64-470.82.01.run as well as the .deb file.
But every time I run nvidia-smi, I get a kernel crash and see this:
@adrien.paris the logs show “gpu fallen off the bus” messages. Combined with the server rebooting (which only the mainboard initiates), this points to insufficient power. Please check your PSU. Furthermore, nvidia-persistenced needs to be configured to start on boot, otherwise the GPUs will misbehave (GPU usage without processes, stuck in P0, crashes).
@Phoenix007 please provide a dmesg output from a clean boot.
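For anyone checking their own logs for the symptom mentioned above, here is a minimal sketch in Python. It only filters saved dmesg or syslog text for the “fallen off the bus” phrase. The function name `bus_drop_lines` is made up for this example, and the match is case-insensitive on the assumption that the exact wording may vary between driver versions.

```python
import re

# Case-insensitive, on the assumption that capitalization of the driver
# message can differ between driver versions.
PATTERN = re.compile(r"fallen off the bus", re.IGNORECASE)


def bus_drop_lines(log_text):
    """Return the log lines that mention a GPU falling off the bus."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]
```

Pass it the text of a saved log (for example the contents of a file written with `dmesg > boot.log`). An empty list only means the phrase was not found in that text, not that the power supply is fine.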
[ 2.135828] pci 0000:ca:00.0: BAR 8: no space for [mem size 0x1400000000 64bit pref]
[ 2.135830] pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
[ 2.135832] pci 0000:ca:00.0: BAR 10: no space for [mem size 0x28000000 64bit pref]
[ 2.135834] pci 0000:ca:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[ 2.135836] pci 0000:ca:00.0: BAR 7: no space for [mem size 0x00500000]
[ 2.135838] pci 0000:ca:00.0: BAR 7: failed to assign [mem size 0x00500000]
Please enable “above 4G decoding”/“large/64bit BAR” in the BIOS.
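To see why that setting matters here, it helps to convert the sizes in the dmesg lines above: 0x1400000000 bytes is 80 GiB, which cannot fit in the address space below 4 GiB. The following is a rough sketch in Python that pulls the “failed to assign” lines out of dmesg text and converts the sizes. The function name `failed_bars` and the regular expression are my own, written only against the line format shown in this thread, so other kernel versions may need adjustments.

```python
import re

# Matches kernel lines of the form seen in this thread, e.g.
#   pci 0000:ca:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
FAILED_BAR = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): failed to assign "
    r"\[mem size (?P<size>0x[0-9a-f]+)(?P<flags>[^\]]*)\]"
)

FOUR_GIB = 1 << 32


def failed_bars(dmesg_text):
    """Return one dict per 'failed to assign' BAR line found in dmesg_text."""
    results = []
    for m in FAILED_BAR.finditer(dmesg_text):
        size = int(m.group("size"), 16)
        results.append({
            "device": m.group("dev"),
            "bar": int(m.group("bar")),
            "size_bytes": size,
            "size_mib": size // (1 << 20),
            "is_64bit": "64bit" in m.group("flags"),
            "exceeds_4gib": size > FOUR_GIB,
        })
    return results
```

Run on the six lines posted above, this returns three entries for 0000:ca:00.0: BAR 8 at 81920 MiB (80 GiB), BAR 10 at 640 MiB and BAR 7 at 5 MiB. Only BAR 8 is larger than 4 GiB.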
It’s the latest BIOS provided by DELL. I also tried kernel 5.11.0-27, which is mentioned in the compatibility list of the CUDA toolkit. The error seems to be the same.
Yes, I’ve checked that setting. There are only 3 options: 56TB (default), 12TB and 512GB.
After choosing the last one, the board doesn’t detect the GPU. I got this message in the iDRAC log:
The data communication with the device video.slot.7-1 (GPU Controller in Slot 7 of Instance 1) is lost.
Options 56TB and 12TB work the same way: the board shows the GPU, but the NVIDIA driver crashes.