A100 on a CentOS 7 server gets removed after a couple of minutes


Hi,

I installed an A100 in a Supermicro server (PCIe 4 x16) that also has a Gigabyte RTX 3090. I am using the latest driver (495.29.05) with CUDA 11.5 on CentOS 7.9. Everything works like a charm right after boot, but after a couple of minutes the A100 disappears from the system and the RTX 3090 shows an “ERR!” status in nvidia-smi. Running dmesg shows the following:

[ 41.060441] nvidia 0000:41:00.0: irq 373 for MSI/MSI-X
[ 41.060457] nvidia 0000:41:00.0: irq 374 for MSI/MSI-X
[ 41.060469] nvidia 0000:41:00.0: irq 375 for MSI/MSI-X
[ 41.060480] nvidia 0000:41:00.0: irq 376 for MSI/MSI-X
[ 41.060491] nvidia 0000:41:00.0: irq 377 for MSI/MSI-X
[ 41.060503] nvidia 0000:41:00.0: irq 378 for MSI/MSI-X
[ 41.816838] nvidia 0000:43:00.0: irq 379 for MSI/MSI-X
[ 352.069565] nvidia 0000:41:00.0: irq 373 for MSI/MSI-X
[ 352.069581] nvidia 0000:41:00.0: irq 374 for MSI/MSI-X
[ 352.069593] nvidia 0000:41:00.0: irq 375 for MSI/MSI-X
[ 352.069606] nvidia 0000:41:00.0: irq 376 for MSI/MSI-X
[ 352.069618] nvidia 0000:41:00.0: irq 377 for MSI/MSI-X
[ 352.069637] nvidia 0000:41:00.0: irq 378 for MSI/MSI-X
[ 352.711748] nvidia 0000:43:00.0: irq 379 for MSI/MSI-X
[ 532.828430] pciehp 0000:40:01.1:pcie004: Slot(6): Link Down
[ 532.829101] iommu: Removing device 0000:41:00.0 from group 34

More info:

  • cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module 495.29.05 Thu Sep 30 16:00:29 UTC 2021
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

  • uname -a

Linux localhost.localdomain 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Any idea what is causing this problem? Should I start considering installing a newer OS like Ubuntu 20.04?

Thank you very much in advance,

Best regards,
Miguel

OK, got it. It is related to overheating. I was checking the temperature with nvidia-smi: it rises to 95 C, and then the card gets disconnected.

Just in case somebody reads this entry, please check the temperature.

while true; do sleep 1; nvidia-smi >> output.txt; done

Check output.txt once your GPU disappears.
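If the full nvidia-smi output is too noisy to read through, a more compact variant is sketched below. It is only a sketch under some assumptions: the function names `log_gpu_temps` and `peak_temps` and the file name `gpu_temps.csv` are made up for this example, and the parsing assumes the usual ", " separator that nvidia-smi uses in CSV mode. It logs one CSV line per GPU per second and then reports the highest temperature seen for each GPU.

```shell
#!/bin/sh
# log_gpu_temps FILE (hypothetical helper): append one CSV line per GPU
# per second to FILE: timestamp, GPU index, name, temperature in C.
log_gpu_temps() {
    nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu \
               --format=csv,noheader,nounits -l 1 >> "$1"
}

# peak_temps FILE (hypothetical helper): print the highest temperature
# logged for each GPU index. Assumes fields are separated by ", ".
# Non-numeric readings (for example an error string) count as 0.
peak_temps() {
    awk -F', ' '
        {
            if (!($2 in max) || $4 + 0 > max[$2]) {
                max[$2] = $4 + 0
                name[$2] = $3
            }
        }
        END {
            for (i in max)
                printf "GPU %s (%s): peak %d C\n", i, name[i], max[i]
        }
    ' "$1" | sort
}
```

Something like `log_gpu_temps gpu_temps.csv &` right after boot, followed by `peak_temps gpu_temps.csv` once the card has dropped off the bus, should show whether the temperature was climbing just before the link went down.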

Have a look at this: A100 crashes within 10 minutes due to over-heating on Ubuntu 18.04 (without any workload) - #6 by generix
