A100 on a CentOS 7 server gets removed after couple of minutes

miguelamda · December 19, 2021, 5:41pm

Hi,

I installed an A100 in a Supermicro server (PCIe 4 x16) that also has a Gigabyte RTX3090. I am using the latest driver (495.29.05) on CentOS 7.9 with CUDA 11.5. Everything works like a charm right after booting, but after a couple of minutes the A100 disappears from the system, and the RTX3090 shows an “ERR!” status in nvidia-smi. Running dmesg shows the following:

[ 41.060441] nvidia 0000:41:00.0: irq 373 for MSI/MSI-X
[ 41.060457] nvidia 0000:41:00.0: irq 374 for MSI/MSI-X
[ 41.060469] nvidia 0000:41:00.0: irq 375 for MSI/MSI-X
[ 41.060480] nvidia 0000:41:00.0: irq 376 for MSI/MSI-X
[ 41.060491] nvidia 0000:41:00.0: irq 377 for MSI/MSI-X
[ 41.060503] nvidia 0000:41:00.0: irq 378 for MSI/MSI-X
[ 41.816838] nvidia 0000:43:00.0: irq 379 for MSI/MSI-X
[ 352.069565] nvidia 0000:41:00.0: irq 373 for MSI/MSI-X
[ 352.069581] nvidia 0000:41:00.0: irq 374 for MSI/MSI-X
[ 352.069593] nvidia 0000:41:00.0: irq 375 for MSI/MSI-X
[ 352.069606] nvidia 0000:41:00.0: irq 376 for MSI/MSI-X
[ 352.069618] nvidia 0000:41:00.0: irq 377 for MSI/MSI-X
[ 352.069637] nvidia 0000:41:00.0: irq 378 for MSI/MSI-X
[ 352.711748] nvidia 0000:43:00.0: irq 379 for MSI/MSI-X
[ 532.828430] pciehp 0000:40:01.1:pcie004: Slot(6): Link Down
[ 532.829101] iommu: Removing device 0000:41:00.0 from group 34

More info:

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module 495.29.05 Thu Sep 30 16:00:29 UTC 2021
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

uname -a

Linux localhost.localdomain 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Any idea what is causing this problem? Should I start considering installing a newer OS like Ubuntu 20.04?

Thank you very much in advance,

Best regards,
Miguel

miguelamda · December 20, 2021, 6:49pm

OK, got it. It is related to overheating. I was checking the temperature with nvidia-smi: it rises to 95C, and then the card gets disconnected.

Just in case somebody reads this entry, please check the temperature.

while true; do sleep 1; nvidia-smi >> output.txt; done

Check output.txt once your GPU disappears.
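
If the full nvidia-smi dump is too noisy to read, a more compact variant of the same loop might look like the sketch below. It logs only the timestamp, GPU index, name and temperature as CSV, and then reports the highest temperature seen per GPU. The function names and the default file name gpu_temps.csv are made up for illustration, and this assumes a driver whose nvidia-smi supports --query-gpu (the one used in this thread should).

```shell
#!/usr/bin/env bash
# Sketch only: function names and the log file name are illustrative.

# Append one CSV line per GPU per second: timestamp, index, name, temperature (C).
# nvidia-smi is re-invoked on every iteration (as in the loop above), so logging
# continues even after a GPU drops off the bus. Stop it with Ctrl-C.
log_gpu_temps() {
    local logfile="${1:-gpu_temps.csv}"
    while true; do
        nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu \
                   --format=csv,noheader,nounits >> "$logfile" 2>&1
        sleep 1
    done
}

# Read such a log on stdin and print the highest temperature seen per GPU index.
# Lines without a numeric temperature (e.g. error messages) are skipped.
peak_gpu_temps() {
    awk -F', ' '
        $4 ~ /^[0-9]+$/ && (!($2 in max) || $4 + 0 > max[$2]) {
            max[$2] = $4 + 0
            name[$2] = $3
        }
        END {
            for (i in max) printf "GPU %s (%s): %d C\n", i, name[i], max[i]
        }
    ' | sort
}
```

Usage: run log_gpu_temps in one terminal until the GPU disappears, then run peak_gpu_temps < gpu_temps.csv to see how hot each card got before the link went down.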

Have a look at this: A100 crashes within 10 minutes due to over-heating on Ubuntu 18.04 (without any workload) - #6 by generix

system · January 3, 2022, 6:50pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
