Hardware:
CPU: Intel® Xeon® Gold 6240 * 2
GPU: RTX 3090 * 4
Memory: 256 GB
Basic System Information:
OS Version: CentOS Linux release 7.9.2009
Linux Kernel Version: 3.10.0-1160.11.1.el7.x86_64
NVIDIA Driver Version: 455.23.04 (I also tried some newer driver versions, but that didn't help; the same issue occurs)
Our Linux server has crashed and rebooted many times in the past month. Each time it crashed, we were not running any deep learning training program; that is, no process was using the GPU.
We checked the server's hardware logs and found no hardware problems.
We then debugged with the crash utility and found that the crashes may be caused by the NVIDIA driver.
We used the following commands to inspect the crash information:
sudo crash /usr/lib/debug/lib/modules/3.10.0-1160.11.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2021-02-02-21:28:04/vmcore
bt
result:
NOTE: We also tried newer versions of the NVIDIA driver, but the server crashed even more frequently with them (between 2021.02.01 and 2021.02.09).
For example, with NVIDIA driver version 460.xxx.xxx:
sudo crash /usr/lib/debug/lib/modules/3.10.0-1160.11.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2021-02-01-21:52:31/vmcore
result:
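In case it helps with diagnosis, the same information can probably be collected without an interactive session. The following is only a minimal sketch, assuming the vmlinux and vmcore paths from the first command above. The file names crash_cmds.txt and crash_report.txt are hypothetical, and the extra crash commands (sys, log, mod) are our guess at what would be useful beyond bt.

```shell
#!/bin/sh
# Sketch: dump basic crash information from a vmcore non-interactively.
# The default paths are the ones used in this post; override them via the environment.
VMLINUX="${VMLINUX:-/usr/lib/debug/lib/modules/3.10.0-1160.11.1.el7.x86_64/vmlinux}"
VMCORE="${VMCORE:-/var/crash/127.0.0.1-2021-02-02-21:28:04/vmcore}"
CMDS="${CMDS:-crash_cmds.txt}"    # hypothetical file name
OUT="${OUT:-crash_report.txt}"    # hypothetical file name

# Commands for crash to execute, one per line:
#   sys  - system summary, including the panic message
#   bt   - backtrace of the task that was running at the time of the panic
#   log  - kernel ring buffer (dmesg) saved in the dump
#   mod  - loaded kernel modules (shows whether nvidia modules were loaded)
cat > "$CMDS" <<'EOF'
sys
bt
log
mod
quit
EOF

# Only run crash when the tool and the dump are actually available.
if command -v crash >/dev/null 2>&1 && sudo test -r "$VMCORE"; then
    # -s suppresses the startup banner, -i runs the commands from the file.
    sudo crash -s -i "$CMDS" "$VMLINUX" "$VMCORE" > "$OUT" 2>&1
    echo "wrote $OUT"
else
    echo "crash or vmcore not available; commands saved in $CMDS"
fi
```

Setting VMCORE to the second dump (127.0.0.1-2021-02-01-21:52:31) and OUT to a different file name would give a comparable report for the 460.xxx.xxx driver.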
Please help us solve this problem.