NVIDIA GPU driver crashes consistently on CentOS 7 with RTX 3090

CPU: Intel® Xeon® Gold 6240 * 2
GPU: RTX 3090 * 4
Memory: 256G

Basic System Information:
OS Version: CentOS Linux release 7.9.2009
Linux Kernel Version: 3.10.0-1160.11.1.el7.x86_64
NVIDIA Driver Version: 455.23.04 (I also tried some newer driver versions, but they did not help; the same issue occurs)

Our Linux server has crashed and rebooted many times in the past month. Each time it crashed, we were not running any deep learning training programs; that is, no processes were using the GPU.

We also checked the server's hardware logs and found no hardware problems.

We then debugged with the crash utility and found that the crash may be caused by the NVIDIA driver.

We used the following command to inspect the crash information:

sudo crash /usr/lib/debug/lib/modules/3.10.0-1160.11.1.el7.x86_64/vmlinux /var/crash/
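For anyone reproducing this step, here is a minimal sketch of how the dump file could be located before running crash. It assumes kdump's default layout, where each dump is saved as /var/crash/<host>-<timestamp>/vmcore. The helper name latest_vmcore is hypothetical and not part of any tool. The script only prints the command and does not run it.

```shell
#!/bin/sh
# Hypothetical helper (not part of kdump or crash): print the newest
# vmcore file found one directory level below the given crash directory.
# Assumption: kdump's default layout /var/crash/<host>-<timestamp>/vmcore.
latest_vmcore() {
    crash_dir="${1:-/var/crash}"
    # ls -t lists newest first; kdump directory names contain no whitespace
    ls -t "$crash_dir"/*/vmcore 2>/dev/null | head -n 1
}

# Debug vmlinux for the kernel version reported in this post
VMLINUX="/usr/lib/debug/lib/modules/3.10.0-1160.11.1.el7.x86_64/vmlinux"

core="$(latest_vmcore /var/crash)"
if [ -n "$core" ]; then
    # Print the command rather than running it, since crash needs root
    echo "sudo crash $VMLINUX $core"
else
    echo "no vmcore found under /var/crash"
fi
```

Once inside the crash session, `bt` prints the backtrace of the task that panicked and `log` prints the kernel message buffer, which usually shows which module was executing at the time of the crash.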

bt result:

NOTE: we also tried newer versions of the NVIDIA driver, but the server crashed more frequently (between 2021.02.01 and 2021.02.09).

For example, with NVIDIA driver version 460.xxx.xxx:

sudo crash /usr/lib/debug/lib/modules/3.10.0-1160.11.1.el7.x86_64/vmlinux /var/crash/

The result:

Please help us solve this problem.

@NVES @amrits @TomK

Hi @bindog,

Unfortunately, I am not a technical resource. I suggest submitting this as a bug.



Community Manager NVIDIA Developer Forums | NVIDIA