Hello
We are running AI/ML applications on a Dell PowerEdge R730 with CentOS Linux 7.9 and CUDA 11.5.1 (driver version 495.29.05).
The server is equipped with two Tesla P100 cards:
03:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
82:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
About twice a month the server crashes with a kernel panic and writes a memory dump (vmcore).
The crashes happen under heavy AI/ML job load, and we found that the nvidia kernel module is the cause.
Each crash occurs while a command that uses the nvidia module is running. We use nvidia-smi to monitor GPU activity, and most crashes happen during an nvidia-smi call, but they can happen with any other software that uses the NVIDIA GPUs. This time it happened during a conda command.
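To illustrate the kind of nvidia-smi monitoring described above, here is a minimal Python sketch that polls nvidia-smi and syncs every sample to disk, so the last GPU state before a panic can be read back after the reboot. It is an illustrative sketch, not the monitoring actually in use on the server; the log file name gpu_poll.log, the function names, and the choice of query fields are assumptions made for the example.

```python
import csv
import io
import os
import subprocess

# Fields passed to nvidia-smi --query-gpu (chosen for this example).
FIELDS = ["timestamp", "index", "temperature.gpu", "utilization.gpu", "memory.used"]


def build_command():
    """Return the nvidia-smi invocation that prints one CSV row per GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_samples(output):
    """Turn nvidia-smi CSV output into a list of dicts keyed by field name."""
    reader = csv.reader(io.StringIO(output), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]


def log_once(path="gpu_poll.log"):  # hypothetical log file name
    """Run one poll, append it to the log, and force it to disk."""
    result = subprocess.run(
        build_command(),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    with open(path, "a") as log:
        log.write(result.stdout)
        log.flush()
        os.fsync(log.fileno())  # flush to disk so the sample survives a panic
    return parse_samples(result.stdout)
```

Calling log_once() from a timer or cron job would append one row per GPU per poll.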
crash /usr/lib/debug/lib/modules/3.10.0-1160.62.1.el7.x86_64/vmlinux vmcore
[…]
DATE: Tue Aug 1 09:00:49 2023
UPTIME: 15 days, 12:42:02
LOAD AVERAGE: 12.07, 5.42, 3.50
TASKS: 8789
NODENAME: slhdg002
RELEASE: 3.10.0-1160.62.1.el7.x86_64
VERSION: #1 SMP Tue Apr 5 16:57:59 UTC 2022
MACHINE: x86_64 (2199 Mhz)
MEMORY: 511.9 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000048"
PID: 1959
COMMAND: "conda"
TASK: ffff8d791ae6b180 [THREAD_INFO: ffff8d7c3b304000]
CPU: 19
STATE: TASK_RUNNING (PANIC)
cat vmcore-dmesg.txt
[…]
[1341716.439797] CPU: 19 PID: 1959 Comm: conda Kdump: loaded Tainted: P OE ------------ T 3.10.0-1160.62.1.el7.x86_64 #1
[…]
[1341716.439972] Call Trace:
[1341716.440122] [] ? _nv034134rm+0x162/0x2f0 [nvidia]
[1341716.440270] [] ? _nv032925rm+0x13f/0x210 [nvidia]
[1341716.440433] [] ? _nv032925rm+0x10e/0x210 [nvidia]
[…]
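To see at a glance which module the frames of a call trace belong to, the trace lines can be tallied per module. The following is a minimal Python sketch, not part of the original analysis. The function names are hypothetical, and the regular expression assumes the frame layout shown above; the address between the square brackets is treated as optional because it is empty in this paste. Frames without a trailing [module] tag are counted under the assumed label "vmlinux". format_uptime converts a dmesg timestamp in seconds into the days and hh:mm:ss form that crash prints.

```python
import re
from collections import Counter

# Matches lines such as
#   [1341716.440122] [] ? _nv034134rm+0x162/0x2f0 [nvidia]
# The <address> inside the second pair of brackets is optional.
FRAME_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?:<[0-9a-f]+>)?\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_frames(lines):
    """Extract stack frames from dmesg lines; non-frame lines are skipped."""
    frames = []
    for line in lines:
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append({
                "timestamp": float(match.group("ts")),
                "symbol": match.group("symbol"),
                # Assumed label for frames that carry no [module] tag.
                "module": match.group("module") or "vmlinux",
                # A leading "?" marks a frame the unwinder could not verify.
                "reliable": match.group("unreliable") is None,
            })
    return frames


def frames_by_module(lines):
    """Count stack frames per kernel module."""
    return Counter(frame["module"] for frame in parse_frames(lines))


def format_uptime(seconds):
    """Format seconds since boot as 'D days, HH:MM:SS'."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return "{} days, {:02d}:{:02d}:{:02d}".format(days, hours, minutes, secs)
```

For the three frames quoted above, frames_by_module returns a count of 3 for nvidia, and format_uptime(1341716.439797) gives "15 days, 12:41:56", which is within a few seconds of the UPTIME reported by crash.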
We tried workarounds from the NVIDIA forum and from the web, without success:
- checked for IRQ conflicts (I will attach the output of lspci -vvv)
- blacklisted the nouveau driver
- set the kernel parameter pci=realloc=off
- checked dkms status
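Two of the workarounds in this list can be verified on the running system. The sketch below is a minimal Python example with hypothetical function names. It assumes that the PCI reallocation setting in the list is the kernel boot parameter pci=realloc=off, and that a blacklisted nouveau driver should simply be absent from /proc/modules. The IRQ and dkms checks are not covered.

```python
def has_pci_option(cmdline, option):
    """Return True if `option` appears in a pci= parameter of the kernel command line.

    Several pci options can share one parameter, e.g. pci=nocrs,realloc=off.
    """
    for token in cmdline.split():
        if token.startswith("pci=") and option in token[len("pci="):].split(","):
            return True
    return False


def module_loaded(proc_modules, name):
    """Return True if `name` is the first field of any line of /proc/modules text."""
    return any(
        line.split()[0] == name
        for line in proc_modules.splitlines()
        if line.strip()
    )


def read_text(path):
    """Read a small text file such as /proc/cmdline."""
    with open(path) as handle:
        return handle.read()


def check_host():
    """Report the state of both workarounds on the running system."""
    cmdline = read_text("/proc/cmdline")
    modules = read_text("/proc/modules")
    return {
        "pci_realloc_off": has_pci_option(cmdline, "realloc=off"),
        "nouveau_loaded": module_loaded(modules, "nouveau"),
        "nvidia_loaded": module_loaded(modules, "nvidia"),
    }
```

On a host where both workarounds are active, check_host() would be expected to report pci_realloc_off as True and nouveau_loaded as False.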
I was able to run nvidia-bug-report.sh while the server was up and stable. I am attaching vmcore-dmesg.txt and the bug report file.
nvidia-bug-report.log.gz (981.0 KB)
lspcivvv (153.2 KB)
vmcore-dmesg.txt (252.5 KB)