Nvidia lib crashes linux server

Hello

We are running AI/ML applications on a Dell PowerEdge R730 with CentOS Linux 7.9 and CUDA 11.5.1 (driver version 495.29.05).

The server is equipped with 2 Tesla P100 cards:
03:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
82:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)

About twice a month the server crashes with a kernel panic and dumps its memory.
It happens under heavy AI/ML job load, and we found that the nvidia kernel module was the cause of the crash.

The crashes occur while a command that uses the nvidia module is running. We use nvidia-smi to monitor GPU activity. Most of the time the crash happens with nvidia-smi, but it can happen with any other software that uses the NVIDIA GPUs. This time it happened with a conda command.

crash /usr/lib/debug/lib/modules/3.10.0-1160.62.1.el7.x86_64/vmlinux vmcore

[…]
DATE: Tue Aug 1 09:00:49 2023
UPTIME: 15 days, 12:42:02
LOAD AVERAGE: 12.07, 5.42, 3.50
TASKS: 8789
NODENAME: slhdg002
RELEASE: 3.10.0-1160.62.1.el7.x86_64
VERSION: #1 SMP Tue Apr 5 16:57:59 UTC 2022
MACHINE: x86_64 (2199 Mhz)
MEMORY: 511.9 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000048"
PID: 1959
COMMAND: "conda"
TASK: ffff8d791ae6b180 [THREAD_INFO: ffff8d7c3b304000]
CPU: 19
STATE: TASK_RUNNING (PANIC)

cat vmcore-dmesg.txt

[…]
[1341716.439797] CPU: 19 PID: 1959 Comm: conda Kdump: loaded Tainted: P OE ------------ T 3.10.0-1160.62.1.el7.x86_64 #1
[…]
[1341716.439972] Call Trace:
[1341716.440122] [] ? _nv034134rm+0x162/0x2f0 [nvidia]
[1341716.440270] [] ? _nv032925rm+0x13f/0x210 [nvidia]
[1341716.440433] [] ? _nv032925rm+0x10e/0x210 [nvidia]
[…]
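
To pull only the frames attributed to the nvidia module out of vmcore-dmesg.txt, something like the following could be used. This is a minimal sketch: the function names are made up for illustration, and it assumes each stack frame line carries the owning module in square brackets, as in the trace above.

```shell
# Hypothetical helper: print the call-trace lines attributed to the nvidia
# module from a kdump dmesg file (default: vmcore-dmesg.txt in the cwd).
nvidia_frames() {
    file="${1:-vmcore-dmesg.txt}"
    # -F matches the literal string "[nvidia]" instead of a bracket expression.
    grep -F '[nvidia]' "$file"
}

# Hypothetical helper: count those lines, e.g. to compare several crash dumps.
count_nvidia_frames() {
    file="${1:-vmcore-dmesg.txt}"
    grep -c -F '[nvidia]' "$file"
}
```

Running `nvidia_frames vmcore-dmesg.txt` against each saved dump makes it easier to see whether the same `_nv…rm` symbols show up in every crash.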

We tried workarounds from the nvidia forum and from the web without any success:

  • checked for IRQ conflicts (I will attach the output of lspci -vvv)
  • blacklisted nouveau driver
  • set the pci=realloc=off kernel parameter
  • checked dkms status
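
For reference, the workarounds above can be re-checked after a reboot with a few read-only commands. The sketch below is only an illustration: the helper names are hypothetical, and it assumes that "pcirealloc" refers to the `pci=realloc=off` kernel boot parameter.

```shell
# Hypothetical helpers for re-checking the workarounds (read-only).

# Succeeds if the given kernel command line contains the exact parameter.
# $1 = command line string, $2 = parameter (assumed here: pci=realloc=off)
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Succeeds if the named module appears in lsmod-style output read from stdin.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage on the live server:
#   cmdline_has "$(cat /proc/cmdline)" pci=realloc=off && echo "realloc is off"
#   lsmod | module_loaded nouveau && echo "nouveau is still loaded"
#   dkms status
```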

I was able to run nvidia-bug-report.sh while the server was up, running and stable. I will attach vmcore-dmesg.txt and the bug report file.

nvidia-bug-report.log.gz (981.0 KB)
lspcivvv (153.2 KB)
vmcore-dmesg.txt (252.5 KB)

495.29.05 is pretty old at this point. Does this problem still occur with the latest release (currently 535.86.05)?

Thanks for your answer.
At this time we cannot upgrade, as ML jobs are running in production on these 2 servers.
We also have servers equipped with V100 GPU cards and CUDA 11.5 (495.29.05), and no crashes occur on them.
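
To compare the P100 and V100 servers side by side, a small inventory command could be run on each and the outputs diffed. This is a sketch, not something we have in place: `gpu_inventory` is a hypothetical helper name, and the choice of query fields is an assumption about what is worth comparing.

```shell
# Hypothetical helper: print one CSV line per GPU with index, model,
# driver version and VBIOS version.
gpu_inventory() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found in PATH" >&2
        return 1
    fi
    nvidia-smi --query-gpu=index,name,driver_version,vbios_version \
               --format=csv,noheader
}

# Usage: gpu_inventory > "$(hostname).gpus.csv", then diff the files
# collected from a P100 server and a V100 server.
```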