Nvidia lib crashes linux server

mounir.stambouli · August 2, 2023, 9:50am

Hello

We are running AI/ML applications on a Dell PowerEdge R730 with CentOS Linux 7.9 and CUDA 11.5.1 (driver version 495.29.05).

The server is equipped with two Tesla P100 cards.
03:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
82:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)

About twice a month the server crashes with a kernel panic and dumps its memory.
It happens under heavy AI/ML job load, and we found that the nvidia kernel module was the cause of the crash.

The crashes occur while a command that uses the nvidia module is running. We use nvidia-smi to monitor GPU activity. Most of the time the crash happens with nvidia-smi, but it can happen with any other software that uses the NVIDIA GPUs. This time it happened with the conda command.

crash /usr/lib/debug/lib/modules/3.10.0-1160.62.1.el7.x86_64/vmlinux vmcore

[…]
DATE: Tue Aug 1 09:00:49 2023
UPTIME: 15 days, 12:42:02
LOAD AVERAGE: 12.07, 5.42, 3.50
TASKS: 8789
NODENAME: slhdg002
RELEASE: 3.10.0-1160.62.1.el7.x86_64
VERSION: #1 SMP Tue Apr 5 16:57:59 UTC 2022
MACHINE: x86_64 (2199 Mhz)
MEMORY: 511.9 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000048"
PID: 1959
COMMAND: "conda"
TASK: ffff8d791ae6b180 [THREAD_INFO: ffff8d7c3b304000]
CPU: 19
STATE: TASK_RUNNING (PANIC)

cat vmcore-dmesg.txt

[…]
[1341716.439797] CPU: 19 PID: 1959 Comm: conda Kdump: loaded Tainted: P OE ------------ T 3.10.0-1160.62.1.el7.x86_64 #1
[…]
[1341716.439972] Call Trace:
[1341716.440122] [] ? _nv034134rm+0x162/0x2f0 [nvidia]
[1341716.440270] [] ? _nv032925rm+0x13f/0x210 [nvidia]
[1341716.440433] [] ? _nv032925rm+0x10e/0x210 [nvidia]
[…]
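
To see at a glance which module the frames of a trace like the one above belong to, the lines of vmcore-dmesg.txt can be tallied by module name. This is only a minimal sketch: the function name `modules_in_trace`, the regular expression, and the "vmlinux" label for frames without a module suffix are illustrative assumptions, not part of any NVIDIA or crash tooling.

```python
import re
from collections import Counter

# Matches frames such as "[1341716.440122] [] ? _nv034134rm+0x162/0x2f0 [nvidia]".
# The bracketed return address may be empty (as in the excerpt above) or hold "<ffff...>".
FRAME_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+"                 # dmesg timestamp
    r"\[[^\]]*\]\s+"                      # return address, possibly stripped
    r"(?:\?\s+)?"                         # "?" marks a frame the unwinder is unsure about
    r"(?P<symbol>[^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"     # trailing "[module]" is absent for core kernel code
)

def modules_in_trace(dmesg_text):
    """Count stack frames per module after the first "Call Trace:" line.

    Frames with no "[module]" suffix are counted under "vmlinux"
    (a label chosen for this sketch). If the text contains several
    traces, their frames are counted together.
    """
    counts = Counter()
    in_trace = False
    for line in dmesg_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match:
            counts[match.group("module") or "vmlinux"] += 1
    return counts

# Hypothetical usage:
#   with open("vmcore-dmesg.txt") as f:
#       print(modules_in_trace(f.read()).most_common())
```

A trace dominated by `[nvidia]` frames is consistent with the driver being on the faulting path, but it does not by itself prove where the NULL pointer originated.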

We tried workarounds from the NVIDIA forum and from the web, without any success:

checked for IRQ conflicts (I will attach the output of lspci -vvv)
blacklisted the nouveau driver
set pcirealloc to off
checked dkms status
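
One of the checks above, confirming that the nouveau blacklist took effect, can be scripted by reading /proc/modules, where the first field of each line is the name of a loaded module. A minimal sketch follows; the helper names `loaded_modules` and `check_driver_state` are hypothetical, and the check only reports what is loaded right now, not what the blacklist configuration says.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text.

    Each non-empty line starts with the module name, followed by its
    size, reference count, dependents, state, and load address.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }

def check_driver_state(proc_modules_text):
    """Report whether the nouveau and nvidia modules are currently loaded."""
    mods = loaded_modules(proc_modules_text)
    return {
        "nouveau_loaded": "nouveau" in mods,
        "nvidia_loaded": "nvidia" in mods,
    }

# Hypothetical usage on the affected server:
#   with open("/proc/modules") as f:
#       print(check_driver_state(f.read()))
```

On a correctly configured host one would expect `nouveau_loaded` to be false and `nvidia_loaded` to be true.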

I was able to run nvidia-bug-report.sh while the server was up, running and stable. I will attach vmcore-dmesg.txt and the bug report file.

nvidia-bug-report.log.gz (981.0 KB)
lspcivvv (153.2 KB)
vmcore-dmesg.txt (252.5 KB)

aplattner · August 2, 2023, 6:30pm

495.29.05 is pretty old at this point. Does this problem still occur with the latest release (currently 535.86.05)?

mounir.stambouli · August 3, 2023, 9:17am

Thanks for your answer.
At this time we cannot upgrade, as ML jobs are running in production on these 2 servers.
We also have servers equipped with V100 GPU cards and CUDA 11.5 (495.29.05), and no crashes occur on them.
