RTX A5000 Kernel Panic - NVRM Xid 62 (Ubuntu 20.04 - kernel 5.13.0-52-generic)

Hi,

I have recurrent kernel panics on a compute server (Supermicro SYS-420GP-TNR) while training DNNs with TensorFlow (TF 2.9.3, CUDA 11.2.1).

I was training two models in parallel, one with CUDA_VISIBLE_DEVICES="1,2,3,4" and the other with CUDA_VISIBLE_DEVICES="6,7,8,9".
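
For reference, the two jobs are launched roughly like this (the script names below are placeholders, not our actual scripts):

# hypothetical launch commands; train_model_a.py / train_model_b.py are placeholder names
CUDA_VISIBLE_DEVICES="1,2,3,4" python train_model_a.py &
CUDA_VISIBLE_DEVICES="6,7,8,9" python train_model_b.py &
wait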

Here are the logs from /var/log/syslog just before the kernel panic (these logs are not dumped to disk during the crash itself):

Dec 29 12:22:37 loki kernel: [18841.259090] NVRM: GPU at PCI:0000:56:00: GPU-0a8a8962-35bf-f7f4-b603-4dbb6ebd2ad9
Dec 29 12:22:37 loki kernel: [18841.259102] NVRM: GPU Board Serial Number: 1322721054966
Dec 29 12:22:37 loki kernel: [18841.259106] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.261040] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.262766] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.264499] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.266165] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.267801] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.269490] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.271137] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.272851] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.274500] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.276141] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.277925] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.279576] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.281342] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.283015] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.284704] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.286379] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.288030] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
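
(For completeness, I pull these lines out of syslog with a simple grep, assuming the default Ubuntu log location:)

grep -i "NVRM: Xid" /var/log/syslog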

I installed the latest driver available for these GPUs (525.60.11), but I still have this issue.
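
(The installed version can be confirmed with a standard nvidia-smi query:)

nvidia-smi --query-gpu=driver_version --format=csv,noheader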

Previously, 8 of the 10 RTX A5000s were in an older server (ASRockRack 3U8G-C612: 2× Intel Xeon E5-2640 v4 with 8 PCIe 3.0 slots; details here: https://www.asrockrack.com/general/productdetail.asp?Model=3U8G-C612#Specifications). We moved them to the new server (Supermicro SYS-420GP-TNR) together with 2 new RTX A5000s, since that chassis can hold 10 A5000s.
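
Since the Xid 62 messages point at PCI:0000:56:00, that bus address can be mapped back to a GPU index and serial number with a standard nvidia-smi query:

nvidia-smi --query-gpu=index,pci.bus_id,serial,name --format=csv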

More information about the server:

I tested the GPUs with gpu-burn; no errors were detected:

# on the host
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
docker run -it --rm --gpus all -v $PWD:/gpu-burn -w /gpu-burn nvidia/cuda:11.1.1-devel bash

# inside the container: build for compute capability 8.6 (RTX A5000), then burn for 300 s
make clean && make COMPUTE=86
./gpu-burn 300
> [...]
> errors: 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0   temps: 65 C - 68 C - 68 C - 68 C - 70 C - 68 C - 71 C - 67 C - 70 C - 70 C
> [...]
Tested 10 GPUs:
	GPU 0: OK
	GPU 1: OK
	GPU 2: OK
	GPU 3: OK
	GPU 4: OK
	GPU 5: OK
	GPU 6: OK
	GPU 7: OK
	GPU 8: OK
	GPU 9: OK

I don’t know what else to try … I am attaching the report generated by nvidia-bug-report.sh with this topic.

Many thanks in advance for your assistance.

Best regards,
Julien G
nvidia-bug-report.log.gz (5.4 MB)

Please find attached:

  • the report nvidia-bug-report.log
  • a screenshot of the visible logs, taken from the remote IPMI console before rebooting the server after a crash
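
(For reference, the bug report was generated with NVIDIA's standard collection script, run as root:)

sudo nvidia-bug-report.sh
# writes nvidia-bug-report.log.gz to the current directory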

We use the kernel “5.13.0-52”:

uname -a 
> Linux loki 5.13.0-52-generic #59~20.04.1-Ubuntu SMP Thu Jun 16 21:21:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

We replaced 5.4.0-* with 5.13.0-* as recommended in the CUDA Installation Guide for Linux (note: when we found that page it referred to 5.13.0-*; it now lists 5.15.0-*. If needed, I can update the kernel quickly).
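
If we do update, I would expect to go through the standard Ubuntu HWE meta-package (this is an assumption about the upgrade path, we have not run it yet):

# pulls in the current 20.04 HWE kernel series (5.15.0-* at the time of writing)
sudo apt install --install-recommends linux-generic-hwe-20.04
sudo reboot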

Also, before installing the NVIDIA driver we followed these steps:

  1. Edit/Create /etc/modprobe.d/blacklist-nouveau.conf:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Then

echo options nouveau modeset=0 | sudo tee -a
> options nouveau modeset=0

modprobe -r nouveau
update-initramfs -u
reboot now
  2. After the reboot, install the recommended packages:
apt install linux-headers-$(uname -r) gcc make acpid dkms libglvnd-core-dev libglvnd0 libglvnd-dev

# and finally
chmod +x NVIDIA-Linux-x86_64-xxx.yy.zz.run
./NVIDIA-Linux-x86_64-xxx.yy.zz.run
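
As a sanity check after the install, the following confirms that nouveau is not loaded and the NVIDIA kernel modules are (this check is mine, not part of the procedure above):

lsmod | grep -i nouveau          # should print nothing
lsmod | grep "^nvidia"           # expect nvidia, nvidia_drm, nvidia_modeset, nvidia_uvm
cat /proc/driver/nvidia/version  # reports the loaded driver version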

When we first received the server, we had NaN issues when training models on multiple GPUs.
The cause was ACS. We found this NCCL issue and followed its instructions by editing the BIOS:

BIOS : Advanced / Chipset Configuration / North Bridge / IIO Configuration / Intel VT for Direct I/O
ACS Control : Disable

No more NaN issues with this setting.
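
For anyone hitting the same NaN problem: the ACS state can also be checked from the OS, as suggested in the NCCL troubleshooting docs (the grep below just filters the ACS capability lines from lspci):

# with ACS disabled, the ACSCtl lines on the PCIe bridges should show SrcValid-
sudo lspci -vvv | grep -i "ACSCtl"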