RTX A5000 Kernel Panic - NVRM Xid 62 (Ubuntu 20.04 - kernel 5.13.0-52-generic)

Hi,

I have recurrent kernel panics on a compute server (Supermicro SYS-420GP-TNR) while training DNNs with TensorFlow (TF 2.9.3, CUDA 11.2.1).

I was training two models in parallel, one with CUDA_VISIBLE_DEVICES="1,2,3,4" and the other with CUDA_VISIBLE_DEVICES="6,7,8,9".
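
For reference, the two jobs are launched roughly like this (the script names below are placeholders, not our actual scripts):

# hypothetical launch commands; train_model_a.py / train_model_b.py are placeholder names
CUDA_VISIBLE_DEVICES="1,2,3,4" python train_model_a.py &
CUDA_VISIBLE_DEVICES="6,7,8,9" python train_model_b.py &
wait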

Here are the logs from /var/log/syslog just before the kernel panic (these logs are not dumped to disk during the crash itself):

Dec 29 12:22:37 loki kernel: [18841.259090] NVRM: GPU at PCI:0000:56:00: GPU-0a8a8962-35bf-f7f4-b603-4dbb6ebd2ad9
Dec 29 12:22:37 loki kernel: [18841.259102] NVRM: GPU Board Serial Number: 1322721054966
Dec 29 12:22:37 loki kernel: [18841.259106] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.261040] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.262766] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.264499] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.266165] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.267801] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.269490] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.271137] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.272851] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.274500] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.276141] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.277925] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.279576] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.281342] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.283015] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.284704] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.286379] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.288030] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
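
(For completeness, I pull these lines out of syslog with a simple grep, assuming the default Ubuntu log location:)

grep -i "NVRM: Xid" /var/log/syslog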

I installed the latest driver available for these GPUs (525.60.11), but I still have this issue.
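
(The installed version can be confirmed with a standard nvidia-smi query:)

nvidia-smi --query-gpu=driver_version --format=csv,noheader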

Previously, 8 of the 10 RTX A5000s were in an older server (ASRockRack 3U8G-C612: 2× Intel Xeon E5-2640 v4 with 8 PCIe 3.0 slots; details here: https://www.asrockrack.com/general/productdetail.asp?Model=3U8G-C612#Specifications). We moved them to the new server (Supermicro SYS-420GP-TNR) together with 2 new RTX A5000s, since that chassis can hold 10 A5000s.
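
Since the Xid 62 messages point at PCI:0000:56:00, that bus address can be mapped back to a GPU index and serial number with a standard nvidia-smi query:

nvidia-smi --query-gpu=index,pci.bus_id,serial,name --format=csv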

More information about the server:

I tested the GPUs with gpu-burn; no errors were detected:

# on the host
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
docker run -it --rm --gpus all -v $PWD:/gpu-burn -w /gpu-burn nvidia/cuda:11.1.1-devel bash

# inside the container: build for compute capability 8.6 (RTX A5000), then burn for 300 s
make clean && make COMPUTE=86
./gpu-burn 300
> [...]
> errors: 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0   temps: 65 C - 68 C - 68 C - 68 C - 70 C - 68 C - 71 C - 67 C - 70 C - 70 C
> [...]
Tested 10 GPUs:
	GPU 0: OK
	GPU 1: OK
	GPU 2: OK
	GPU 3: OK
	GPU 4: OK
	GPU 5: OK
	GPU 6: OK
	GPU 7: OK
	GPU 8: OK
	GPU 9: OK

I don’t know what else to try … I am attaching the report generated by nvidia-bug-report.sh with this topic.

Many thanks in advance for your assistance.

Best regards,
Julien G
nvidia-bug-report.log.gz (5.4 MB)

Please find attached:

  • the report nvidia-bug-report.log
  • a screenshot of the visible logs, taken from the remote IPMI console before rebooting the server after a crash
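
(For reference, the bug report was generated with NVIDIA's standard collection script, run as root:)

sudo nvidia-bug-report.sh
# writes nvidia-bug-report.log.gz to the current directory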

We use the kernel “5.13.0-52”:

uname -a 
> Linux loki 5.13.0-52-generic #59~20.04.1-Ubuntu SMP Thu Jun 16 21:21:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

We replaced 5.4.0-* with 5.13.0-* as recommended in the CUDA Installation Guide for Linux (note: when we found that page it referred to 5.13.0-*; it now lists 5.15.0-*. If needed, I can update the kernel quickly).
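
If we do update, I would expect to go through the standard Ubuntu HWE meta-package (this is an assumption about the upgrade path, we have not run it yet):

# pulls in the current 20.04 HWE kernel series (5.15.0-* at the time of writing)
sudo apt install --install-recommends linux-generic-hwe-20.04
sudo reboot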

Also, before installing the NVIDIA driver we followed these steps:

  1. Edit/Create /etc/modprobe.d/blacklist-nouveau.conf:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Then

echo options nouveau modeset=0 | sudo tee -a
> options nouveau modeset=0

modprobe -r nouveau
update-initramfs -u
reboot now
  2. After the reboot, install the recommended packages:
apt install linux-headers-$(uname -r) gcc make acpid dkms libglvnd-core-dev libglvnd0 libglvnd-dev

# and finally
chmod +x NVIDIA-Linux-x86_64-xxx.yy.zz.run
./NVIDIA-Linux-x86_64-xxx.yy.zz.run
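
As a sanity check after the install, the following confirms that nouveau is not loaded and the NVIDIA kernel modules are (this check is mine, not part of the procedure above):

lsmod | grep -i nouveau          # should print nothing
lsmod | grep "^nvidia"           # expect nvidia, nvidia_drm, nvidia_modeset, nvidia_uvm
cat /proc/driver/nvidia/version  # reports the loaded driver version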

When we first received the server, we had NaN issues when training models on multiple GPUs.
The cause was ACS. We found this NCCL issue and followed its instructions by editing the BIOS:

BIOS : Advanced / Chipset Configuration / North Bridge / IIO Configuration / Intel VT for Direct I/O
ACS Control : Disable

No more NaN issues with this setting.
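
For anyone hitting the same NaN problem: the ACS state can also be checked from the OS, as suggested in the NCCL troubleshooting docs (the grep below just filters the ACS capability lines from lspci):

# with ACS disabled, the ACSCtl lines on the PCIe bridges should show SrcValid-
sudo lspci -vvv | grep -i "ACSCtl"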