Hi,
I have recurrent kernel panics on a compute server (Supermicro SYS-420GP-TNR) while training DNNs with TensorFlow (TF 2.9.3, with CUDA 11.2.1).
I was training two models in parallel, one with CUDA_VISIBLE_DEVICES="1,2,3,4" and the other with CUDA_VISIBLE_DEVICES="6,7,8,9".
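For reference, here is a minimal sketch of how the two trainings were started (train.py is just a placeholder for the actual training script):
# hypothetical training script name, one process per group of GPUs
CUDA_VISIBLE_DEVICES="1,2,3,4" python train.py &
CUDA_VISIBLE_DEVICES="6,7,8,9" python train.py &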
Here are the logs from /var/log/syslog just before the kernel panic (the messages emitted during the crash itself are not flushed to disk):
Dec 29 12:22:37 loki kernel: [18841.259090] NVRM: GPU at PCI:0000:56:00: GPU-0a8a8962-35bf-f7f4-b603-4dbb6ebd2ad9
Dec 29 12:22:37 loki kernel: [18841.259102] NVRM: GPU Board Serial Number: 1322721054966
Dec 29 12:22:37 loki kernel: [18841.259106] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.261040] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.262766] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.264499] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.266165] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.267801] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.269490] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.271137] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.272851] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.274500] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.276141] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.277925] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.279576] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.281342] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.283015] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.284704] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.286379] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Dec 29 12:22:37 loki kernel: [18841.288030] NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
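For anyone who wants to check this on their own machine, these Xid messages can be pulled from the kernel log with, e.g.:
# NVRM / Xid messages in the kernel ring buffer and in syslog
sudo dmesg -T | grep -i 'NVRM'
grep 'NVRM: Xid' /var/log/syslog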
I installed the latest driver available for these GPUs (525.60.11), but the issue persists.
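For what it's worth, the installed driver version can be confirmed per GPU with:
# report driver version for each GPU
nvidia-smi --query-gpu=index,name,driver_version --format=csv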
Previously, 8 of the 10 RTX A5000s were in an older server (ASRockRack 3U8G-C612: 2× Intel Xeon E5-2640 v4, 8 PCIe 3.0 slots; details here: https://www.asrockrack.com/general/productdetail.asp?Model=3U8G-C612#Specifications). We moved them to the new server (Supermicro SYS-420GP-TNR) together with 2 new RTX A5000s (since that server can hold 10 A5000s).
More information about the new server:
- Supermicro SYS-420GP-TNR
- 2× Intel Xeon Gold 5317
- 12 PCIe 4.0 slots
- Product page: SYS-420GP-TNR | 4U | SuperServer | Supermicro
I tested the GPUs with gpu-burn; no errors were detected:
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
docker run -it --rm --gpus all -v $PWD:/gpu-burn -w /gpu-burn nvidia/cuda:11.1.1-devel bash
# inside the container:
make clean && make COMPUTE=86
./gpu_burn 300
> [...]
> errors: 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 temps: 65 C - 68 C - 68 C - 68 C - 70 C - 68 C - 71 C - 67 C - 70 C - 70 C
> [...]
Tested 10 GPUs:
GPU 0: OK
GPU 1: OK
GPU 2: OK
GPU 3: OK
GPU 4: OK
GPU 5: OK
GPU 6: OK
GPU 7: OK
GPU 8: OK
GPU 9: OK
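As a side note, temperatures can also be watched from the host while the burn is running, e.g.:
# poll GPU index, temperature and power draw every 5 seconds
nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv -l 5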
I don’t know what else to try… I will attach the report generated by nvidia-bug-report.sh once the topic is opened.
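For reference, the report was produced with the standard tool shipped with the driver:
# writes nvidia-bug-report.log.gz in the current directory
sudo nvidia-bug-report.sh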
Many thanks in advance for your assistance.
Best regards,
Julien G
nvidia-bug-report.log.gz (5.4 MB)