Hi,
We keep hitting the same issue on one of our A6000 GPUs, but when we return it to our vendor they claim all tests pass. The RMA has already been rejected once, yet the problem keeps coming back.
Hardware/Environment
- Server: SuperMicro SYS-420GP-TNR with 8x RTX A6000 GPUs
- All eight GPUs were purchased early this year, and only this one keeps failing
- Jobs run inside the NVIDIA NGC PyTorch 23.05 Docker image; the issue also occurs with other Docker images
Describing the issue
- While training neural networks, after many hours, `nvidia-smi` displays `ERR!` under the `GPU Fan` column for the GPU with the issue.
- Upon failing, the running job is killed, and while all other GPUs return to ambient temperature, the problematic GPU does not drop back to ambient temps.
- `journalctl` shows that we first encounter XID 62, followed by XID 45, which leads to the failure.
- We monitored the temperature of the GPU at the time of failure and found it was between 65 °C and 75 °C (a simplified version of our monitoring loop is included after the log excerpt below).
- We sent the GPU for RMA, but it was returned with no issues detected. Yet we keep encountering the same issue, even after moving the GPU to a different slot on the motherboard with better airflow. We have 7 other A6000s in the server and have not encountered similar issues on any of them.
- Below is a short excerpt of `journalctl | grep -i nvrm` (serial numbers blanked), and the `nvidia-bug-report.log.gz` is attached.
Sep 25 22:15:46 cvlab20 kernel: NVRM: GPU at PCI:0000:ce:00: GPU-20a264c9-52eb-54c7-6559-52174dc4b869
Sep 25 22:15:46 cvlab20 kernel: NVRM: GPU Board Serial Number: xxxxxxxxxxxxxx
Sep 25 22:15:46 cvlab20 kernel: NVRM: Xid (PCI:0000:ce:00): 43, pid=470075, name=python, Ch 00000008
Sep 26 14:52:54 cvlab20 kernel: NVRM: GPU at PCI:0000:56:00: GPU-03d0c5de-d383-caba-3e0d-7044798169bf
Sep 26 14:52:54 cvlab20 kernel: NVRM: GPU Board Serial Number: xxxxxxxxxxxxxx
Sep 26 14:52:54 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
Sep 26 14:52:54 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 45, pid=94623, name=python, Ch 00000008
Sep 26 14:52:55 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 45, pid=94623, name=python, Ch 00000009
Sep 26 14:52:55 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 45, pid=94623, name=python, Ch 0000000a
Sep 26 14:52:55 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 45, pid=94623, name=python, Ch 0000000b
Sep 26 14:52:55 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 45, pid=94623, name=python, Ch 0000000c
Sep 26 14:52:55 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 45, pid=94623, name=python, Ch 0000000d
Sep 26 14:52:55 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 45, pid=94623, name=python, Ch 0000000e
Sep 26 14:52:55 cvlab20 kernel: NVRM: Xid (PCI:0000:56:00): 45, pid=94623, name=python, Ch 0000000f
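For reference, this is roughly how we log temperature and fan state during the runs (a simplified sketch, not our exact script; it assumes the `pynvml` Python bindings for NVML are installed):

```python
# Simplified monitoring sketch: logs temperature and fan speed for every
# GPU once a minute. On the failing board the fan query starts raising
# NVMLError, which (presumably) is how the "ERR!" shown by nvidia-smi
# surfaces through NVML.
import time
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            try:
                fan = f"{pynvml.nvmlDeviceGetFanSpeed(handle)} %"
            except pynvml.NVMLError as err:
                fan = f"ERR ({err})"  # fan sensor unreadable, as seen on the bad GPU
            print(f"GPU{i}: {temp} C, fan {fan}", flush=True)
        time.sleep(60)
finally:
    pynvml.nvmlShutdown()
```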
How to proceed?
Even though the RMA was rejected, we keep encountering the same issue on the same GPU. Could this be a faulty GPU? Either way, how should we proceed from here?
Thanks in advance!
nvidia-bug-report.log.gz (3.6 MB)