Hello everyone,
I encountered a problem while running my NVIDIA GPU. The GPU suddenly crashes and stops working, and the terminal displays the following error messages:
After the crash, the GPU becomes unresponsive. Restarting the system temporarily resolves the issue, but it eventually recurs.
Issue Details:
- The error indicates a power fault on PCIe Slot 2, as reported by pcihp (PCIe hot-plug).
- Another error message shows a failure to sync register 0x4f0800 with error code -5 (typically indicating an I/O error).
- The crash can occur under high GPU load, but sometimes it also happens during regular usage.
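To help correlate the two messages described above, here is a minimal sketch of how the kernel log could be filtered for them. The match patterns are assumptions based on my description of the errors, not exact driver strings, and the "NVRM: Xid" pattern is an extra assumption added to catch NVIDIA driver fault reports. The function names are illustrative only.

```python
import re
import subprocess

# Assumed patterns, loosely matching the messages described in this post.
# They are not exact kernel or driver strings and may need adjusting.
PATTERNS = {
    "power_fault": re.compile(r"power fault", re.IGNORECASE),
    "register_sync": re.compile(r"unable to sync register", re.IGNORECASE),
    "xid": re.compile(r"NVRM: Xid"),  # assumption: NVIDIA driver fault reports
}


def classify(lines):
    """Group log lines by the first pattern they match."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
                break
    return hits


def read_kernel_log():
    """Return kernel log lines from dmesg, or an empty list if unavailable."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        # dmesg is missing or cannot be executed on this system.
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    for name, matched in classify(read_kernel_log()).items():
        print(f"{name}: {len(matched)} line(s)")
        for line in matched:
            print("  " + line)
```

On Ubuntu 22.04, reading the kernel log may require running the script with sudo. If the power fault lines consistently appear just before the register sync failures, that would suggest the second error is a consequence of the first.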
My Questions:
- Power Fault Issue: Does the “Power fault” mean there is an issue with power delivery to the GPU, or could it be a problem with the GPU itself? I’ve checked the external power connections, and everything seems to be in place.
- Register Sync Failure: Is the “Unable to sync register” error related to the power fault, or could it be due to a hardware or driver issue with the GPU?
- How to Troubleshoot: What steps should I take to further diagnose the problem? When I connect two GPUs to the motherboard for an NCCL P2P stress test, I often encounter a situation where both GPUs simultaneously stop working.
Any advice on how to resolve or further investigate this issue would be greatly appreciated. Thank you in advance!
Additional Information:
- Environment:
  - CPU: Intel N97
  - System: Ubuntu 22.04
  - Kernel: 5.15.0-119-generic
  - GPU Model: NVIDIA A4000
  - Driver Version: 560.35.03
  - Power Supply: 650W