"Xid: 79, GPU has fallen off the bus" Error during High GPU Power Usage - Power Supply Issue?

We have been experiencing an issue while training large language models: the GPU sometimes reports unknown errors when power usage is high. The kernel log shows the message “GPU has fallen off the bus”. Rebooting the system resolves the problem only temporarily.

We found reports of similar issues in which the power supply was suggested as the cause. Our logs also show the CPU exhibiting similar problems, which leads us to suspect a power supply issue as well. However, when we reached out to our IT department, they told us they had examined the system logs for both servers and found that the systems had not come anywhere near their peak power capacity. They also reported that CPU load and cooling were well below any thresholds that would indicate a problem.

nvidia-bug-report.log.gz (1.9 MB)

Since both GPUs are affected, I also suspect a PSU issue: when the GPUs boost, the PSU may be detecting the sudden power spike as a short circuit. To check, please limit the clocks to prevent boost, e.g.
nvidia-smi -lgc 300,1500
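
To make that test easier to repeat, here is a rough sketch of how the clock lock could be wrapped together with some logging. The function names (lock_clocks, reset_clocks, log_power, count_xid79) and the output file name are made up for illustration; the nvidia-smi options used are the standard -lgc, -rgc and --query-gpu ones. Locking clocks normally needs root, and the Xid line pattern assumes the usual "NVRM: Xid (PCI:...): 79, ..." kernel log format.

```shell
#!/usr/bin/env bash
# Sketch only: helper names below are hypothetical, not part of any NVIDIA tool.

# Lock the GPU core clocks to a range in MHz so the GPUs cannot boost (needs root).
lock_clocks() {
    local min_mhz="${1:-300}" max_mhz="${2:-1500}"
    nvidia-smi -lgc "${min_mhz},${max_mhz}"
}

# Remove the clock lock again once the test is finished.
reset_clocks() {
    nvidia-smi -rgc
}

# Append one CSV sample per second (power draw, SM clock, temperature) to a file.
# Runs until interrupted; start it in the background during a training run.
log_power() {
    local out="${1:-gpu_power.csv}"
    nvidia-smi --query-gpu=timestamp,index,power.draw,clocks.sm,temperature.gpu \
               --format=csv -l 1 >> "$out"
}

# Count Xid 79 ("fallen off the bus") events in kernel log text read from stdin.
# grep -c exits non-zero when the count is 0, so the exit status is masked.
count_xid79() {
    grep -cE 'NVRM: Xid \([^)]*\): 79,' || true
}
```

One way to use it: run lock_clocks as root, start log_power in the background, repeat the training job, then check `sudo dmesg | count_xid79`. If the count stays at 0 with clocks locked but the error returns after reset_clocks, that would point toward boost-related power spikes rather than sustained load.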

About the odd messages from the CPU, please read this:
