We are running into an issue while training large language models: the GPU intermittently fails with unknown errors when power usage is high. The kernel log shows the message “GPU has fallen off the bus”. Rebooting the system resolves the problem only temporarily.
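In case it helps anyone trying to reproduce this, here is a minimal sketch of how the kernel log could be scanned for that message. The function name `find_bus_drop_lines` is hypothetical, and the script only matches the substring quoted above; it does not assume anything else about the format of the log line.

```python
import re
from typing import Iterable, List

# Substring quoted from the kernel log above, matched case-insensitively.
FALLEN_OFF_BUS = re.compile(r"GPU has fallen off the bus", re.IGNORECASE)


def find_bus_drop_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention the GPU falling off the bus.

    `lines` can be any iterable of strings, for example an open log file
    or the captured output of `dmesg` split into lines.
    """
    # Strip only the trailing newline so timestamps and prefixes are preserved.
    return [line.rstrip("\n") for line in lines if FALLEN_OFF_BUS.search(line)]
```

It can be fed the output of `dmesg` or whichever kernel log file your distribution writes. Comparing the timestamps of the matching lines against the training job's power draw at that moment may help confirm or rule out the correlation with high power usage.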
We found similar reports suggesting that this kind of failure can be related to the power supply. Our logs also show the CPU exhibiting similar problems, which led us to suspect a power supply issue. However, our IT department examined the system logs for both servers and told us that neither system had come anywhere near its peak power capacity. They also reported that CPU and cooling readings were well below any thresholds that would indicate a problem.
nvidia-bug-report.log.gz (1.9 MB)