Hardware and Software Specs:
OS: Ubuntu 22.04 LTS
Motherboard: X570S AORUS MASTER
Processor: AMD Ryzen 9 5950X 16-Core
RAM: 64 GB
GPUs and driver:
GPUs: 2x NVIDIA GeForce RTX 3090
Driver: 535
CUDA: 11.7
We have been experiencing this issue while training computer vision models. The graphical interface stops working and the PC can only be accessed over an SSH connection. Rebooting the system only provides a temporary fix.
We have not found a reliable way to reproduce the crash. Sometimes a model trains for several days without the error, and other times it crashes after a few minutes. The error always appears on GPU 0 (PCI:0000:04:00), while GPU 1 has never had any issue.
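(For reference, the mapping between GPU index and PCI address can be checked with, e.g.:
nvidia-smi --query-gpu=index,pci.bus_id,name,uuid --format=csv
which confirms which physical card corresponds to GPU 0.)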
According to the NVIDIA docs (https://docs.nvidia.com/deploy/pdf/XID_Errors.pdf), this Xid may be caused by:
HW error: we do not know how to test for or rule this out.
SW error: we hit this issue with widely used deep learning models such as YOLOv5 and YOLOv8, as well as with small models trained from scratch such as ResNet18, so we do not think it is software related.
System memory corruption: we have tested with "compute-sanitizer --tool memcheck" and have not found any issue, although we will investigate further, since there is no reliable way to reproduce the error.
Bus error: we have not seen any PCIe bus errors in the logs, unlike in other posts, so we do not think this is the cause.
Thermal issue: we log the GPU temperatures (see the sketch after this list) and they never reach the maximum threshold.
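The temperature logging is roughly the following cron job (a simplified, illustrative sketch; the real script may differ in interval and wording), which produces the "GPU temp_monitor" lines visible in the journalctl output further below:
# Illustrative: log the temperature of both GPUs to syslog every few minutes
nvidia-smi --query-gpu=temperature.gpu --format=csv | sed 's/^/GPU temp_monitor: /' | logger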
We have read that it might also be related to power consumption, but we use a 1500 W PSU and the PC has been able to train on both graphics cards for hours. Moreover, the error also appears when we are training on GPU 0 alone.
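For single-GPU runs we restrict the process to one card with CUDA_VISIBLE_DEVICES, roughly as sketched below (train.py is a placeholder for the actual training entry point), and power draw can be sampled alongside the temperature if that helps:
CUDA_VISIBLE_DEVICES=0 python train.py    # GPU 0 only (PCI 0000:04:00, assuming CUDA order matches nvidia-smi order)
CUDA_VISIBLE_DEVICES=1 python train.py    # GPU 1 only
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 60    # sample power draw every 60 s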
We attach the NVIDIA bug report log (nvidia-bug-report.log.gz, 3.5 MB) and the output of journalctl:
sudo journalctl
sep 15 22:35:01 envidia22 logger[46029]: GPU temp_monitor: temperature.gpu
sep 15 22:35:01 envidia22 logger[46029]: GPU temp_monitor: 71
sep 15 22:35:01 envidia22 logger[46029]: GPU temp_monitor: 51
sep 15 22:35:01 envidia22 CRON[46023]: pam_unix(cron:session): session closed for user root
sep 15 22:35:23 envidia22 kernel: NVRM: GPU at PCI:0000:04:00: GPU-bf23593e-65a4-fe08-327d-519ac9b4e37c
sep 15 22:35:23 envidia22 kernel: NVRM: Xid (PCI:0000:04:00): 79, pid='', name=, GPU has fallen off the bus.
sep 15 22:35:23 envidia22 kernel: NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
sep 15 22:35:23 envidia22 kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
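(To pull just the relevant kernel messages out of the journal, something like the following can be used:
sudo journalctl -k | grep -iE 'NVRM|Xid')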
How should we proceed? Is there any way to fix or further isolate this error?
Thanks in advance.