Hardware and Software Specs:
OS: Ubuntu 22.04 LTS
Motherboard: X570S AORUS MASTER
Processor: AMD Ryzen 9 5950X 16-Core
RAM: 64 GB
GPUs and driver:
GPUs: 2x NVIDIA GeForce RTX 3090
Driver: 535
CUDA: 11.7
We have been experiencing this issue while training computer vision models. The graphical interface stops working and the PC can only be accessed over an SSH connection. Rebooting the system only provides a temporary fix.
We have not found a reliable way to reproduce the crash. Sometimes a model trains for several days without the error, and other times it crashes after a few minutes. The error always appears on GPU 0 (PCI:0000:04:00), while GPU 1 has never had any issue.
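(For reference, the mapping between GPU index and PCI address can be checked with, e.g.:
nvidia-smi --query-gpu=index,pci.bus_id,name,uuid --format=csv
which confirms which physical card corresponds to GPU 0.)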
According to the NVIDIA docs (https://docs.nvidia.com/deploy/pdf/XID_Errors.pdf), this Xid may be caused by:
HW error: we do not know how to test for or rule this out.
SW error: we hit this issue with widely used deep learning models such as YOLOv5 and YOLOv8, as well as with small models trained from scratch such as ResNet18, so we do not think it is software related.
System memory corruption: we have tested with "compute-sanitizer --tool memcheck" and have not found any issue, although we will investigate further, since there is no reliable way to reproduce the error.
Bus error: we have not seen any PCIe bus errors in the logs, unlike in other posts, so we do not think this is the cause.
Thermal issue: we log the GPU temperatures (see the sketch after this list) and they never reach the maximum threshold.
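The temperature logging is roughly the following cron job (a simplified, illustrative sketch; the real script may differ in interval and wording), which produces the "GPU temp_monitor" lines visible in the journalctl output further below:
# Illustrative: log the temperature of both GPUs to syslog every few minutes
nvidia-smi --query-gpu=temperature.gpu --format=csv | sed 's/^/GPU temp_monitor: /' | logger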
We have read that it might also be related to power consumption, but we use a 1500 W PSU and the PC has been able to train on both graphics cards for hours. Moreover, the error also appears when we are training on GPU 0 alone.
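For single-GPU runs we restrict the process to one card with CUDA_VISIBLE_DEVICES, roughly as sketched below (train.py is a placeholder for the actual training entry point), and power draw can be sampled alongside the temperature if that helps:
CUDA_VISIBLE_DEVICES=0 python train.py    # GPU 0 only (PCI 0000:04:00, assuming CUDA order matches nvidia-smi order)
CUDA_VISIBLE_DEVICES=1 python train.py    # GPU 1 only
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 60    # sample power draw every 60 s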
We attach the NVIDIA bug report log (nvidia-bug-report.log.gz, 3.5 MB) and the output of journalctl:
sudo journalctl
sep 15 22:35:01 envidia22 logger[46029]: GPU temp_monitor: temperature.gpu
sep 15 22:35:01 envidia22 logger[46029]: GPU temp_monitor: 71
sep 15 22:35:01 envidia22 logger[46029]: GPU temp_monitor: 51
sep 15 22:35:01 envidia22 CRON[46023]: pam_unix(cron:session): session closed for user root
sep 15 22:35:23 envidia22 kernel: NVRM: GPU at PCI:0000:04:00: GPU-bf23593e-65a4-fe08-327d-519ac9b4e37c
sep 15 22:35:23 envidia22 kernel: NVRM: Xid (PCI:0000:04:00): 79, pid='', name=, GPU has fallen off the bus.
sep 15 22:35:23 envidia22 kernel: NVRM: GPU 0000:04:00.0: GPU has fallen off the bus.
sep 15 22:35:23 envidia22 kernel: NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
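(To pull just the relevant kernel messages out of the journal, something like the following can be used:
sudo journalctl -k | grep -iE 'NVRM|Xid')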
How should we proceed? Is there any way to fix or further isolate this error?
Thanks in advance.