Xid 79 Error: RTX 4090 GPU Falls Off Bus with NVIDIA Driver 535.161.07 on Ubuntu 22.04 LTS Server

andrey.arapov · April 9, 2024, 9:14am

Hello NVIDIA community and support team,

I am encountering a recurring issue on my Ubuntu 22.04 LTS server: an NVIDIA GPU keeps falling off the bus, as indicated by Xid 79 errors. This has happened three times in the past three days.

System Environment:

Operating System: Ubuntu 22.04.4 LTS (x86_64)
Linux Kernel: 5.15.0-101-generic
NVIDIA Driver Version: 535.161.07 (installed via ubuntu-drivers autoinstall)
Kubernetes Version: v1.28.6 (set up with kubespray v2.24.1)
K8s NVIDIA Device Plugin: NVIDIA/k8s-device-plugin 0.14.5 (delivered via Helm chart)

GPU Usage:
The GPUs are employed by Nimble project miners, which perform computations as detailed on nimble.technology. The mining scripts in use are available at this GitHub repository.

Error Messages Encountered:

root@node1:~# nvidia-smi 
Unable to determine the device handle for GPU0000:A1:00.0: Unknown Error

and from the dmesg output:

root@node1:~# uptime 
 08:38:19 up 15:34,  1 user,  load average: 6.49, 7.16, 7.32
root@node1:~# dmesg -T | grep NVRM
[Mon Apr  8 17:04:05 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.161.07  Sat Feb 17 22:55:48 UTC 2024
[Tue Apr  9 01:21:38 2024] NVRM: GPU at PCI:0000:a1:00: GPU-979426f2-893a-7cbb-c4cf-81472f89a462
[Tue Apr  9 01:21:38 2024] NVRM: Xid (PCI:0000:a1:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Tue Apr  9 01:21:38 2024] NVRM: GPU 0000:a1:00.0: GPU has fallen off the bus.
[Tue Apr  9 01:21:38 2024] NVRM: A GPU crash dump has been created. If possible, please run
                           NVRM: nvidia-bug-report.sh as root to collect this data before
                           NVRM: the NVIDIA kernel module is unloaded.

After a reboot, nvidia-smi reports the following as normal, until the Xid 79 issue occurs again:

root@node1:~# nvidia-smi | head -4
Tue Apr  9 09:13:25 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

root@node1:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-bd1622d9-a72a-fe8a-20d9-f3a7304619e2)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-5cfcf04e-39e4-282b-7ebe-871356efad24)
GPU 2: NVIDIA GeForce RTX 4090 (UUID: GPU-e70a3ebe-3657-8c2e-27b1-4c6aa1692e1d)
GPU 3: NVIDIA GeForce RTX 4090 (UUID: GPU-69f7831c-980f-8343-5b34-9b2968469835)
GPU 4: NVIDIA GeForce RTX 4090 (UUID: GPU-71f7bd7b-7f03-0c44-f12b-ae3431838e80)
GPU 5: NVIDIA GeForce RTX 4090 (UUID: GPU-979426f2-893a-7cbb-c4cf-81472f89a462)
GPU 6: NVIDIA GeForce RTX 4090 (UUID: GPU-6ed8c811-d1e6-def4-65b0-3998d557c78c)
GPU 7: NVIDIA GeForce RTX 4090 (UUID: GPU-cd8321b3-b419-f85a-3e09-014eb899c8d5)
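
For reference, the UUID in the dmesg output above (GPU-979426f2-893a-7cbb-c4cf-81472f89a462) matches GPU 5 in this listing. A minimal sketch of how that lookup could be scripted is below. The function names (xid_events, uuid_for_pci, gpu_index_for_uuid) are made up for illustration, and the patterns assume the NVRM and "nvidia-smi -L" line formats shown in this post. Because nvidia-smi fails once the GPU has fallen off the bus, the "nvidia-smi -L" listing has to be saved while the node is still healthy.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; patterns assume the log formats quoted in this post.

# Print "<pci-address> <xid-code>" for every Xid line in kernel-log text on stdin.
xid_events() {
  sed -n 's/.*NVRM: Xid (PCI:\([0-9A-Fa-f:.]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Print the UUID that NVRM logged for a PCI address (kernel-log text on stdin).
uuid_for_pci() {
  local pci="$1"
  sed -n "s/.*NVRM: GPU at PCI:${pci}: \(GPU-[0-9a-f-]*\).*/\1/p" | head -n 1
}

# Print the nvidia-smi index for a UUID ("nvidia-smi -L" text on stdin).
gpu_index_for_uuid() {
  local uuid="$1"
  sed -n "s/^GPU \([0-9][0-9]*\): .*(UUID: ${uuid}).*/\1/p"
}

# Possible use on the affected node (assumes the listing was saved earlier
# with: nvidia-smi -L > /root/gpu-list.txt):
#   dmesg -T | xid_events
#   uuid=$(dmesg -T | uuid_for_pci 0000:a1:00)
#   gpu_index_for_uuid "$uuid" < /root/gpu-list.txt
```

The index can change between reboots or after hardware changes, so the UUID is the safer identifier to track when comparing incidents.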

Additionally:
We have two more servers with the exact same configuration and GPUs, and they do not experience this issue.

Steps Taken:
I have generated a bug report using nvidia-bug-report.sh and I am including the compressed log file (nvidia-bug-report.log.gz) with this post.

No recent system changes were made that could be directly linked to this issue. I have not yet been able to identify a specific action or event that triggers this error. The GPUs are under consistent load from the Nimble project miners when the issue occurs.

Please find the attached nvidia-bug-report.log.gz file for detailed system information and logs.

I am aware that personal information may be included in the bug report and consent to its use for the purpose of troubleshooting this issue.

Any guidance or assistance you can provide would be greatly appreciated.

Thank you for your time and help.

node1.pdx.nb.akash.pub_nvidia-bug-report.log.1.gz (1.6 MB)
node1.pdx.nb.akash.pub_nvidia-bug-report.log.2.gz (1.5 MB)

andrey.arapov · April 9, 2024, 6:38pm

We’ve replaced the server (mainboard) and the GPUs.
I’ll post here if we encounter any further NVIDIA-related issues.
