Xid 79 Error: RTX 4090 GPU Falls Off Bus with NVIDIA Driver 535.161.07 on Ubuntu 22.04 LTS Server

Hello NVIDIA community and support team,

I am encountering a recurring issue on my Ubuntu 22.04 LTS server: one of the NVIDIA GPUs falls off the bus, as indicated by Xid 79 errors. This has happened three times in the past three days.

System Environment:

  • Operating System: Ubuntu 22.04.4 LTS (x86_64)
  • Linux Kernel: 5.15.0-101-generic
  • NVIDIA Driver Version: 535.161.07 (installed via ubuntu-drivers autoinstall)
  • Kubernetes Version: v1.28.6 (set up with kubespray v2.24.1)
  • K8s NVIDIA Device Plugin: NVIDIA/k8s-device-plugin 0.14.5 (deployed via Helm chart)

GPU Usage:
The GPUs are employed by Nimble project miners, which perform computations as detailed on nimble.technology. The mining scripts in use are available at this GitHub repository.

Error Messages Encountered:

root@node1:~# nvidia-smi 
Unable to determine the device handle for GPU0000:A1:00.0: Unknown Error

and from the dmesg output:

root@node1:~# uptime 
 08:38:19 up 15:34,  1 user,  load average: 6.49, 7.16, 7.32
root@node1:~# dmesg -T | grep NVRM
[Mon Apr  8 17:04:05 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.161.07  Sat Feb 17 22:55:48 UTC 2024
[Tue Apr  9 01:21:38 2024] NVRM: GPU at PCI:0000:a1:00: GPU-979426f2-893a-7cbb-c4cf-81472f89a462
[Tue Apr  9 01:21:38 2024] NVRM: Xid (PCI:0000:a1:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Tue Apr  9 01:21:38 2024] NVRM: GPU 0000:a1:00.0: GPU has fallen off the bus.
[Tue Apr  9 01:21:38 2024] NVRM: A GPU crash dump has been created. If possible, please run
                           NVRM: nvidia-bug-report.sh as root to collect this data before
                           NVRM: the NVIDIA kernel module is unloaded.

After a reboot, nvidia-smi reports normally until the Xid 79 issue recurs:

root@node1:~# nvidia-smi | head -4
Tue Apr  9 09:13:25 2024       
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |

root@node1:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-bd1622d9-a72a-fe8a-20d9-f3a7304619e2)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-5cfcf04e-39e4-282b-7ebe-871356efad24)
GPU 2: NVIDIA GeForce RTX 4090 (UUID: GPU-e70a3ebe-3657-8c2e-27b1-4c6aa1692e1d)
GPU 3: NVIDIA GeForce RTX 4090 (UUID: GPU-69f7831c-980f-8343-5b34-9b2968469835)
GPU 4: NVIDIA GeForce RTX 4090 (UUID: GPU-71f7bd7b-7f03-0c44-f12b-ae3431838e80)
GPU 5: NVIDIA GeForce RTX 4090 (UUID: GPU-979426f2-893a-7cbb-c4cf-81472f89a462)
GPU 6: NVIDIA GeForce RTX 4090 (UUID: GPU-6ed8c811-d1e6-def4-65b0-3998d557c78c)
GPU 7: NVIDIA GeForce RTX 4090 (UUID: GPU-cd8321b3-b419-f85a-3e09-014eb899c8d5)

We have two more servers with the exact same configuration and GPUs that do not experience this issue.

Steps Taken:
I have generated a bug report using nvidia-bug-report.sh and I am including the compressed log file (nvidia-bug-report.log.gz) with this post.

No recent system changes correlate with this issue, and I have not yet identified a specific action or event that triggers it. The GPUs are under consistent load from the Nimble project miners when the error occurs.
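Since the crashes happen under load, I've also started sampling GPU telemetry once a minute so the last readings before a drop are preserved. A minimal sketch, assuming nvidia-smi is on PATH; the log path, cron cadence, and query fields are my choices, not anything prescribed by NVIDIA:

```shell
# log_gpu_csv: prefix each nvidia-smi CSV row with a UTC timestamp so
# samples can be lined up against the Xid timestamps in dmesg.
log_gpu_csv() {
    while IFS= read -r row; do
        printf '%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$row"
    done
}

# intended usage on the node (e.g. from a per-minute cron job):
# nvidia-smi --query-gpu=index,temperature.gpu,power.draw,pcie.link.gen.current \
#            --format=csv,noheader | log_gpu_csv >> /var/log/gpu-snap.csv
```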


I am aware that personal information may be included in the bug report and consent to its use for the purpose of troubleshooting this issue.

Any guidance or assistance you can provide would be greatly appreciated.

Thank you for your time and help.

node1.pdx.nb.akash.pub_nvidia-bug-report.log.1.gz (1.6 MB)
node1.pdx.nb.akash.pub_nvidia-bug-report.log.2.gz (1.5 MB)

Update: we have replaced the server (mainboard) and the GPUs.
I will post here again if we encounter any further NVIDIA-related issues.