Hello NVIDIA community and support team,
I am encountering a recurring issue on my Ubuntu 22.04 LTS server where an NVIDIA GPU falls off the bus, as indicated by Xid 79 errors. This has happened three times in the past three days.
System Environment:
- Operating System: Ubuntu 22.04.4 LTS (x86_64)
- Linux Kernel: 5.15.0-101-generic
- NVIDIA Driver Version: 535.161.07 (installed via ubuntu-drivers autoinstall)
- Kubernetes Version: v1.28.6 (set up with kubespray v2.24.1)
- K8s NVIDIA Device Plugin: NVIDIA/k8s-device-plugin 0.14.5 (delivered via Helm chart)
GPU Usage:
The GPUs are used by Nimble project miners, which perform computations as described on nimble.technology. The mining scripts in use are available in this GitHub repository.
Error Messages Encountered:
root@node1:~# nvidia-smi
Unable to determine the device handle for GPU0000:A1:00.0: Unknown Error
and from the dmesg output:
root@node1:~# uptime
08:38:19 up 15:34, 1 user, load average: 6.49, 7.16, 7.32
root@node1:~# dmesg -T | grep NVRM
[Mon Apr 8 17:04:05 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.161.07 Sat Feb 17 22:55:48 UTC 2024
[Tue Apr 9 01:21:38 2024] NVRM: GPU at PCI:0000:a1:00: GPU-979426f2-893a-7cbb-c4cf-81472f89a462
[Tue Apr 9 01:21:38 2024] NVRM: Xid (PCI:0000:a1:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Tue Apr 9 01:21:38 2024] NVRM: GPU 0000:a1:00.0: GPU has fallen off the bus.
[Tue Apr 9 01:21:38 2024] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
After a reboot, nvidia-smi reports normally until the Xid 79 issue crops up again:
root@node1:~# nvidia-smi | head -4
Tue Apr 9 09:13:25 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
root@node1:~# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-bd1622d9-a72a-fe8a-20d9-f3a7304619e2)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-5cfcf04e-39e4-282b-7ebe-871356efad24)
GPU 2: NVIDIA GeForce RTX 4090 (UUID: GPU-e70a3ebe-3657-8c2e-27b1-4c6aa1692e1d)
GPU 3: NVIDIA GeForce RTX 4090 (UUID: GPU-69f7831c-980f-8343-5b34-9b2968469835)
GPU 4: NVIDIA GeForce RTX 4090 (UUID: GPU-71f7bd7b-7f03-0c44-f12b-ae3431838e80)
GPU 5: NVIDIA GeForce RTX 4090 (UUID: GPU-979426f2-893a-7cbb-c4cf-81472f89a462)
GPU 6: NVIDIA GeForce RTX 4090 (UUID: GPU-6ed8c811-d1e6-def4-65b0-3998d557c78c)
GPU 7: NVIDIA GeForce RTX 4090 (UUID: GPU-cd8321b3-b419-f85a-3e09-014eb899c8d5)
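To narrow down which GPU drops first and what its state looked like just beforehand, I put together the following monitoring sketch. It is my own script, not part of any NVIDIA tooling; the log path and 60-second interval are assumptions to adjust as needed.

```shell
#!/bin/bash
# Hypothetical monitoring sketch: snapshot GPU state every 60 s so the
# last healthy reading before an Xid 79 event is preserved in the log.
LOG="${LOG:-/var/log/gpu-health.log}"   # assumed path; change as needed

monitor_gpus() {
    while true; do
        {
            date
            # PCIe link gen/width can degrade shortly before a bus drop,
            # so capture them alongside temperature and power draw.
            nvidia-smi \
                --query-gpu=index,temperature.gpu,power.draw,pcie.link.gen.current,pcie.link.width.current \
                --format=csv,noheader \
                || echo "nvidia-smi failed (possible bus drop)"
        } >> "$LOG" 2>&1
        sleep 60
    done
}
```

Running `monitor_gpus &` in the background (or under systemd) leaves a per-minute record, so when a GPU falls off the bus the last entries show whether it was running hot, drawing unusual power, or had already renegotiated its PCIe link.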
Additionally, we have two more servers with the exact same configuration and GPUs that do not experience this issue.
Steps Taken:
I have generated a bug report using nvidia-bug-report.sh and am attaching the compressed log files to this post.
No recent system changes were made that could be directly linked to this issue. I have not yet been able to identify a specific action or event that triggers this error. The GPUs are under consistent load from the Nimble project miners when the issue occurs.
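Since I cannot yet correlate the crashes with a trigger, I am also watching the kernel log for Xid events as they happen. The helper below is my own sketch (not from the NVIDIA tooling): it pulls the PCI address and Xid code out of an NVRM log line, e.g. for alerting from a `dmesg -wT` or `journalctl -kf` stream.

```shell
# Hypothetical helper: extract "<pci-address> <xid-code>" from an NVRM line.
parse_xid() {
    # Input: one dmesg/journal line; output: PCI address and Xid code,
    # or nothing if the line is not an Xid event.
    echo "$1" | sed -n 's/.*Xid (PCI:\([0-9a-fA-F:]*\)): \([0-9]*\),.*/\1 \2/p'
}

parse_xid "NVRM: Xid (PCI:0000:a1:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus."
# prints: 0000:a1:00 79
```

Piped into a loop such as `dmesg -wT | while read -r line; do parse_xid "$line"; done`, this gives an immediate timestamped record of which GPU faulted and with which Xid code.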
Please see the attached nvidia-bug-report log files for detailed system information and logs.
I am aware that personal information may be included in the bug report and consent to its use for the purpose of troubleshooting this issue.
Any guidance or assistance you can provide would be greatly appreciated.
Thank you for your time and help.
node1.pdx.nb.akash.pub_nvidia-bug-report.log.1.gz (1.6 MB)
node1.pdx.nb.akash.pub_nvidia-bug-report.log.2.gz (1.5 MB)