Crash on RTX 6000 Ada on Ubuntu 24.04: "GPU has fallen off the bus"

I got a brand-new PNY RTX 6000 Ada for ML workloads, and it crashes randomly; I can't figure out why. I want to know whether I have a bad card or a bad configuration.

I ran sudo nvidia-bug-report.sh before the bug occurred (“didnt-die-yet”) and after it occurred on two separate occasions. I’ve attached the output for all three runs:

nvidia-bug-report-2.log.gz (335.8 KB)
nvidia-bug-report-didnt-die-yet.log.gz (417.9 KB)
nvidia-bug-report.log.gz (315.3 KB)

As soon as I got it, I removed my RTX 3090 and upgraded from nvidia-565 drivers to nvidia-570. I tried both the open and non-open variants.

# First time
sudo apt install nvidia-driver-570-server-open xserver-xorg-video-nvidia-570-server

# Second time
sudo apt install --reinstall nvidia-driver-570-server xserver-xorg-video-nvidia-570-server

This seemed to work OK at first, and I was able to run nvidia-smi.

However, after a while my X server would freeze and become unresponsive. I could still SSH into the machine, and when I tried nvidia-smi I would see:

Unable to determine the device handle for GPU0: 0000:08:00.0: Unknown Error
No devices were found

Then I tried journalctl -b | grep -i nvidia:

Mar 07 16:40:58 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (--) NVIDIA(GPU-0):
Mar 07 16:40:58 nr200ubuntu systemd[3730]: Started app-gnome-nvidia\x2dsettings\x2dautostart-4537.scope - Application launched by gnome-session-binary.
Mar 07 16:40:59 nr200ubuntu /usr/libexec/gdm-x-session[3053]: (II) NVIDIA(GPU-0): Deleting GPU-0
                                              E: Failed to fetch https://nvidia.github.io/libnvidia-container/stable/deb/amd64/InRelease
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Skipping resurvey of engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.17.1'. Reason: not selected
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Skipping resurvey of engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.15.3'. Reason: not selected
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Surveying selected engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.18.0'
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Survey for engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.18.0' took 277.61ms
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Best 'gguf' backend for detected to be 'llama.cpp-linux-x86_64-nvidia-cuda-avx2'
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Backend preferences file set for the first time: [{"model_format":"gguf","name":"llama.cpp-linux-x86_64-nvidia-cuda-avx2","version":"1.18.0"}]. Setting as last preferences for subscription.
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BundledDepsUnpackager] Finding latest version for runtime: llama.cpp-linux-x86_64-nvidia-cuda-avx2
Mar 07 16:43:00 nr200ubuntu Keybase[5400]: Warning: loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
Mar 07 16:45:35 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (EE) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
Mar 07 16:45:35 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (EE) NVIDIA(0):     recover...
                                    NVRM: nvidia-bug-report.sh as root to collect this data before
                                    NVRM: the NVIDIA kernel module is unloaded.
Mar 07 16:45:36 nr200ubuntu kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f

The kernel log says the device has fallen off the bus?

/var/log/kern.log:
2025-03-03T02:07:36.128450-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-03T02:09:03.861458-08:00 nr200ubuntu kernel: message repeated 9 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-03T22:01:03.941742-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  565.57.01  Thu Oct 10 12:29:05 UTC 2024
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.57.01  Thu Oct 10 12:02:00 UTC 2024
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-03T22:01:03.941766-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-03T22:01:10.723748-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-04T01:01:59.018107-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-04T01:01:59.169091-08:00 nr200ubuntu kernel: message repeated 11 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-04T11:27:45.800552-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
2025-03-04T11:27:45.800554-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  565.57.01  Thu Oct 10 12:29:05 UTC 2024
2025-03-04T11:27:45.800554-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.57.01  Thu Oct 10 12:02:00 UTC 2024
2025-03-04T11:27:45.800555-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-04T11:27:45.800568-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-04T11:27:52.737480-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-04T11:28:33.415495-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-04T11:28:33.576484-08:00 nr200ubuntu kernel: message repeated 11 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-07T16:06:24.966484-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-07T16:40:38.622890-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
2025-03-07T16:40:38.622942-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  570.86.15  Release Build  (dvs-builder@U16-I2-C03-12-4)  Thu Jan 23 22:50:36 UTC 2025
2025-03-07T16:40:38.622943-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  570.86.15  Release Build  (dvs-builder@U16-I2-C03-12-4)  Thu Jan 23 22:33:58 UTC 2025
2025-03-07T16:40:38.622947-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-07T16:40:38.622998-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-07T16:40:38.623000-08:00 nr200ubuntu kernel: fbcon: nvidia-drmdrmfb (fb0) is primary device
2025-03-07T16:40:38.623001-08:00 nr200ubuntu kernel: nvidia 0000:08:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
2025-03-07T16:40:38.623001-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-07T16:45:35.965762-08:00 nr200ubuntu kernel: NVRM: GPU at PCI:0000:08:00: GPU-1a0943be-1397-242a-9aa0-b8b66d01355c
2025-03-07T16:45:35.965777-08:00 nr200ubuntu kernel: NVRM: GPU Board Serial Number: 1795024031571
2025-03-07T16:45:35.965778-08:00 nr200ubuntu kernel: NVRM: Xid (PCI:0000:08:00): 79, GPU has fallen off the bus.
2025-03-07T16:45:35.965779-08:00 nr200ubuntu kernel: NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
2025-03-07T16:45:35.965780-08:00 nr200ubuntu kernel: NVRM: GPU 0000:08:00.0: GPU serial number is 1795024031571.
2025-03-07T16:45:35.965780-08:00 nr200ubuntu kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
2025-03-07T16:45:35.965782-08:00 nr200ubuntu kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
2025-03-07T16:45:35.965803-08:00 nr200ubuntu kernel: message repeated 38 times: [ NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!]
2025-03-07T16:45:35.965804-08:00 nr200ubuntu kernel: NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
2025-03-07T16:45:35.965807-08:00 nr200ubuntu kernel: NVRM: A GPU crash dump has been created. If possible, please run
2025-03-07T16:45:35.965808-08:00 nr200ubuntu kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
2025-03-07T16:45:35.965808-08:00 nr200ubuntu kernel: NVRM: the NVIDIA kernel module is unloaded.
2025-03-07T16:45:35.965809-08:00 nr200ubuntu kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from intrGetPendingStall_GM107(pGpu, pIntr, pEngines, pThreadState) @ intr_gp100.c:193 

Some more debugging info: I’m on Ubuntu 24.04 with kernel 6.8.0-55-generic.

lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 24.04.2 LTS
Release: 24.04
Codename: noble
uname -r -a
Linux nr200ubuntu 6.8.0-55-generic #57-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 12 23:42:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Update: I ran the card with different power limits:

280 W
270 W
250 W

None of them improved the situation. The temperature at failure never exceeded 85 C. I did a fresh driver removal and reinstall for each variant (open and non-open). The fact that it works for a while and then dies points to a hardware issue. I then load-tested my RTX 3090 and it seems totally fine, so there’s a high probability this is a bad card. I’m requesting an RMA exchange.
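
For reference, this is roughly how I applied the power caps (a sketch; persistence mode keeps the limit from resetting, and GPU index 0 is assumed since it’s the only card installed):

# Apply a power cap (values I tried: 280 / 270 / 250 W)
sudo nvidia-smi -pm 1                       # enable persistence mode so the setting sticks
sudo nvidia-smi -i 0 -pl 250                # set the power limit in watts
nvidia-smi -q -d POWER | grep -i limit      # confirm the enforced limit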

OK, while waiting for the RMA I tried some more things.

  1. Using 550 drivers instead of 570. No dice.

  2. I saw that my temperature limits are weird. Negative numbers? What is going on?

nvidia-smi -q -d TEMPERATURE


==============NVSMI LOG==============

Timestamp                                 : Sun Mar  9 16:54:33 2025
Driver Version                            : 550.144.03
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:08:00.0
    Temperature
        GPU Current Temp                  : 40 C
        GPU T.Limit Temp                  : 51 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 85 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A

More reason to believe this is a hardware issue.

I’m fairly certain this is not a thermal issue since the fans aren’t even blowing hard, the temps seem very sane, and I’ve been testing with a light workload.

I don’t think it’s a memory issue, for what it’s worth; I downloaded and ran a VRAM tester over SSH.
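
It was something along these lines, run from an SSH session (a sketch using the open-source gpu-burn project as an example, not necessarily the exact tool; the runtime and memory fraction are illustrative):

# Sketch: stress GPU compute plus most of the VRAM over SSH (needs the CUDA toolkit to build)
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make
./gpu_burn -m 95% 300                       # use ~95% of GPU memory, run for 300 seconds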

I also did some more environment ablations: I purged and reinstalled xorg, blacklisted nouveau harder and uninstalled any library with nouveau in it, and tried both the 550 and 570 drivers.
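
The nouveau blacklisting was the usual modprobe approach, roughly:

# Standard nouveau blacklist (the file name is arbitrary)
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u                    # rebuild the initramfs so the blacklist applies at boot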

One more detail: on my RTX 3090 I can see the GRUB boot screen. On the RTX 6000 I cannot see the GRUB boot screen no matter what I try. I added all kinds of things to /etc/default/grub, including forcing the console, nvidia modeset, nomodeset, and different resolutions (always updating the initramfs and running update-grub). On the 3090 it always works; on the 6000 it does not.
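
For example, these are the kinds of /etc/default/grub variations I cycled through, one at a time (the exact values are illustrative):

# In /etc/default/grub, one of:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvidia-drm.modeset=1"
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"
#   GRUB_TERMINAL=console
#   GRUB_GFXMODE=1024x768
# Then always:
sudo update-grub
sudo update-initramfs -u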

I updated my BIOS in case there was a motherboard incompatibility. I’m running an ASUS ROG Strix B550-I Gaming motherboard with a Corsair SF750 PSU. The instability doesn’t seem to correlate with power draw.

Hi,
I’ll add my case; this has been happening for a year or so, with varying frequency:
OLD:
Aug 3 21:19:00 kernel: [3411728.186124] NVRM: Xid (PCI:0000:09:00): 79, pid=177, GPU has fallen off the bus.
Aug 3 21:19:00 v kernel: [3411728.186130] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.

Today ;-)
(PCI:0000:2f:00): 79, pid='', name=, GPU has fallen off the bus.
2025-03-13T12:38:19.351812-04:00 kernel: NVRM: GPU 0000:2f:00.0: GPU has fallen off the bus.
2025-03-13T12:38:19.351829-04:00 kernel: NVRM: GPU 0000:2f:00.0: GPU serial number is ******.

6.8.0-55-lowlatency #57.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 19 11:28:33 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Thu Mar 13 16:15:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro T2000                   Off |   00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8              3W /   60W |      10MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:2F:00.0  On |                    0 |
| 30%   43C    P5             38W /  300W |    7856MiB /  46068MiB |     30%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

The hardware is connected via Thunderbolt in an external GPU enclosure.
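
Since this is an eGPU, two things that seem worth checking after a crash (2f:00.0 is my card’s bus ID; adjust as needed):

boltctl list                                            # Thunderbolt authorization status of the enclosure
sudo lspci -vvv -s 2f:00.0 | grep -iE 'lnkcap|lnksta'   # negotiated PCIe link speed/width of the GPU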

==============NVSMI LOG==============

Timestamp                                 : Thu Mar 13 16:15:52 2025
Driver Version                            : 550.120
CUDA Version                              : 12.4

Attached GPUs                             : 2
GPU 00000000:01:00.0
    Temperature
        GPU Current Temp                  : 50 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 93 C
        GPU Max Operating Temp            : 102 C
        GPU Target Temperature            : 87 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:2F:00.0
    Temperature
        GPU Current Temp                  : 43 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

Can someone at NVIDIA please give us a method of diagnosing this? The machine doesn’t always lock up hard (X11 crashes, but sometimes access is still available via SSH), so if there are commands that can be run before a reboot, we can collect more data.
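
In the meantime, here is roughly what I try to grab over SSH before rebooting (nothing official, just standard collection; the output paths are arbitrary):

sudo nvidia-bug-report.sh                                            # NVIDIA's collection script, run before the module unloads
sudo dmesg -T | grep -iE 'nvrm|xid|pcieport|aer' > xid-dmesg.txt     # Xid and PCIe/AER messages
journalctl -b -k > kernel-journal.txt                                # full kernel log for this boot
nvidia-smi -q > nvidia-smi-q.txt 2>&1                                # usually errors out after Xid 79, but the error text is still useful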

I have similar symptoms; I can access my machine through SSH. I have an external GPU dock I can try with a clean Ubuntu install to see if I get the same problem; I will post results when I have them. If there’s no problem, it could be that there’s a bad interaction somewhere between the NVIDIA drivers and X11.

My RMA is taking time ^^’

I got mine from Lenovo; what are the RMA terms?

Does NVIDIA have a firmware fix or proper diagnostic?

I’m not sure what the RMA terms are; I gave them this thread as proof that I did my homework. I ordered from CDW.com.

I haven’t found any diagnostics from NVIDIA, and if it’s a hardware issue there certainly won’t be a software fix.

Thanks for the info. I bought mine in May 2022, so it would be nice if NVIDIA would weigh in on such an expensive device!

OK, I had another crash and the opportunity to get the debug log. I rebooted and added irqpoll to the kernel command line in case that makes any difference.

nvidia-bug-report.log-24MAY2025.gz (802.7 KB)

I’m having the same issue on a server with 8 GPUs (RTX 6000 Ada): one of them fails with the same error (Xid 79, GPU has fallen off the bus) for no apparent reason. Some details:

  • The issue has occurred many times for about 1 year, I’ve collected logs for almost all cases (I attach one below)
  • The GPU that fails never surpasses 75°C
  • I’ve reduced the power limit to 250 W to avoid temperature issues
  • Most of the times it has failed, there was no process running on the GPU
  • Rebooting the server fixes the issue
  • I’m using Ubuntu 22.04.5
  • Drivers I’ve tried without success: 535, 550 and 570
  • A few months ago I upgraded the BIOS, but the problem persisted
  • For a long time it was the same GPU that failed, so I suspected there was something wrong with that GPU. However, I recently connected the 8th GPU (previously there were only 7 GPUs), and now the GPU that fails is a different one (but not the newly connected one). I’m still investigating, so I’m not sure whether it’s a problem with a specific GPU or not (see the snippet after this list).
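
To keep track of which physical card is the one throwing Xid 79, I now record the bus-ID-to-serial mapping while the system is healthy; a minimal sketch using standard nvidia-smi query fields:

# Map GPU index -> PCI bus ID -> board serial so the bus ID in the Xid message
# can be matched to a physical card after a failure
nvidia-smi --query-gpu=index,pci.bus_id,serial,name --format=csv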

I attach the nvidia bug report of a recent failure:

nvidia-bug-report-2025-06-06_09-36-24.log.gz (1.6 MB)


So it seems like either the driver or your motherboard. Do you have the latest BIOS/UEFI for it? You could also try shuffling the GPUs between PCIe slots.

I upgraded to the latest BIOS a few months ago, and I’ve also tried shuffling some GPUs between slots, but the problem persists.