Crash on RTX 6000 Ada on Ubuntu 24.04: "GPU has fallen off the bus"

I got a brand-new PNY RTX 6000 Ada for ML workloads, and it crashes randomly; I can’t figure out why. I’d like to work out whether I have a bad card or a bad configuration.

I ran sudo nvidia-bug-report.sh once before the bug was encountered (the “didnt-die-yet” log) and again after it occurred on two separate occasions. I’ve attached the output for all three runs:

nvidia-bug-report-2.log.gz (335.8 KB)
nvidia-bug-report-didnt-die-yet.log.gz (417.9 KB)
nvidia-bug-report.log.gz (315.3 KB)

As soon as the card arrived, I removed my RTX 3090 and upgraded from the nvidia-565 drivers to nvidia-570. I tried both the open and the non-open variant:

# First time
sudo apt install nvidia-driver-570-server-open xserver-xorg-video-nvidia-570-server

# Second time
sudo apt install --reinstall nvidia-driver-570-server xserver-xorg-video-nvidia-570-server
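
(Whenever I switched between the open and non-open variants I did a full removal and reinstall first. Roughly the following, written from memory, so treat it as a sketch rather than an exact transcript:)

# Purge everything NVIDIA-related before installing the other variant
sudo apt purge 'nvidia-*' 'libnvidia-*'
sudo apt autoremove
# Install the desired variant, rebuild the initramfs, and reboot
sudo apt install nvidia-driver-570-server xserver-xorg-video-nvidia-570-server
sudo update-initramfs -u
sudo reboot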

This seemed to work OK at first, and I was able to run nvidia-smi.

However, after a while my X server would freeze and become unresponsive. I could still SSH into the machine, but when I ran nvidia-smi there I would see:

Unable to determine the device handle for GPU0: 0000:08:00.0: Unknown Error
No devices were found

Then I tried journalctl -b | grep -i nvidia:

Mar 07 16:40:58 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (--) NVIDIA(GPU-0):
Mar 07 16:40:58 nr200ubuntu systemd[3730]: Started app-gnome-nvidia\x2dsettings\x2dautostart-4537.scope - Application launched by gnome-session-binary.
Mar 07 16:40:59 nr200ubuntu /usr/libexec/gdm-x-session[3053]: (II) NVIDIA(GPU-0): Deleting GPU-0
                                              E: Failed to fetch https://nvidia.github.io/libnvidia-container/stable/deb/amd64/InRelease
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Skipping resurvey of engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.17.1'. Reason: not selected
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Skipping resurvey of engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.15.3'. Reason: not selected
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Surveying selected engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.18.0'
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Survey for engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.18.0' took 277.61ms
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Best 'gguf' backend for detected to be 'llama.cpp-linux-x86_64-nvidia-cuda-avx2'
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Backend preferences file set for the first time: [{"model_format":"gguf","name":"llama.cpp-linux-x86_64-nvidia-cuda-avx2","version":"1.18.0"}]. Setting as last preferences for subscription.
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BundledDepsUnpackager] Finding latest version for runtime: llama.cpp-linux-x86_64-nvidia-cuda-avx2
Mar 07 16:43:00 nr200ubuntu Keybase[5400]: Warning: loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
Mar 07 16:45:35 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (EE) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
Mar 07 16:45:35 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (EE) NVIDIA(0):     recover...
                                    NVRM: nvidia-bug-report.sh as root to collect this data before
                                    NVRM: the NVIDIA kernel module is unloaded.
Mar 07 16:45:36 nr200ubuntu kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f

The kernel log says the device fell off the bus ???

  /var/log/kern.log:
2025-03-03T02:07:36.128450-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-03T02:09:03.861458-08:00 nr200ubuntu kernel: message repeated 9 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-03T22:01:03.941742-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  565.57.01  Thu Oct 10 12:29:05 UTC 2024
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.57.01  Thu Oct 10 12:02:00 UTC 2024
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-03T22:01:03.941766-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-03T22:01:10.723748-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-04T01:01:59.018107-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-04T01:01:59.169091-08:00 nr200ubuntu kernel: message repeated 11 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-04T11:27:45.800552-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
2025-03-04T11:27:45.800554-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  565.57.01  Thu Oct 10 12:29:05 UTC 2024
2025-03-04T11:27:45.800554-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.57.01  Thu Oct 10 12:02:00 UTC 2024
2025-03-04T11:27:45.800555-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-04T11:27:45.800568-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-04T11:27:52.737480-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-04T11:28:33.415495-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-04T11:28:33.576484-08:00 nr200ubuntu kernel: message repeated 11 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-07T16:06:24.966484-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-07T16:40:38.622890-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
2025-03-07T16:40:38.622942-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  570.86.15  Release Build  (dvs-builder@U16-I2-C03-12-4)  Thu Jan 23 22:50:36 UTC 2025
2025-03-07T16:40:38.622943-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  570.86.15  Release Build  (dvs-builder@U16-I2-C03-12-4)  Thu Jan 23 22:33:58 UTC 2025
2025-03-07T16:40:38.622947-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-07T16:40:38.622998-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-07T16:40:38.623000-08:00 nr200ubuntu kernel: fbcon: nvidia-drmdrmfb (fb0) is primary device
2025-03-07T16:40:38.623001-08:00 nr200ubuntu kernel: nvidia 0000:08:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
2025-03-07T16:40:38.623001-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-07T16:45:35.965762-08:00 nr200ubuntu kernel: NVRM: GPU at PCI:0000:08:00: GPU-1a0943be-1397-242a-9aa0-b8b66d01355c
2025-03-07T16:45:35.965777-08:00 nr200ubuntu kernel: NVRM: GPU Board Serial Number: 1795024031571
2025-03-07T16:45:35.965778-08:00 nr200ubuntu kernel: NVRM: Xid (PCI:0000:08:00): 79, GPU has fallen off the bus.
2025-03-07T16:45:35.965779-08:00 nr200ubuntu kernel: NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
2025-03-07T16:45:35.965780-08:00 nr200ubuntu kernel: NVRM: GPU 0000:08:00.0: GPU serial number is 1795024031571.
2025-03-07T16:45:35.965780-08:00 nr200ubuntu kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
2025-03-07T16:45:35.965782-08:00 nr200ubuntu kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
2025-03-07T16:45:35.965803-08:00 nr200ubuntu kernel: message repeated 38 times: [ NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!]
2025-03-07T16:45:35.965804-08:00 nr200ubuntu kernel: NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
2025-03-07T16:45:35.965807-08:00 nr200ubuntu kernel: NVRM: A GPU crash dump has been created. If possible, please run
2025-03-07T16:45:35.965808-08:00 nr200ubuntu kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
2025-03-07T16:45:35.965808-08:00 nr200ubuntu kernel: NVRM: the NVIDIA kernel module is unloaded.
2025-03-07T16:45:35.965809-08:00 nr200ubuntu kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from intrGetPendingStall_GM107(pGpu, pIntr, pEngines, pThreadState) @ intr_gp100.c:193 

For more debugging info: I’m on Ubuntu 24.04 with kernel 6.8.0-55-generic.

lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 24.04.2 LTS
Release: 24.04
Codename: noble
uname -r -a
Linux nr200ubuntu 6.8.0-55-generic #57-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 12 23:42:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Update: I ran the card with different power limits:

280 W
270 W
250 W

None of them improved the situation. Temperature at the time of failure never exceeded 85 C. I did a fresh driver removal and reinstallation for each variant (open and non-open). The fact that it works for a while and then dies seems like a hardware issue. I then load-tested my RTX 3090 and it seems totally fine, so there is a high probability this is a bad card. Requesting an RMA exchange.
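
For reference, the power caps were set with nvidia-smi, roughly like this:

# Enable persistence mode so the cap sticks while nothing has the driver open
sudo nvidia-smi -pm 1
# Set the board power limit in watts (repeated with 280, 270, and 250 on separate runs)
sudo nvidia-smi -i 0 -pl 280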

OK, while waiting for the RMA I tried some more stuff.

  1. Using the 550 drivers instead of 570. No dice.

  2. I saw that my temperature limits are weird. Negative numbers? What is happening here?

nvidia-smi -q -d TEMPERATURE


==============NVSMI LOG==============

Timestamp                                 : Sun Mar  9 16:54:33 2025
Driver Version                            : 550.144.03
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:08:00.0
    Temperature
        GPU Current Temp                  : 40 C
        GPU T.Limit Temp                  : 51 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 85 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A

More reason to believe this is a hardware issue.

I’m fairly certain this is not a thermal issue since the fans aren’t even blowing hard, the temps seem very sane, and I’ve been testing with a light workload.
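
For anyone who wants to watch the same thing on their own card, logging temperature and power over SSH is enough to see it; something along these lines:

# Sample temperature, power draw, and SM clock every 5 seconds into a CSV
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm \
           --format=csv -l 5 > gpu-temp-log.csv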

I don’t think it’s a memory issue either, for what it’s worth: I downloaded and ran a VRAM tester over SSH.
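
For anyone who wants to run a similar check headless, one option (not necessarily the exact tool I used) is the open-source gpu-burn stress test, which exercises both compute and most of the VRAM:

# Build and run gpu-burn (https://github.com/wilicc/gpu-burn); requires the CUDA toolkit (nvcc)
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make
# The argument is the test duration in seconds
./gpu_burn 300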

I also did some more environment ablations: I purged and reinstalled xorg, blacklisted nouveau more aggressively and uninstalled every library mentioning nouveau, and tried both the 550 and 570 drivers.
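
The nouveau blacklisting was the usual modprobe config, something like:

# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# rebuild the initramfs so the blacklist takes effect at boot
sudo update-initramfs -u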

One more detail: with the RTX 3090 installed I can see my GRUB boot screen. With the RTX 6000 I cannot, no matter what I try. I added all kinds of options to /etc/default/grub, including forcing a console, enabling nvidia modeset, nomodeset, and different resolutions (always updating the initramfs and running update-grub afterwards). On the 3090 the boot screen always appears; on the 6000 it never does.
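
To give a concrete idea, these are the kinds of lines I cycled through in /etc/default/grub (one change at a time, not all at once; exact values varied):

# force a text console and a fixed resolution for the boot menu
GRUB_TERMINAL=console
GRUB_GFXMODE=1024x768

# kernel command line variations, tried one at a time
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvidia-drm.modeset=1"
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"

# applied after every change
sudo update-grub
sudo update-initramfs -u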

I updated my BIOS in case there was a motherboard incompatibility. I’m running an Asus ROG Strix B550-I Gaming motherboard with a Corsair SF750 PSU. The instability doesn’t seem to correlate with power draw.

Hi,
I’ll add my case: it has been doing this for a year or so, with varying frequency.
OLD:
Aug 3 21:19:00 kernel: [3411728.186124] NVRM: Xid (PCI:0000:09:00): 79, pid=177, GPU has fallen off the bus.
Aug 3 21:19:00 v kernel: [3411728.186130] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.

Today ;-)
NVRM: Xid (PCI:0000:2f:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
2025-03-13T12:38:19.351812-04:00 kernel: NVRM: GPU 0000:2f:00.0: GPU has fallen off the bus.
2025-03-13T12:38:19.351829-04:00 kernel: NVRM: GPU 0000:2f:00.0: GPU serial number is ******.

6.8.0-55-lowlatency #57.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 19 11:28:33 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Thu Mar 13 16:15:02 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro T2000                   Off |   00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8              3W /   60W |      10MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:2F:00.0  On |                    0 |
| 30%   43C    P5             38W /  300W |    7856MiB /  46068MiB |     30%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

The hardware is connected via Thunderbolt in an external GPU enclosure.

==============NVSMI LOG==============

Timestamp                                 : Thu Mar 13 16:15:52 2025
Driver Version                            : 550.120
CUDA Version                              : 12.4

Attached GPUs                             : 2
GPU 00000000:01:00.0
    Temperature
        GPU Current Temp                  : 50 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 93 C
        GPU Max Operating Temp            : 102 C
        GPU Target Temperature            : 87 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:2F:00.0
    Temperature
        GPU Current Temp                  : 43 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

Can someone at NVIDIA please give us a method of diagnosing this? The machine doesn’t always lock up hard (X11 crashes, but sometimes access is still available via SSH), so if there are commands that can be run before a reboot, we can collect more data…
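
For now, the obvious things I can think of to grab over SSH before rebooting are along these lines:

# generate the full NVIDIA bug report before the kernel module is unloaded / the box is rebooted
sudo nvidia-bug-report.sh

# save kernel messages around the Xid, plus the PCIe state of the GPU
sudo dmesg | grep -iE 'xid|nvrm|pcie' > ~/xid-dmesg.txt
sudo journalctl -b -k > ~/kernel-journal.txt
sudo lspci -vvv -s 2f:00.0 > ~/lspci-gpu.txt
nvidia-smi -q > ~/nvidia-smi-q.txt 2>&1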

I have similar symptoms; I can still access my machine through SSH. I have an external GPU dock that I can try with a clean Ubuntu install to see whether I get the same problem; I’ll post results when I have them. If there’s no problem there, it could be that there is a bad interaction somewhere between the NVIDIA drivers and X11…

My RMA is taking time ^^’

I got mine from Lenovo - what are the RMA terms?

Does NVIDIA have a firmware fix or proper diagnostic?

I’m not sure what the RMA terms are; I gave them this thread as proof that I did my homework. I ordered from CDW.com.

I haven’t found any diagnostics from NVIDIA, and if it’s a hardware issue there certainly won’t be a software fix.

Thanks for the info. I bought mine in May 2022, so it would be nice if NVIDIA would weigh in on such an expensive device!!!