Crash on RTX 6000 Ada on Ubuntu 24.04 "GPU has fallen off the bus"

sam226 · March 8, 2025, 5:21pm

I got a brand new PNY RTX 6000 Ada for ML workloads, and it seems to be crashing randomly and I can’t quite figure out why. I want to know if I have a bad card or maybe a bad configuration.

I ran sudo nvidia-bug-report.sh before the bug was encountered (“didnt-die-yet”) and after it occurred on two separate occasions. I’ve attached the output here for all 3 runs:

nvidia-bug-report-2.log.gz (335.8 KB)
nvidia-bug-report-didnt-die-yet.log.gz (417.9 KB)
nvidia-bug-report.log.gz (315.3 KB)

As soon as I got it, I removed my RTX 3090 and upgraded from nvidia-565 drivers to nvidia-570. I tried both the open and non-open variants.

# First time
sudo apt install nvidia-driver-570-server-open xserver-xorg-video-nvidia-570-server

# second time
sudo apt install --reinstall nvidia-driver-570-server xserver-xorg-video-nvidia-570-server

This seemed to work OK at first, and I was able to run nvidia-smi.

However, after a while my X server would freeze or become unresponsive. I could still SSH into the machine, and when I ran nvidia-smi I would see:

Unable to determine the device handle for GPU0: 0000:08:00.0: Unknown Error
No devices were found

Then I tried journalctl -b | grep -i nvidia:

Mar 07 16:40:58 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (--) NVIDIA(GPU-0):
Mar 07 16:40:58 nr200ubuntu systemd[3730]: Started app-gnome-nvidia\x2dsettings\x2dautostart-4537.scope - Application launched by gnome-session-binary.
Mar 07 16:40:59 nr200ubuntu /usr/libexec/gdm-x-session[3053]: (II) NVIDIA(GPU-0): Deleting GPU-0
                                              E: Failed to fetch https://nvidia.github.io/libnvidia-container/stable/deb/amd64/InRelease
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Skipping resurvey of engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.17.1'. Reason: not selected
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Skipping resurvey of engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.15.3'. Reason: not selected
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Surveying selected engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.18.0'
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Survey for engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.18.0' took 277.61ms
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Best 'gguf' backend for detected to be 'llama.cpp-linux-x86_64-nvidia-cuda-avx2'
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Backend preferences file set for the first time: [{"model_format":"gguf","name":"llama.cpp-linux-x86_64-nvidia-cuda-avx2","version":"1.18.0"}]. Setting as last preferences for subscription.
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BundledDepsUnpackager] Finding latest version for runtime: llama.cpp-linux-x86_64-nvidia-cuda-avx2
Mar 07 16:43:00 nr200ubuntu Keybase[5400]: Warning: loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
Mar 07 16:45:35 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (EE) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
Mar 07 16:45:35 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (EE) NVIDIA(0):     recover...
                                    NVRM: nvidia-bug-report.sh as root to collect this data before
                                    NVRM: the NVIDIA kernel module is unloaded.
Mar 07 16:45:36 nr200ubuntu kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f

The kernel log says the device fell off the bus ???

  /var/log/kern.log:
2025-03-03T02:07:36.128450-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-03T02:09:03.861458-08:00 nr200ubuntu kernel: message repeated 9 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-03T22:01:03.941742-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  565.57.01  Thu Oct 10 12:29:05 UTC 2024
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.57.01  Thu Oct 10 12:02:00 UTC 2024
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-03T22:01:03.941766-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-03T22:01:10.723748-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-04T01:01:59.018107-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-04T01:01:59.169091-08:00 nr200ubuntu kernel: message repeated 11 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-04T11:27:45.800552-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
2025-03-04T11:27:45.800554-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  565.57.01  Thu Oct 10 12:29:05 UTC 2024
2025-03-04T11:27:45.800554-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  565.57.01  Thu Oct 10 12:02:00 UTC 2024
2025-03-04T11:27:45.800555-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-04T11:27:45.800568-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-04T11:27:52.737480-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-04T11:28:33.415495-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-04T11:28:33.576484-08:00 nr200ubuntu kernel: message repeated 11 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-07T16:06:24.966484-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-07T16:40:38.622890-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
2025-03-07T16:40:38.622942-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  570.86.15  Release Build  (dvs-builder@U16-I2-C03-12-4)  Thu Jan 23 22:50:36 UTC 2025
2025-03-07T16:40:38.622943-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  570.86.15  Release Build  (dvs-builder@U16-I2-C03-12-4)  Thu Jan 23 22:33:58 UTC 2025
2025-03-07T16:40:38.622947-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-07T16:40:38.622998-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-07T16:40:38.623000-08:00 nr200ubuntu kernel: fbcon: nvidia-drmdrmfb (fb0) is primary device
2025-03-07T16:40:38.623001-08:00 nr200ubuntu kernel: nvidia 0000:08:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
2025-03-07T16:40:38.623001-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-07T16:45:35.965762-08:00 nr200ubuntu kernel: NVRM: GPU at PCI:0000:08:00: GPU-1a0943be-1397-242a-9aa0-b8b66d01355c
2025-03-07T16:45:35.965777-08:00 nr200ubuntu kernel: NVRM: GPU Board Serial Number: 1795024031571
2025-03-07T16:45:35.965778-08:00 nr200ubuntu kernel: NVRM: Xid (PCI:0000:08:00): 79, GPU has fallen off the bus.
2025-03-07T16:45:35.965779-08:00 nr200ubuntu kernel: NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
2025-03-07T16:45:35.965780-08:00 nr200ubuntu kernel: NVRM: GPU 0000:08:00.0: GPU serial number is 1795024031571.
2025-03-07T16:45:35.965780-08:00 nr200ubuntu kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
2025-03-07T16:45:35.965782-08:00 nr200ubuntu kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
2025-03-07T16:45:35.965803-08:00 nr200ubuntu kernel: message repeated 38 times: [ NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!]
2025-03-07T16:45:35.965804-08:00 nr200ubuntu kernel: NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
2025-03-07T16:45:35.965807-08:00 nr200ubuntu kernel: NVRM: A GPU crash dump has been created. If possible, please run
2025-03-07T16:45:35.965808-08:00 nr200ubuntu kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
2025-03-07T16:45:35.965808-08:00 nr200ubuntu kernel: NVRM: the NVIDIA kernel module is unloaded.
2025-03-07T16:45:35.965809-08:00 nr200ubuntu kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from intrGetPendingStall_GM107(pGpu, pIntr, pEngines, pThreadState) @ intr_gp100.c:193
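
For anyone triaging a similar log, here is a minimal sketch (not an NVIDIA tool) that pulls Xid events out of kernel log lines. The regular expression is inferred only from the Xid lines quoted in this thread, so other driver versions may format them differently. The names `XID_RE`, `XidEvent` and `find_xid_events` are made up for illustration.

```python
import re
from typing import NamedTuple

# Matches lines such as
#   NVRM: Xid (PCI:0000:08:00): 79, GPU has fallen off the bus.
# Anything after the numeric code (pid=..., name=..., message) is kept verbatim.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<rest>.*)"
)


class XidEvent(NamedTuple):
    pci: str      # PCI address as printed by the driver
    code: int     # Xid number, e.g. 79
    message: str  # remainder of the line


def find_xid_events(lines):
    """Return every Xid event found in an iterable of log lines."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            events.append(XidEvent(m["pci"], int(m["code"]), m["rest"].strip()))
    return events
```

Feeding it the lines of /var/log/kern.log (or the output of journalctl -k) lists each event with its PCI address, which makes it easier to see whether the same device fails every time.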

For more debugging info, I’m on Ubuntu 24.04 with kernel 6.8.0-55-generic:

lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 24.04.2 LTS
Release: 24.04
Codename: noble

uname -r -a
Linux nr200ubuntu 6.8.0-55-generic #57-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 12 23:42:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

sam98 · March 8, 2025, 7:46pm

Update: I ran the card at several power limits:

280 W
270 W
250 W

None of them improved the situation. The temperature at failure did not exceed 85 °C. I did a fresh driver removal and reinstallation for each variant (open, non-open). That it works for a while and then dies suggests a hardware issue. I then load-tested my RTX 3090 and it seems totally fine, so there is a high probability this is a bad card. Requesting an RMA exchange.

sam226 · March 9, 2025, 11:58pm

OK, while waiting for the RMA I tried some more stuff.

Using 550 drivers instead of 570. No dice.
I saw that my temp limits are weird? Negative numbers? Wtf is happening?

nvidia-smi -q -d TEMPERATURE


==============NVSMI LOG==============

Timestamp                                 : Sun Mar  9 16:54:33 2025
Driver Version                            : 550.144.03
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:08:00.0
    Temperature
        GPU Current Temp                  : 40 C
        GPU T.Limit Temp                  : 51 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 85 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
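
A possible reading of the negative numbers, offered as an assumption rather than a confirmed answer: as far as I understand the nvidia-smi documentation, the "T.Limit" rows are margins in degrees below the maximum operating temperature, not absolute temperatures, so thresholds at or below zero would be expected. The sketch below parses the output above and computes the remaining headroom under that assumption. The function names are hypothetical, and the parser assumes a single GPU (with several GPUs the keys would collide).

```python
import re

ROW_RE = re.compile(r"\s*(?P<name>[^:]+?)\s*:\s*(?P<value>-?\d+) C\s*$")


def parse_temperature_section(text):
    """Map each 'Name : <int> C' row of `nvidia-smi -q -d TEMPERATURE` to an int.

    Rows reported as N/A, and header rows such as the timestamp, are skipped.
    """
    readings = {}
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            readings[m["name"]] = int(m["value"])
    return readings


def headroom(readings):
    """Degrees left before each T.Limit threshold is crossed.

    ASSUMPTION: 'GPU T.Limit Temp' is the current margin and the other
    '... T.Limit Temp' rows are the margin values at which each action triggers.
    """
    margin = readings["GPU T.Limit Temp"]
    return {
        name: margin - threshold
        for name, threshold in readings.items()
        if name.endswith("T.Limit Temp") and name != "GPU T.Limit Temp"
    }
```

With the values posted above (margin 51, shutdown threshold -7), this reports 58 degrees of headroom before shutdown, which would fit a card that is idling at 40 C rather than one that is overheating.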

sam226 · March 10, 2025, 7:55am

More reason to believe this is a hw issue.

I’m fairly certain this is not a thermal issue since the fans aren’t even blowing hard, the temps seem very sane, and I’ve been testing with a light workload.

I don’t think it’s a memory issue, for what it’s worth: I downloaded and ran a VRAM tester over SSH.

I also did some more environment ablations: I purged and reinstalled Xorg, blacklisted nouveau harder and uninstalled any lib with nouveau, and tried both the 550 and 570 drivers.

One more detail: on my RTX 3090, I can see my GRUB boot screen. On the RTX 6000, I cannot see the GRUB boot screen no matter what I try. I added all kinds of settings to my /etc/default/grub, including forcing console, using nvidia modeset, nomodeset, different resolutions, etc. (always updating initramfs and running update-grub). On the 3090, it always works. On the 6000, it does not.

I updated my BIOS in case there was a motherboard incompatibility. I’m running an Asus ROG Strix B550-i Gaming motherboard with a Corsair SF750 PSU. The instability doesn’t seem to correlate with power draw.

philathome · March 13, 2025, 8:19pm

Hi,
I’ll add my case. It has been happening for a year or so, with varying frequency:
OLD:
Aug 3 21:19:00 kernel: [3411728.186124] NVRM: Xid (PCI:0000:09:00): 79, pid=177, GPU has fallen off the bus.
Aug 3 21:19:00 v kernel: [3411728.186130] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.

Today ;-)
(PCI:0000:2f:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
2025-03-13T12:38:19.351812-04:00 kernel: NVRM: GPU 0000:2f:00.0: GPU has fallen off the bus.
2025-03-13T12:38:19.351829-04:00 kernel: NVRM: GPU 0000:2f:00.0: GPU serial number is ******.

6.8.0-55-lowlatency #57.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 19 11:28:33 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

The hardware is connected via Thunderbolt in an external GPU enclosure.

==============NVSMI LOG==============

Timestamp                                 : Thu Mar 13 16:15:52 2025
Driver Version                            : 550.120
CUDA Version                              : 12.4

Attached GPUs                             : 2
GPU 00000000:01:00.0
    Temperature
        GPU Current Temp                  : 50 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 93 C
        GPU Max Operating Temp            : 102 C
        GPU Target Temperature            : 87 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

GPU 00000000:2F:00.0
    Temperature
        GPU Current Temp                  : 43 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A

Can someone at NVIDIA please give a method of diagnosing this? The machine doesn’t always lock hard (X11 crashes, but sometimes access is still available via SSH), so if there are commands that can be run before a reboot we can collect more data…
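
Not an official NVIDIA procedure, but a minimal sketch of what could be captured over SSH before rebooting, assuming the standard dmesg, journalctl, lspci and nvidia-smi tools are on the PATH. The function names and output file names are hypothetical. Each command gets a timeout because nvidia-smi may hang once the GPU is lost. Running nvidia-bug-report.sh as root, as the driver's own log message suggests, is still worth doing separately.

```python
import subprocess
from pathlib import Path


def diagnostic_commands(pci_address):
    """Commands worth capturing after an Xid 79, keyed by output file name."""
    return {
        "dmesg.txt": ["dmesg", "--ctime"],
        "kernel-journal.txt": ["journalctl", "-k", "-b", "--no-pager"],
        "lspci.txt": ["lspci", "-vvv", "-s", pci_address],
        "nvidia-smi.txt": ["nvidia-smi", "-q"],
    }


def collect(pci_address, out_dir, timeout=60, run=subprocess.run):
    """Run each command, save its output, and return the exit codes.

    A command that times out or cannot be started is recorded as None,
    and the reason is written to its output file instead.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for filename, cmd in diagnostic_commands(pci_address).items():
        try:
            proc = run(cmd, capture_output=True, text=True, timeout=timeout)
            text = proc.stdout + proc.stderr
            results[filename] = proc.returncode
        except (subprocess.TimeoutExpired, OSError) as exc:
            text = f"{cmd[0]} did not complete: {exc}\n"
            results[filename] = None
        (out / filename).write_text(text)
    return results
```

For example, `collect("2f:00.0", "xid79-capture")` run as root would save the four outputs into a directory named xid79-capture. The lspci output is the interesting one for a bus drop, since it shows whether the device still enumerates and what the link status reports.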

sam226 · March 14, 2025, 3:05am

I have similar symptoms; I can still access my machine through SSH. I have an external GPU dock that I can try with a clean Ubuntu install to see if I get the same problem, and I will post results when I have them. If the problem doesn’t show up there, it could be that there’s a bad interaction somewhere between the NVIDIA drivers and X11…

My RMA is taking time ^^’

philathome · March 14, 2025, 5:54pm

I got mine from Lenovo - what are the RMA terms?

Does NVIDIA have a firmware fix or proper diagnostic?

sam226 · March 14, 2025, 6:31pm

I’m not sure what the RMA terms are, I gave them this thread as proof I did my homework. I ordered from CDW.com.

I haven’t found any diagnostics from nvidia and certainly if it’s a hardware issue there won’t be a fix.

philathome · March 14, 2025, 6:47pm

Thanks for the info. I bought in May 2022, so it would be nice if Nvidia would weigh in on such an expensive device!!!

philathome · May 24, 2025, 6:08pm

OK, I had another crash and the opportunity to get the debug log. I rebooted and added irqpoll in case that makes any difference…

nvidia-bug-report.log-24MAY2025.gz (802.7 KB)

pdpino · July 14, 2025, 6:37pm

I’m having the same issue on a server with 8 GPUs (RTX 6000 Ada), where one of them fails with the same error (Xid 79, GPU has fallen off the bus) for no apparent reason. Some details:

The issue has occurred many times for about 1 year, I’ve collected logs for almost all cases (I attach one below)
The GPU that fails never surpasses 75°C
I’ve reduced the power to 250W, to avoid temperature issues
Most of the times that it has failed, there was no process running on the GPU
Rebooting the server fixes the issue
I’m using Ubuntu 22.04.5
Drivers I’ve tried without success: 535, 550 and 570
A few months ago I upgraded the BIOS, but the problem persisted
For a long time it was the same GPU that failed, so I suspected there was something wrong with that GPU. However, recently I connected the 8th GPU (previously there were only 7 GPUs), and now the GPU that fails is a different one from before (but not the newly connected one). I’m still investigating this, so I’m not sure if it’s a problem with a specific GPU or not.

I attach the nvidia bug report of a recent failure:

nvidia-bug-report-2025-06-06_09-36-24.log.gz (1.6 MB)

morgwai666 · July 14, 2025, 7:01pm

So seems like either the driver or your mobo… Do you have the latest BIOS/UEFI for it? Also you can try shuffling the GPUs between the PCIe slots.

pdpino · July 15, 2025, 5:57pm

I upgraded to the latest BIOS a few months ago and I’ve also tried shuffling some GPUs between slots, but the problem persists
