I got a brand new PNY RTX 6000 Ada for ML workloads, but it crashes at random and I can’t figure out why. I want to know whether I have a bad card or a bad configuration.
I ran sudo nvidia-bug-report.sh before the bug was encountered (“didnt-die-yet”) and after it occurred on two separate occasions. I’ve attached the output here for all 3 runs:
nvidia-bug-report-2.log.gz (335.8 KB)
nvidia-bug-report-didnt-die-yet.log.gz (417.9 KB)
nvidia-bug-report.log.gz (315.3 KB)
As soon as I got it, I removed my RTX 3090 and upgraded from the nvidia-565 drivers to nvidia-570. I tried both the open and non-open variants.
# First time
sudo apt install nvidia-driver-570-server-open xserver-xorg-video-nvidia-570-server
# Second time
sudo apt install --reinstall nvidia-driver-570-server xserver-xorg-video-nvidia-570-server
This seemed to work OK at first, and I was able to run nvidia-smi.
However, after a while my X server would freeze and become unresponsive. I could still SSH into the machine, but when I ran nvidia-smi I would see:
Unable to determine the device handle for GPU0: 0000:08:00.0: Unknown Error
No devices were found
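To catch this state as soon as it happens, a small check on the nvidia-smi output might help. The following is only a rough sketch: the function name smi_reports_lost_gpu is hypothetical, and it simply matches the two error strings shown above rather than anything documented as a stable interface.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is my own): succeeds (exit 0) if the text on stdin
# contains either of the nvidia-smi failure messages quoted above.
smi_reports_lost_gpu() {
    grep -q -E 'Unable to determine the device handle|No devices were found'
}

# Usage sketch, left commented out so the block only defines the function:
#   if nvidia-smi 2>&1 | smi_reports_lost_gpu; then
#       sudo nvidia-bug-report.sh   # collect data before the module is unloaded
#   fi
```

Running the commented-out check from cron or a loop over SSH would be one way to grab a bug report right after the failure, assuming the machine is still reachable.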
Then I tried journalctl -b | grep -i nvidia:
Mar 07 16:40:58 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (--) NVIDIA(GPU-0):
Mar 07 16:40:58 nr200ubuntu systemd[3730]: Started app-gnome-nvidia\x2dsettings\x2dautostart-4537.scope - Application launched by gnome-session-binary.
Mar 07 16:40:59 nr200ubuntu /usr/libexec/gdm-x-session[3053]: (II) NVIDIA(GPU-0): Deleting GPU-0
E: Failed to fetch https://nvidia.github.io/libnvidia-container/stable/deb/amd64/InRelease
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Skipping resurvey of engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.17.1'. Reason: not selected
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Skipping resurvey of engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.15.3'. Reason: not selected
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Surveying selected engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.18.0'
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Survey for engine 'llama.cpp-linux-x86_64-nvidia-cuda-avx2@1.18.0' took 277.61ms
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Best 'gguf' backend for detected to be 'llama.cpp-linux-x86_64-nvidia-cuda-avx2'
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BackendManager] Backend preferences file set for the first time: [{"model_format":"gguf","name":"llama.cpp-linux-x86_64-nvidia-cuda-avx2","version":"1.18.0"}]. Setting as last preferences for subscription.
Mar 07 16:41:37 nr200ubuntu LMStudio.desktop[7714]: [BundledDepsUnpackager] Finding latest version for runtime: llama.cpp-linux-x86_64-nvidia-cuda-avx2
Mar 07 16:43:00 nr200ubuntu Keybase[5400]: Warning: loader_scanned_icd_add: Could not get 'vkCreateInstance' via 'vk_icdGetInstanceProcAddr' for ICD libGLX_nvidia.so.0
Mar 07 16:45:35 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (EE) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
Mar 07 16:45:35 nr200ubuntu /usr/libexec/gdm-x-session[3933]: (EE) NVIDIA(0): recover...
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
Mar 07 16:45:36 nr200ubuntu kernel: nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:6:0:0x0000000f
The kernel log says the device fell off the bus (Xid 79).
/var/log/kern.log:
2025-03-03T02:07:36.128450-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-03T02:09:03.861458-08:00 nr200ubuntu kernel: message repeated 9 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-03T22:01:03.941742-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 565.57.01 Thu Oct 10 12:29:05 UTC 2024
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 565.57.01 Thu Oct 10 12:02:00 UTC 2024
2025-03-03T22:01:03.941744-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-03T22:01:03.941766-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-03T22:01:10.723748-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-04T01:01:59.018107-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-04T01:01:59.169091-08:00 nr200ubuntu kernel: message repeated 11 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-04T11:27:45.800552-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
2025-03-04T11:27:45.800554-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 565.57.01 Thu Oct 10 12:29:05 UTC 2024
2025-03-04T11:27:45.800554-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 565.57.01 Thu Oct 10 12:02:00 UTC 2024
2025-03-04T11:27:45.800555-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-04T11:27:45.800568-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-04T11:27:52.737480-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-04T11:28:33.415495-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-04T11:28:33.576484-08:00 nr200ubuntu kernel: message repeated 11 times: [ [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership]
2025-03-07T16:06:24.966484-08:00 nr200ubuntu kernel: [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000800] Failed to grab modeset ownership
2025-03-07T16:40:38.622890-08:00 nr200ubuntu kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
2025-03-07T16:40:38.622942-08:00 nr200ubuntu kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 570.86.15 Release Build (dvs-builder@U16-I2-C03-12-4) Thu Jan 23 22:50:36 UTC 2025
2025-03-07T16:40:38.622943-08:00 nr200ubuntu kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 570.86.15 Release Build (dvs-builder@U16-I2-C03-12-4) Thu Jan 23 22:33:58 UTC 2025
2025-03-07T16:40:38.622947-08:00 nr200ubuntu kernel: [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
2025-03-07T16:40:38.622998-08:00 nr200ubuntu kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 1
2025-03-07T16:40:38.623000-08:00 nr200ubuntu kernel: fbcon: nvidia-drmdrmfb (fb0) is primary device
2025-03-07T16:40:38.623001-08:00 nr200ubuntu kernel: nvidia 0000:08:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
2025-03-07T16:40:38.623001-08:00 nr200ubuntu kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
2025-03-07T16:45:35.965762-08:00 nr200ubuntu kernel: NVRM: GPU at PCI:0000:08:00: GPU-1a0943be-1397-242a-9aa0-b8b66d01355c
2025-03-07T16:45:35.965777-08:00 nr200ubuntu kernel: NVRM: GPU Board Serial Number: 1795024031571
2025-03-07T16:45:35.965778-08:00 nr200ubuntu kernel: NVRM: Xid (PCI:0000:08:00): 79, GPU has fallen off the bus.
2025-03-07T16:45:35.965779-08:00 nr200ubuntu kernel: NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
2025-03-07T16:45:35.965780-08:00 nr200ubuntu kernel: NVRM: GPU 0000:08:00.0: GPU serial number is 1795024031571.
2025-03-07T16:45:35.965780-08:00 nr200ubuntu kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
2025-03-07T16:45:35.965782-08:00 nr200ubuntu kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
2025-03-07T16:45:35.965803-08:00 nr200ubuntu kernel: message repeated 38 times: [ NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!]
2025-03-07T16:45:35.965804-08:00 nr200ubuntu kernel: NVRM: prbEncStartAlloc: Can't allocate memory for protocol buffers.
2025-03-07T16:45:35.965807-08:00 nr200ubuntu kernel: NVRM: A GPU crash dump has been created. If possible, please run
2025-03-07T16:45:35.965808-08:00 nr200ubuntu kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
2025-03-07T16:45:35.965808-08:00 nr200ubuntu kernel: NVRM: the NVIDIA kernel module is unloaded.
2025-03-07T16:45:35.965809-08:00 nr200ubuntu kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: GPU lost from the bus [NV_ERR_GPU_IS_LOST] (0x0000000F) returned from intrGetPendingStall_GM107(pGpu, pIntr, pEngines, pThreadState) @ intr_gp100.c:193
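To see how often this happens and whether it is always the same Xid, the events can be pulled out of the log. This is a minimal sketch that assumes the kern.log line format shown above (ISO timestamp first, then "NVRM: Xid (PCI:...): N, ..."); the function name xid_events is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is my own): reads a kernel log on stdin and prints
# one line per Xid event as "timestamp pci_address xid_code".
# Assumes the log format shown above; other syslog formats would need a
# different pattern.
xid_events() {
    sed -n -E 's/^([^ ]+) .*NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): ([0-9]+),.*$/\1 \2 \3/p'
}

# Usage sketch, left commented out so the block only defines the function:
#   xid_events < /var/log/kern.log
```

For the excerpt above this should report a single event with code 79 at 16:45:35 on 2025-03-07.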
For more debugging info, I’m on Ubuntu 24.04.2 LTS with kernel 6.8.0-55-generic:
lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 24.04.2 LTS
Release: 24.04
Codename: noble
uname -r -a
Linux nr200ubuntu 6.8.0-55-generic #57-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 12 23:42:21 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
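Since I switched between the 565 and 570 drivers and between the open and non-open variants, it may also be useful to confirm which kernel module was actually loaded at each boot. The sketch below assumes the "NVRM: loading NVIDIA UNIX ..." lines look like the ones in my kern.log excerpt; the function name driver_loads and the "open"/"proprietary" labels are my own.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is my own): reads a kernel log on stdin and prints
# "timestamp driver_version flavor" for each NVIDIA kernel module load line.
# The flavor is "open" when the line mentions "Open Kernel Module", otherwise
# "proprietary" (labels chosen here for illustration).
driver_loads() {
    awk '/NVRM: loading NVIDIA UNIX/ {
        flavor = ($0 ~ /Open Kernel Module/) ? "open" : "proprietary"
        # The version is the first field that is purely dotted digits.
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) {
                print $1, $i, flavor
                break
            }
        }
    }'
}

# Usage sketch, left commented out so the block only defines the function:
#   driver_loads < /var/log/kern.log
```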