Hello, I am getting desperate after two weeks of trying to debug problems with this card while running some models with vLLM.
Sometimes right after starting vLLM, sometimes after a couple of requests, my NVIDIA card gets stuck.
As English is not my native language, I used Claude to help describe my problem in this message.
System Information
- GPU: NVIDIA RTX 5090
- Driver: nvidia-driver-580-open (version 580.95.05-0ubuntu1)
- OS: Ubuntu 24.04 LTS
- Kernel: 6.8.0-87-generic #88-Ubuntu
- Motherboard: ASUS TUF GAMING B860-PLUS WIFI (BIOS 1405, dated 05/06/2025)
- CPU: Intel (MTL-based system)
Problem Description
Looking at dmesg, I have traced the problem far enough to see that it seems to be related to the kernel driver.
I am experiencing consistent GPU crashes when attempting to create embeddings or run CUDA operations on my RTX 5090. The driver repeatedly crashes with Xid 13 errors (Graphics SM Warp Exception - Illegal Instruction Encoding).
Initially, the system also showed Xid 119 (GSP Timeout) errors, which resolved only after disabling the Graphics System Processor (GSP) firmware. However, disabling GSP revealed the underlying Xid 13 errors, indicating a deeper issue with instruction encoding for this GPU architecture.
Error Messages
Current Errors (GSP disabled):
[ 271.603963] NVRM: Xid (PCI:0000:02:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding
[ 271.603999] NVRM: Xid (PCI:0000:02:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors
[ 271.604012] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ESR 0x505730=0x9 0x505734=0x4 0x505728=0x1c81fb60 0x50572c=0x1174
[ 271.657496] NVRM: Xid (PCI:0000:02:00): 13, pid=5321, name=VLLM::EngineCor, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000510, Data 0017e2ac
Previous Errors (with GSP enabled):
[ 989.201227] NVRM: Xid (PCI:0000:02:00): 119, pid=5742, name=python, Timeout after 45s of waiting for RPC response from GPU0 GSP!
[ 989.201241] NVRM: GPU0 RPC history shows repeated FREE operations timing out
[ 1079.206286] NVRM: Xid (PCI:0000:02:00): 119, Timeout after 45s waiting for RPC response
[ 1079.206293] NVRM: nvAssertFailedNoLog: Assertion failed: Back to back GSP RPC timeout detected! GPU marked for reset
[dmesg.txt|attachment](upload://9Zk8rzKSK8kjPmhDOv3tKY2okc2.txt) (238.7 KB)
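To make a log like the one above easier to summarize, here is a minimal Python sketch that pulls the Xid code, PCI address, and (when present) the pid and process name out of `NVRM: Xid` lines. The regular expressions are assumptions based only on the line shapes quoted in this post; other driver versions may format these lines differently. The function names `parse_xid_lines` and `count_by_xid` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts in this post:
# [ 271.603963] NVRM: Xid (PCI:0000:02:00): 13, <message>
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), (?P<msg>.*)"
)
# Some Xid lines carry the offending process as "pid=..., name=...".
PID_RE = re.compile(r"pid=(?P<pid>\d+), name=(?P<name>[^,]+)")


def parse_xid_lines(text):
    """Return one dict per NVRM Xid line found in `text`; other lines are skipped."""
    events = []
    for line in text.splitlines():
        m = XID_RE.search(line)
        if m is None:
            continue
        event = {
            "timestamp": float(m["ts"]),
            "pci": m["pci"],
            "xid": int(m["xid"]),
            "message": m["msg"].strip(),
            "pid": None,
            "name": None,
        }
        p = PID_RE.search(m["msg"])
        if p is not None:
            event["pid"] = int(p["pid"])
            event["name"] = p["name"]
        events.append(event)
    return events


def count_by_xid(events):
    """Count how many events were seen for each Xid code."""
    return Counter(e["xid"] for e in events)


# Two lines copied from the log excerpts in this post.
SAMPLE = """\
[ 271.657496] NVRM: Xid (PCI:0000:02:00): 13, pid=5321, name=VLLM::EngineCor, Graphics Exception: channel 0x00000002, Class 0000cec0, Offset 00000510, Data 0017e2ac
[ 989.201227] NVRM: Xid (PCI:0000:02:00): 119, pid=5742, name=python, Timeout after 45s of waiting for RPC response from GPU0 GSP!
"""

if __name__ == "__main__":
    for e in parse_xid_lines(SAMPLE):
        print(e["timestamp"], e["xid"], e["pid"], e["name"])
```

The same functions can be fed the full attached log, for example by reading the saved `dmesg.txt` and passing its contents to `parse_xid_lines`, to see which Xid codes dominate and which processes triggered them.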
Troubleshooting Steps Already Attempted
- ✅ Updated to latest nvidia-driver-580-open (580.95.05)
- ✅ Updated CUDA to version 12.x
- ✅ Tried PyTorch with cu124 builds
- ✅ Set CUDA_LAUNCH_BLOCKING=1 and TORCH_USE_CUDA_DSA=1
- ✅ Disabled GSP firmware (temporarily resolved Xid 119, revealed Xid 13)
- ✅ Verified GPU detection with nvidia-smi (GPU correctly identified)
- ✅ Attempted firmware refresh with fwupdmgr
- ❌ Cannot use proprietary driver (not compatible with RTX 5090)
Key Observations
- The RTX 5090 (Blackwell) uses compute capability 12.0 (sm_120), not 9.0 as I first assumed
- Xid 13 errors occur in Graphics SM (Streaming Multiprocessor) units
- GSP firmware appears to have communication or compatibility issues
- Disabling GSP does not resolve the underlying instruction encoding problem
- The error suggests that the kernel code being executed was compiled for a different compute capability and is not compatible with the actual GPU hardware or firmware state
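One way to test the last observation is to compare the compute capability the device reports against the architectures the installed PyTorch build was compiled for. The sketch below is a simplified check, assuming the usual CUDA rules: an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for any equal or newer capability. The helper names `parse_arch` and `build_supports_device` are hypothetical; `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing PyTorch calls, used here only inside an optional helper.

```python
import re


def parse_arch(entry):
    """Split 'sm_90', 'sm_120', 'sm_90a' or 'compute_90' into (kind, major, minor, suffix).

    Returns None for strings that do not look like an architecture entry.
    """
    m = re.fullmatch(r"(sm|compute)_(\d+)(\d)([a-z]?)", entry)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)


def build_supports_device(capability, arch_list):
    """Simplified check: can a build with `arch_list` run on a device of `capability`?

    capability: (major, minor) tuple, e.g. (8, 6)
    arch_list:  strings such as ['sm_80', 'sm_90', 'compute_90']
    """
    dev = tuple(capability)
    for entry in arch_list:
        parsed = parse_arch(entry)
        if parsed is None:
            continue
        kind, major, minor, suffix = parsed
        arch = (major, minor)
        if suffix:
            # Assumption: architecture-specific entries (e.g. sm_90a) need an exact match.
            if arch == dev:
                return True
            continue
        if kind == "sm" and major == dev[0] and minor <= dev[1]:
            return True  # native binary for this major version
        if kind == "compute" and arch <= dev:
            return True  # PTX that the driver can JIT-compile forward
    return False


def check_local_pytorch(device_index=0):
    """Optional helper: run the check against the locally installed PyTorch, if any."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    cap = torch.cuda.get_device_capability(device_index)
    archs = torch.cuda.get_arch_list()
    return cap, archs, build_supports_device(cap, archs)
```

Calling `check_local_pytorch()` inside the same Python environment that vLLM uses returns the reported capability, the build's architecture list, and whether they match under the simplified rules above. A `False` result would point at the PyTorch or vLLM build rather than at the driver.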
Questions
- Is there a known issue with nvidia-driver-580-open and the RTX 5090 on Linux?
- Are there updated GSP firmware binaries specifically for the RTX 5090 that should be installed separately?
- Is the RTX 5090's compute capability correctly supported in the 580-open driver?
Attachments
- Full dmesg log showing Xid 13 and 119 errors
- nvidia-smi output
- Kernel and driver version information
nvidia-smi
Sat Nov 1 09:47:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:02:00.0 Off |                  N/A |
|  0%   31C    P8              2W /  600W |       2MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
ii libnvidia-encode-580:amd64 580.95.05-0ubuntu1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-580:amd64 580.95.05-0ubuntu1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-580:amd64 580.95.05-0ubuntu1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-580:amd64 580.95.05-0ubuntu1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gpucomp-580:amd64 580.95.05-0ubuntu1 amd64 NVIDIA binary GPU compiler library
ii libnvidia-ml-dev:amd64 12.0.140~12.0.1-4build4 amd64 NVIDIA Management Library (NVML) development files
ii nvidia-container-toolkit 1.18.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.18.0-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-cuda-dev:amd64 12.0.146~12.0.1-4build4 amd64 NVIDIA CUDA development files
ii nvidia-cuda-toolkit 12.0.140~12.0.1-4build4 amd64 NVIDIA CUDA development toolkit
ii nvidia-dkms-580-open 580.95.05-0ubuntu1 amd64 NVIDIA DKMS package (open kernel module)
ii nvidia-driver-580-open 580.95.05-0ubuntu1 amd64 NVIDIA driver (open kernel) metapackage
ii nvidia-firmware-580 580.95.05-0ubuntu1 amd64 Firmware files used by the kernel module
ii nvidia-kernel-common-580 580.95.05-0ubuntu1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-580-open 580.95.05-0ubuntu1 amd64 NVIDIA kernel source package
ii nvidia-modprobe 580.95.05-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-opencl-dev:amd64 12.0.140~12.0.1-4build4 amd64 NVIDIA OpenCL development files
ii nvidia-persistenced 580.95.05-0ubuntu1 amd64 daemon to maintain persistent software state in the NVIDIA driver
ii nvidia-profiler 12.0.146~12.0.1-4build4 amd64 NVIDIA Profiler for CUDA and OpenCL
ii xserver-xorg-video-nvidia-580 580.95.05-0ubuntu1 amd64 NVIDIA binary Xorg driver
$ uname -a
Linux iasantiago 6.8.0-87-generic #88-Ubuntu SMP PREEMPT_DYNAMIC Sat Oct 11 09:28:41 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux