Hardware:
- GPU: RTX 6000 Pro Blackwell (sm120)
- Board Serial Number: 1791625015739
- Motherboard: System76 Thelio Mega / TRX50 AI TOP
- BIOS: F4 Z5 03/06/2025
Software Environment:
- OS: Ubuntu 24.04 (64-bit)
- Kernel: 6.14.0-22-generic
- Driver: 575.64.3 (released 2025-07-01)
- CUDA: 12.9
- Driver mode: nvidia-open (open kernel modules)
- GSP: Enabled (required for open driver path)
Problem Description:
Whenever the GPU is placed under sustained load (e.g., running Ollama for LLM inference), the system experiences a full GPU failure. The error is always a GSP timeout at GSP_RM_CONTROL (function 76), reported as Xid 119, and recovery requires a full system reboot.
This is 100% reproducible: it occurs consistently within 1–2 minutes of initiating load on the GPU.
Log Excerpt (from dmesg):
NVRM: Xid (PCI:0000:81:00): 119, pid=97581, name=ollama, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL)
...
NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!
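For anyone comparing their own logs against this one, here is a minimal Python sketch that pulls the fields out of an Xid line. The regular expression and the name `parse_xid` are assumptions inferred from the single excerpt above, not an official description of the NVRM log format, so other driver versions may need a different pattern.

```python
import re

# Pattern inferred from the one Xid line in this report (an assumption);
# other driver versions may format these lines differently.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>[^,]+), name=(?P<name>[^,]+), (?P<message>.*)"
)


def parse_xid(line):
    """Return a dict of Xid fields, or None if the line is not an Xid report."""
    match = XID_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["xid"] = int(fields["xid"])  # numeric code, e.g. 119
    return fields


if __name__ == "__main__":
    sample = (
        "NVRM: Xid (PCI:0000:81:00): 119, pid=97581, name=ollama, "
        "Timeout after 45s of waiting for RPC response from GPU0 GSP! "
        "Expected function 76 (GSP_RM_CONTROL)"
    )
    print(parse_xid(sample))
```

Lines that are not Xid reports, such as the `_issueRpcAndWait` line, return None, which makes it easy to filter a full dmesg dump down to the Xid events only.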
Steps to Reproduce:
- Boot into Ubuntu 24.04 with kernel 6.14 and driver 575.64.3 (nvidia-open)
- Run any sustained GPU compute task using Ollama or similar (loads LLM weights and runs inference)
- Wait for 30–90 seconds under load
- Observe Xid 119 GSP Timeout
- System enters unrecoverable GPU failure state — requires full reboot
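To put a number on the time to failure in the steps above, a rough sketch like the following could be started just before launching the load. It follows the kernel log with `dmesg --follow` and stops at the first Xid 119 line. The function name `wait_for_xid` is hypothetical, and the substring match assumes Xid lines look like the excerpt in this report.

```python
import subprocess
import time


def wait_for_xid(lines, code=119):
    """Return the first line reporting the given Xid code, or None if the stream ends."""
    marker = f"): {code},"  # matches "...): 119, pid=..." as in the excerpt
    for line in lines:
        if "NVRM: Xid" in line and marker in line:
            return line.rstrip("\n")
    return None


if __name__ == "__main__":
    # Start this right before the GPU load; reading the kernel log may need root.
    # `dmesg --follow` prints the existing buffer first. Since this failure
    # forces a reboot, the buffer should hold no earlier Xid 119 entries.
    start = time.monotonic()
    proc = subprocess.Popen(["dmesg", "--follow"], stdout=subprocess.PIPE, text=True)
    try:
        hit = wait_for_xid(proc.stdout)
    finally:
        proc.terminate()
    # Elapsed time is measured from script start, not from the start of the load.
    print(f"{time.monotonic() - start:.1f}s: {hit}")
```

Because the driver itself waits 45s before reporting the timeout, the hang begins roughly 45 seconds before the printed time.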
What I’ve Tried:
- Tried with multiple kernel versions (6.14, 6.5, 6.1 LTS) — issue persists
- Driver versions 570.x and 575.x — all affected
- GSP cannot be disabled due to open driver requirement
- Verified full hardware compatibility and PCIe stability
- Confirmed all system firmware and BIOS are fully updated
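For others who want to repeat the PCIe check, one way to do it (a sketch, not necessarily how it was done for this report) is to compare the negotiated link speed and width against the maximum using the standard sysfs attributes. The device path assumes the GPU is function 0 of the 0000:81:00 device named in the Xid line, and the helper names are hypothetical.

```python
from pathlib import Path

ATTRS = ("current_link_speed", "max_link_speed",
         "current_link_width", "max_link_width")


def read_link_state(device_dir):
    """Read PCIe link attributes from a sysfs device directory; missing files map to None."""
    state = {}
    for attr in ATTRS:
        path = Path(device_dir) / attr
        state[attr] = path.read_text().strip() if path.exists() else None
    return state


def link_degraded(state):
    """True if the current speed or width differs from the advertised maximum."""
    return (state["current_link_speed"] != state["max_link_speed"]
            or state["current_link_width"] != state["max_link_width"])


if __name__ == "__main__":
    # Assumption: the GPU sits at 0000:81:00.0 (function 0 of the device in the log).
    state = read_link_state("/sys/bus/pci/devices/0000:81:00.0")
    print(state, "degraded" if link_degraded(state) else "at max")
```

GPUs commonly drop link speed at idle to save power, so a lower current speed is only meaningful if it is read while the GPU is under load. A reduced width is the more suspicious signal.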
Impact:
The GPU cannot be used for any production workload under Linux due to this bug. It appears to be a low-level GSP firmware or RPC handler issue under the nvidia-open driver path, possibly triggered by specific workloads issuing rapid RPCs to GSP.
Bug Report Attachment:
nvidia-bug-report.log.gz (439.0 KB)
Attached
Additional Info:
If more debug steps are needed (e.g., enabling more verbose RPC traces or firmware dumps), I’m happy to test them.