Xid 119 GSP Timeout on RTX 6000 Pro Blackwell (575.64.3) under Load – Reproducible Crash

Hardware:

  • GPU: RTX 6000 Pro Blackwell (sm120)
  • Board Serial Number: 1791625015739
  • Motherboard: System76 Thelio Mega / TRX50 AI TOP
  • BIOS: F4 Z5 03/06/2025

Software Environment:

  • OS: Ubuntu 24.04 (64-bit)
  • Kernel: 6.14.0-22-generic
  • Driver: 575.64.3 (released 2025-07-01)
  • CUDA: 12.9
  • Driver mode: nvidia-open (open kernel modules)
  • GSP: Enabled (required for open driver path)

Problem Description:
Whenever the GPU is placed under sustained load (e.g., running Ollama for LLM inference), it fails completely. The error is always a GSP RPC timeout on function 76 (GSP_RM_CONTROL), reported as Xid 119, and the GPU stays unusable until the system is fully rebooted.

This is 100% reproducible: the failure occurs consistently within 1–2 minutes of putting the GPU under load.

Log Excerpt (from dmesg):

NVRM: Xid (PCI:0000:81:00): 119, pid=97581, name=ollama, Timeout after 45s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL)
...
NVRM: _issueRpcAndWait: rpcRecvPoll timedout for fn 76!
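
For anyone who wants to pull these events out of their own logs, here is a minimal Python sketch that parses an Xid line shaped like the one above. The regular expressions are modeled only on the single line quoted in this post, so other driver versions may format the message differently; the `XidEvent` type and `parse_xid_line` name are my own and not part of any NVIDIA tool.

```python
import re
from typing import NamedTuple, Optional

# Modeled on the one Xid line quoted in this report (assumption: other
# driver versions may word or order the fields differently).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), (?P<msg>.*)"
)
# The RPC function number and name appear at the end of the message text.
FN_RE = re.compile(r"Expected function (?P<fn>\d+) \((?P<fn_name>\w+)\)")


class XidEvent(NamedTuple):
    pci: str
    xid: int
    pid: int
    name: str
    rpc_function: Optional[int]
    rpc_name: Optional[str]


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return the parsed Xid event, or None if the line is not an Xid line."""
    m = XID_RE.search(line)
    if m is None:
        return None
    fn = FN_RE.search(m.group("msg"))
    return XidEvent(
        pci=m.group("pci"),
        xid=int(m.group("xid")),
        pid=int(m.group("pid")),
        name=m.group("name"),
        rpc_function=int(fn.group("fn")) if fn else None,
        rpc_name=fn.group("fn_name") if fn else None,
    )
```

Feeding each line of `dmesg` output through `parse_xid_line` and keeping the non-None results gives a quick list of which process and which RPC function were involved in each timeout.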

Steps to Reproduce:

  1. Boot into Ubuntu 24.04 with kernel 6.14 and driver 575.64.3 (nvidia-open)
  2. Run any sustained GPU compute task using Ollama or similar (loads LLM weights and runs inference)
  3. Wait for 30–90 seconds under load
  4. Observe Xid 119 GSP Timeout
  5. The GPU enters an unrecoverable failure state; a full reboot is required
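
The steps above can be scripted roughly as follows. This is a sketch under several assumptions: Ollama is listening on its default local endpoint (`http://localhost:11434/api/generate`), the model name `llama3` is a placeholder for whatever model is already pulled, and `dmesg` is readable by the current user (it may need root when `kernel.dmesg_restrict` is set). The helper names are mine.

```python
import json
import subprocess
import time
import urllib.request
from typing import Callable, Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3"  # placeholder: substitute any model already pulled locally


def generate_once(prompt: str = "Write a long essay about GPUs.") -> None:
    """Send one non-streaming generation request to keep the GPU busy."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        resp.read()


def kernel_log() -> str:
    """Return the current kernel ring buffer as text (empty if dmesg fails)."""
    return subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    ).stdout


def xid119_seen(log_text: str) -> bool:
    """True if any NVRM Xid line in the log reports Xid 119."""
    return any(
        "NVRM: Xid" in line and "): 119," in line
        for line in log_text.splitlines()
    )


def run_until_xid(
    load: Callable[[], None],
    read_log: Callable[[], str],
    max_seconds: float = 180.0,
) -> Optional[float]:
    """Repeat the load until Xid 119 shows up; return seconds elapsed, or None."""
    start = time.monotonic()
    while time.monotonic() - start < max_seconds:
        try:
            load()
        except OSError:
            # Requests are expected to fail or time out once the GPU is wedged.
            time.sleep(1.0)
        if xid119_seen(read_log()):
            return time.monotonic() - start
    return None
```

Calling `run_until_xid(generate_once, kernel_log)` on an affected machine should return the time to failure in seconds, or `None` if no Xid 119 is logged within the limit. The load and log readers are passed in as functions so a different workload can be swapped in without touching the loop.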

What I’ve Tried:

  • Tried multiple kernel versions (6.14, 6.5, 6.1 LTS); the issue persists on all of them
  • Tried driver versions 570.x and 575.x; all are affected
  • GSP cannot be disabled due to open driver requirement
  • Verified full hardware compatibility and PCIe stability
  • Confirmed all system firmware and BIOS are fully updated
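
To confirm that GSP firmware is actually active on a given system, one option is to read the "GSP Firmware Version" field from `nvidia-smi -q`. The sketch below assumes that field is present in the output and reads `N/A` when GSP is not in use; the function names are my own.

```python
import re
import subprocess
from typing import Optional

# Matches the "GSP Firmware Version : <value>" line in `nvidia-smi -q` output.
GSP_RE = re.compile(
    r"^\s*GSP Firmware Version\s*:\s*(?P<value>.+?)\s*$", re.MULTILINE
)


def gsp_firmware_version(smi_output: str) -> Optional[str]:
    """Return the reported GSP firmware version, or None if absent or N/A."""
    m = GSP_RE.search(smi_output)
    if m is None:
        return None
    value = m.group("value")
    return None if value == "N/A" else value


def query_nvidia_smi() -> str:
    """Run `nvidia-smi -q` and return its stdout (empty string on failure)."""
    return subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=False
    ).stdout
```

`gsp_firmware_version(query_nvidia_smi())` returning a version string suggests the GSP path is in use, which is what I would expect with the open kernel modules.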

Impact:
The GPU cannot be used for any production workload under Linux due to this bug. It appears to be a low-level GSP firmware or RPC handler issue under the nvidia-open driver path, possibly triggered by specific workloads issuing rapid RPCs to GSP.

Bug Report Attachment:
nvidia-bug-report.log.gz (439.0 KB)

Additional Info:
If more debug steps are needed (e.g., enabling more verbose RPC traces or firmware dumps), I’m happy to test them.