RTX 5070 Ti (Blackwell) — Xid 79 / Xid 119 cascade / silent idle hangs on 580.159.03 open module

# Bug report — RTX 5070 Ti intermittent crashes (Xid 79, Xid 119 cascade, silent hangs)

## TL;DR

System76 Thelio Mira (Ryzen 9 9950X + RTX 5070 Ti, Blackwell GB203) on Pop!_OS 22.04 with NVIDIA 580.x open kernel module. **Intermittent crashes since the machine was new (~9–10 months).** Three distinct failure signatures observed; all reproducible across driver bumps 580.82 → 580.119 → 580.126 → 580.159 and across kernels 6.12 → 6.16 → 6.17. Power-management workarounds (DPM=0, persistence mode, `pcie_aspm=off`) have been tried in combination and **do not prevent the workload-triggered fall-off-the-bus mode**.

Detailed kernel logs are available only from **2026-05-17 onward** due to log rotation; earlier crashes are observed by the user but not captured.

## System

| | |

|—|—|

| Vendor / model | System76 Thelio Mira (r4) |

| BIOS | American Megatrends 4.10.SP01, 2026-02-23 |

| CPU | AMD Ryzen 9 9950X (16-core) |

| GPU | NVIDIA GeForce RTX 5070 Ti (Blackwell, GB203-A) — PCI ID `10de:2c05`, subsystem `1458:41a8` (Gigabyte) |

| VBIOS | 98.03.3B.80.C6 |

| GPU memory | 16 GB |

| Display path | **AMD iGPU drives the display.** NVIDIA card is idle except for compute. |

| OS | Pop!_OS 22.04 |

| Kernel (current) | 6.17.9-76061709-generic (System76 build, kernel-config `202511241048`) |

| NVIDIA driver (current) | 580.159.03 (open kernel module) |

| OS install date | 2025-08-09 (rootfs / machine-id birth) |

## Driver / kernel timeline (apt history)

Drivers used on this machine — all 580-series so far have been the **open** kernel module:

| Date | NVIDIA driver | Kernel |

|—|—|—|

| 2025-07-02 | nvidia-driver-570-open 570.153.02 | linux-system76 6.12.10 |

| 2025-08-09 | (same driver) | 6.12.10 (rebuild) |

| 2025-09-10 | (same driver) | **6.16.3** (major bump) |

| 2025-09-22 / 09-25 | 570.153.02 (rebuilds) | 6.16.3 (rebuilds) |

| **2025-10-08** | **transition to 580.82.09-open** | 6.16.3 |

| 2025-10-10 / 10-11 | 580.82.09 (rebuilds) | 6.16.3 |

| 2025-10-21 | — | 6.16.3 (rebuild) |

| 2025-11-11 | — | **6.17.4** |

| 2026-01-07 | — | **6.17.9** |

| 2026-01-15 | **580.119.02** | 6.17.9 |

| 2026-03-19 | **580.126.18** | 6.17.9 |

| 2026-05-04 | **580.159.03** | 6.17.9 (rebuild) |

| 2026-05-17 | 580.159.03 | 6.17.9 (rebuild) |

Crashes have been intermittent across **every** combination listed above per the user’s recollection. The recent (May 17–18) cluster captured in logs occurred on 580.159.03 + 6.17.9.

## Failure signatures

Three distinct kernel-log signatures observed:

### A. Silent hang at idle

- No Xid logged.

- Journal truncates mid-stream; no errors, no panic, no NVRM message.

- Recovery: hard power-cycle required.

- Triggered: with the machine sitting idle on the desktop. The display is on the AMD iGPU, so the NVIDIA card is genuinely idle when this hits.

- Captured: 2026-05-17 16:56 boot (52 min, the original case); 2026-05-18 11:22 boot (23 min, **with `NVreg_DynamicPowerManagement=0` already in effect**).

- Log: `~/nvidia-crash-2026-05-18-b-1-silent.log`

### B. Xid 79 — “GPU has fallen off the bus”

- Two sub-triggers observed:

1. **`nvidia-smi` polling** while otherwise idle (2026-05-17 16:55, 2026-05-18 09:47 boot).

2. **Sustained `ollama` LLM-inference load** (~4–5 `/api/chat` requests per second, ~5 min into boot, no preceding GSP timeout — 2026-05-18 12:19 boot, **with DPM=0 + persistence mode + `pcie_aspm=off` all confirmed active**).

- After Xid 79: cascade of `_threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed`, then ~3900 `rpcSendMessage failed with status 0x0000000f` GSP RPC errors during teardown.

- Recovery: hard power-cycle required.

- Log: `~/nvidia-crash-2026-05-18-b-2.log`

### C. Xid 119 → 110 → 79 cascade under load

- Xid 119 (GSP RPC timeout) → Xid 154 (recovery action / reset required) → `SEC_FAULT: _CLOCK_GPC_FMON` → `SEC_FAULT: _DEVICE_LOCKDOWN` → Xid 79.

- `uvm encountered global fatal error 0x60, requiring os reboot to recover`.

- Triggered: `ollama` inference workload (2026-05-18 11:02 boot, 19 min uptime).

- Captured: `~/nvidia-crash-2026-05-18-b-1.log`

## Boot history (captured period only)

The system has had intermittent crashes since the machine was new (the user reports issues going back ~9–10 months). Kernel logs prior to 2026-03-22 are gone due to rotation; `journalctl --list-boots` only retains entries from 2026-03-27 onward. The boots below are the period for which detailed evidence exists.

| Boot started | Duration | Outcome |

|—|—|—|

| 2026-05-04 11:21 | 41 min | clean shutdown |

| 2026-05-17 16:27 | 27 min | Xid 79 (triggered by `nvidia-smi`) |

| 2026-05-17 16:56 | 52 min | **silent hang** at idle |

| 2026-05-17 18:33 | 15 h | Xid 79 at 04:05 then 5+ h of modeset errors |

| 2026-05-18 09:45 | 1 min | clean reboot |

| 2026-05-18 09:47 | 1 h 13 m | Xid 119 GSP timeout + kernel stack trace |

| 2026-05-18 11:02 | 19 min | Xid 119 → 110 → 79 cascade under ollama — first crash with DPM=0 in effect |

| 2026-05-18 11:22 | 23 min | **silent hang at idle** with DPM=0 in effect |

| 2026-05-18 11:47 | 18 min | clean reboot (persistence-mode override applied) |

| 2026-05-18 12:07 | 12 min | clean reboot (ASPM config staged) |

| 2026-05-18 12:19 | 5 min | **Xid 79 under ollama load** — first all-fixes boot (DPM=0 + persistence + `pcie_aspm=off`); ASPM disabled at the kernel level did not prevent fall-off-the-bus |

| 2026-05-18 12:25 | 1 min | clean reboot |

| 2026-05-18 12:27 | current | second all-fixes boot |

## Mitigations tried (and current results)

| Mitigation | Applied | Verified active | Effect on workload-triggered Xid 79 | Effect on silent idle hang |

|—|—|—|—|—|

| `NVreg_DynamicPowerManagement=0` | 2026-05-18 | `DynamicPowerManagement: 0` in `/proc/driver/nvidia/params` | did NOT prevent | did NOT prevent (recurred 11:22 boot) |

| Persistence mode (drop-in override to remove `–no-persistence-mode` from `nvidia-persistenced.service`) | 2026-05-18 12:07 | `nvidia-smi -q` reports `Persistence Mode: Enabled` | did NOT prevent | not retested |

| `pcie_aspm=off` kernel cmdline | 2026-05-18 12:19 boot | kernel logs `PCIe ASPM is disabled`; `/proc/cmdline` confirms | did NOT prevent (crashed 5 min in on first such boot) | not retested |

Power-management explanations are therefore **exhausted** for the workload-triggered fall-off-the-bus mode. The silent-idle-hang mode has not yet been retested with all three mitigations active *while idle*, because the 12:19 boot died under load before reaching idle.

## Representative log excerpt — 2026-05-18 12:19 boot, Xid 79 under ollama load

Boot started with all three mitigations active. Kernel confirms `PCIe ASPM is disabled` and ollama begins serving requests immediately. The kernel log shows zero NVIDIA messages between driver load (12:19:23) and the failure (12:24:38) — i.e., no GSP warning, no AER, no degraded link — and then the GPU vanishes mid-request:

```

May 18 12:24:37 pop-os ollama[2149]: [GIN] 2026/05/18 - 12:24:37 | 200 | 240ms | POST “/api/chat”

May 18 12:24:38 pop-os ollama[2149]: [GIN] 2026/05/18 - 12:24:38 | 200 | 324ms | POST “/api/chat”

May 18 12:24:38 pop-os kernel: NVRM: GPU at PCI:0000:01:00: GPU-4ca99914-…

May 18 12:24:38 pop-os kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.

May 18 12:24:38 pop-os kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.

May 18 12:24:38 pop-os kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.

May 18 12:24:38 pop-os kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!

… (~3900 rpcSendMessage failed status 0x0000000f errors during teardown) …

May 18 12:24:44 pop-os kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress

May 18 12:24:44 pop-os kernel: fbcon: Taking over console

```

Full log: `~/nvidia-crash-2026-05-18-b-2.log`

## What I would like to understand

1. Is this a known Blackwell + 580.x open-kernel-module GSP/firmware issue, and is there an action (newer driver, firmware update, config) that has resolved it for others?

2. Is the **`nvidia-smi`-triggered Xid 79** observed on this hardware a known stale-RM-state issue that persistence mode should resolve? (It did not in our tests.)

3. Are there diagnostics — beyond `nvidia-bug-report.sh` — that would help distinguish a hardware issue (PCIe signal integrity, PSU transient response) from a firmware/driver issue? Specifically: a way to capture PCIe AER state at the moment the link drops, since none is currently logged before Xid 79.

## Attachments

- `nvidia-bug-report.log.gz` (output of `sudo nvidia-bug-report.sh` from current boot).

- `nvidia-crash-2026-05-18-b-1.log.gz` — Xid 119 → 110 → 79 cascade under ollama.

- `nvidia-crash-2026-05-18-b-1-silent.log.gz` — silent idle hang.

- `nvidia-crash-2026-05-18-b-2.log.gz` — Xid 79 under ollama with all three mitigations active.

## Reproduction notes for triage

- Idle silent hang: leave machine idle on GNOME/X11 desktop; typically hits within ~25–55 min. No `nvidia-smi` poking required.

- Xid 79 (polling): run `watch -n1 nvidia-smi`; reproduces in tens of minutes on a fresh boot.

- Xid 79 (load): run `ollama serve` and a tight loop of `/api/chat` requests; reproduced in 5 min on 2026-05-18 12:19 boot.

- Xid 119 cascade: same trigger as above; appears to be the “slow death” variant where GSP times out before the link drops.

nvidia-crash-2026-05-18-b-2.log.gz (27.2 KB)

nvidia-crash-2026-05-18-b-1-silent.log.gz (22.7 KB)

nvidia-crash-2026-05-18-b-1.log.gz (33.4 KB)

nvidia-bug-report.log.gz (498.6 KB)