# Bug report — RTX 5070 Ti intermittent crashes (Xid 79, Xid 119 cascade, silent hangs)
## TL;DR
System76 Thelio Mira (Ryzen 9 9950X + RTX 5070 Ti, Blackwell GB203) on Pop!_OS 22.04 with NVIDIA 580.x open kernel module. **Intermittent crashes since the machine was new (~9–10 months).** Three distinct failure signatures observed; all reproducible across driver bumps 580.82 → 580.119 → 580.126 → 580.159 and across kernels 6.12 → 6.16 → 6.17. Power-management workarounds (DPM=0, persistence mode, `pcie_aspm=off`) have been tried in combination and **do not prevent the workload-triggered fall-off-the-bus mode**.
Detailed kernel logs are available only from **2026-05-17 onward** due to log rotation; earlier crashes are observed by the user but not captured.
## System
| | |
|—|—|
| Vendor / model | System76 Thelio Mira (r4) |
| BIOS | American Megatrends 4.10.SP01, 2026-02-23 |
| CPU | AMD Ryzen 9 9950X (16-core) |
| GPU | NVIDIA GeForce RTX 5070 Ti (Blackwell, GB203-A) — PCI ID `10de:2c05`, subsystem `1458:41a8` (Gigabyte) |
| VBIOS | 98.03.3B.80.C6 |
| GPU memory | 16 GB |
| Display path | **AMD iGPU drives the display.** NVIDIA card is idle except for compute. |
| OS | Pop!_OS 22.04 |
| Kernel (current) | 6.17.9-76061709-generic (System76 build, kernel-config `202511241048`) |
| NVIDIA driver (current) | 580.159.03 (open kernel module) |
| OS install date | 2025-08-09 (rootfs / machine-id birth) |
## Driver / kernel timeline (apt history)
Drivers used on this machine — all 580-series so far have been the **open** kernel module:
| Date | NVIDIA driver | Kernel |
|—|—|—|
| 2025-07-02 | nvidia-driver-570-open 570.153.02 | linux-system76 6.12.10 |
| 2025-08-09 | (same driver) | 6.12.10 (rebuild) |
| 2025-09-10 | (same driver) | **6.16.3** (major bump) |
| 2025-09-22 / 09-25 | 570.153.02 (rebuilds) | 6.16.3 (rebuilds) |
| **2025-10-08** | **transition to 580.82.09-open** | 6.16.3 |
| 2025-10-10 / 10-11 | 580.82.09 (rebuilds) | 6.16.3 |
| 2025-10-21 | — | 6.16.3 (rebuild) |
| 2025-11-11 | — | **6.17.4** |
| 2026-01-07 | — | **6.17.9** |
| 2026-01-15 | **580.119.02** | 6.17.9 |
| 2026-03-19 | **580.126.18** | 6.17.9 |
| 2026-05-04 | **580.159.03** | 6.17.9 (rebuild) |
| 2026-05-17 | 580.159.03 | 6.17.9 (rebuild) |
Crashes have been intermittent across **every** combination listed above per the user’s recollection. The recent (May 17–18) cluster captured in logs occurred on 580.159.03 + 6.17.9.
## Failure signatures
Three distinct kernel-log signatures observed:
### A. Silent hang at idle
- No Xid logged.
- Journal truncates mid-stream; no errors, no panic, no NVRM message.
- Recovery: hard power-cycle required.
- Triggered: with the machine sitting idle on the desktop. The display is on the AMD iGPU, so the NVIDIA card is genuinely idle when this hits.
- Captured: 2026-05-17 16:56 boot (52 min, the original case); 2026-05-18 11:22 boot (23 min, **with `NVreg_DynamicPowerManagement=0` already in effect**).
- Log: `~/nvidia-crash-2026-05-18-b-1-silent.log`
### B. Xid 79 — “GPU has fallen off the bus”
- Two sub-triggers observed:
1. **`nvidia-smi` polling** while otherwise idle (2026-05-17 16:55, 2026-05-18 09:47 boot).
2. **Sustained `ollama` LLM-inference load** (~4–5 `/api/chat` requests per second, ~5 min into boot, no preceding GSP timeout — 2026-05-18 12:19 boot, **with DPM=0 + persistence mode + `pcie_aspm=off` all confirmed active**).
- After Xid 79: cascade of `_threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed`, then ~3900 `rpcSendMessage failed with status 0x0000000f` GSP RPC errors during teardown.
- Recovery: hard power-cycle required.
- Log: `~/nvidia-crash-2026-05-18-b-2.log`
### C. Xid 119 → 110 → 79 cascade under load
- Xid 119 (GSP RPC timeout) → Xid 154 (recovery action / reset required) → `SEC_FAULT: _CLOCK_GPC_FMON` → `SEC_FAULT: _DEVICE_LOCKDOWN` → Xid 79.
- `uvm encountered global fatal error 0x60, requiring os reboot to recover`.
- Triggered: `ollama` inference workload (2026-05-18 11:02 boot, 19 min uptime).
- Captured: `~/nvidia-crash-2026-05-18-b-1.log`
## Boot history (captured period only)
The system has had intermittent crashes since the machine was new (the user reports issues going back ~9–10 months). Kernel logs prior to 2026-03-22 are gone due to rotation; `journalctl --list-boots` only retains entries from 2026-03-27 onward. The boots below are the period for which detailed evidence exists.
| Boot started | Duration | Outcome |
|—|—|—|
| 2026-05-04 11:21 | 41 min | clean shutdown |
| 2026-05-17 16:27 | 27 min | Xid 79 (triggered by `nvidia-smi`) |
| 2026-05-17 16:56 | 52 min | **silent hang** at idle |
| 2026-05-17 18:33 | 15 h | Xid 79 at 04:05 then 5+ h of modeset errors |
| 2026-05-18 09:45 | 1 min | clean reboot |
| 2026-05-18 09:47 | 1 h 13 m | Xid 119 GSP timeout + kernel stack trace |
| 2026-05-18 11:02 | 19 min | Xid 119 → 110 → 79 cascade under ollama — first crash with DPM=0 in effect |
| 2026-05-18 11:22 | 23 min | **silent hang at idle** with DPM=0 in effect |
| 2026-05-18 11:47 | 18 min | clean reboot (persistence-mode override applied) |
| 2026-05-18 12:07 | 12 min | clean reboot (ASPM config staged) |
| 2026-05-18 12:19 | 5 min | **Xid 79 under ollama load** — first all-fixes boot (DPM=0 + persistence + `pcie_aspm=off`); ASPM disabled at the kernel level did not prevent fall-off-the-bus |
| 2026-05-18 12:25 | 1 min | clean reboot |
| 2026-05-18 12:27 | current | second all-fixes boot |
## Mitigations tried (and current results)
| Mitigation | Applied | Verified active | Effect on workload-triggered Xid 79 | Effect on silent idle hang |
|—|—|—|—|—|
| `NVreg_DynamicPowerManagement=0` | 2026-05-18 | `DynamicPowerManagement: 0` in `/proc/driver/nvidia/params` | did NOT prevent | did NOT prevent (recurred 11:22 boot) |
| Persistence mode (drop-in override to remove `–no-persistence-mode` from `nvidia-persistenced.service`) | 2026-05-18 12:07 | `nvidia-smi -q` reports `Persistence Mode: Enabled` | did NOT prevent | not retested |
| `pcie_aspm=off` kernel cmdline | 2026-05-18 12:19 boot | kernel logs `PCIe ASPM is disabled`; `/proc/cmdline` confirms | did NOT prevent (crashed 5 min in on first such boot) | not retested |
Power-management explanations are therefore **exhausted** for the workload-triggered fall-off-the-bus mode. The silent-idle-hang mode has not yet been retested with all three mitigations active *while idle*, because the 12:19 boot died under load before reaching idle.
## Representative log excerpt — 2026-05-18 12:19 boot, Xid 79 under ollama load
Boot started with all three mitigations active. Kernel confirms `PCIe ASPM is disabled` and ollama begins serving requests immediately. The kernel log shows zero NVIDIA messages between driver load (12:19:23) and the failure (12:24:38) — i.e., no GSP warning, no AER, no degraded link — and then the GPU vanishes mid-request:
```
May 18 12:24:37 pop-os ollama[2149]: [GIN] 2026/05/18 - 12:24:37 | 200 | 240ms | POST “/api/chat”
May 18 12:24:38 pop-os ollama[2149]: [GIN] 2026/05/18 - 12:24:38 | 200 | 324ms | POST “/api/chat”
May 18 12:24:38 pop-os kernel: NVRM: GPU at PCI:0000:01:00: GPU-4ca99914-…
May 18 12:24:38 pop-os kernel: NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
May 18 12:24:38 pop-os kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
May 18 12:24:38 pop-os kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all channels for critical error 79.
May 18 12:24:38 pop-os kernel: NVRM: _threadNodeCheckTimeout: API_GPU_ATTACHED_SANITY_CHECK failed!
… (~3900 rpcSendMessage failed status 0x0000000f errors during teardown) …
May 18 12:24:44 pop-os kernel: nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress
May 18 12:24:44 pop-os kernel: fbcon: Taking over console
```
Full log: `~/nvidia-crash-2026-05-18-b-2.log`
## What I would like to understand
1. Is this a known Blackwell + 580.x open-kernel-module GSP/firmware issue, and is there an action (newer driver, firmware update, config) that has resolved it for others?
2. Is the **`nvidia-smi`-triggered Xid 79** observed on this hardware a known stale-RM-state issue that persistence mode should resolve? (It did not in our tests.)
3. Are there diagnostics — beyond `nvidia-bug-report.sh` — that would help distinguish a hardware issue (PCIe signal integrity, PSU transient response) from a firmware/driver issue? Specifically: a way to capture PCIe AER state at the moment the link drops, since none is currently logged before Xid 79.
## Attachments
- `nvidia-bug-report.log.gz` (output of `sudo nvidia-bug-report.sh` from current boot).
- `nvidia-crash-2026-05-18-b-1.log.gz` — Xid 119 → 110 → 79 cascade under ollama.
- `nvidia-crash-2026-05-18-b-1-silent.log.gz` — silent idle hang.
- `nvidia-crash-2026-05-18-b-2.log.gz` — Xid 79 under ollama with all three mitigations active.
## Reproduction notes for triage
- Idle silent hang: leave machine idle on GNOME/X11 desktop; typically hits within ~25–55 min. No `nvidia-smi` poking required.
- Xid 79 (polling): run `watch -n1 nvidia-smi`; reproduces in tens of minutes on a fresh boot.
- Xid 79 (load): run `ollama serve` and a tight loop of `/api/chat` requests; reproduced in 5 min on 2026-05-18 12:19 boot.
- Xid 119 cascade: same trigger as above; appears to be the “slow death” variant where GSP times out before the link drops.
nvidia-crash-2026-05-18-b-2.log.gz (27.2 KB)
nvidia-crash-2026-05-18-b-1-silent.log.gz (22.7 KB)
nvidia-crash-2026-05-18-b-1.log.gz (33.4 KB)
nvidia-bug-report.log.gz (498.6 KB)