610 release feedback & discussion


SUMMARY

On nvidia-open 610.43.02, several Direct3D12 / Unreal Engine 5 games run through
Proton (vkd3d-proton) exhibit a periodic ~5.2-second full-application freeze that
recovers on its own, with NO Xid and NO message logged. Kernel-stack sampling
during the freeze shows the vkd3d-proton submission thread blocked the entire time
in drm_syncobj_array_wait_timeout on the NVIDIA render node — i.e. waiting on a
GPU completion fence that is not signalled for ~5 s. The freeze duration is a tight
constant across 18 captures (5062–5487 ms), which matches the 5200 ms GSP firmware
heartbeat timeout reported in issue #1080. This appears to be a recoverable,
light-load sibling of the terminal Blackwell GSP heartbeat crash (#1080): same GPU
family and fence/completion path, but it recovers at the heartbeat boundary instead
of escalating to Xid 8/109 + GPU lockup.


ENVIRONMENT

GPU : NVIDIA GeForce RTX 5080 (Blackwell, GB203)
Driver : nvidia-open 610.43.02 (open kernel modules)
Kernel : 7.1.1-cachyos (x86_64); also reproduced on 7.0.12-cachyos
Distro : CachyOS (Arch-based)
Desktop : KDE Plasma 6.7.1 / KWin, native Wayland session
Display : 3440x1440 @ 175 Hz, HDR + VRR, single display (DP)
CPU : Intel Core i9-13900K
Proton : proton-cachyos (Steam Linux Runtime), vkd3d-proton 3.1.0
Affected games : Tekken 8, Split Fiction, LEGO Batman: Legacy of the Dark
Knight (all Unreal Engine 5, Direct3D12 via vkd3d-proton)


SYMPTOM / REPRODUCTION

  • Every few minutes of play the entire game freezes for ~5.0–5.5 s, then resumes
    with no crash, no recovery prompt, and no corruption.
  • Strongly favours scene/state transitions (e.g. gameplay ↔ in-engine cutscene).
  • Reproduces across three different UE5/D3D12 titles, so it is not game-specific.
  • During the freeze the GPU goes idle (load drops, core clock falls); it is not a
    busy GPU — the work appears complete but the completion is not delivered.

DIAGNOSTIC EVIDENCE (observed)

Per-thread kernel-stack sampling (reading /proc//stack at ~150 ms intervals,
cross-referenced to MangoHud frametime logs) during a confirmed 5251 ms freeze:

The vkd3d-proton submission thread (“vkd3d_queue”) was blocked for the FULL
duration of the freeze (14 consecutive samples, perfectly contiguous with the
MangoHud freeze window) in:

  drm_syncobj_array_wait_timeout
  drm_syncobj_timeline_wait_ioctl
  drm_ioctl
  __x64_sys_ioctl
  do_syscall_64

on fd → /dev/dri/renderD128 (the NVIDIA render node).

The rest of the engine’s threads were parked behind that one thread (D3D12 fence
events backed by Proton’s Win32-sync layer), which is why earlier analysis that
only looked at wait-channels misread this as a CPU-side sync wait — the actual
root is the unsatisfied DRM sync-object (GPU fence) wait.

Freeze-duration distribution (18 captures): 5062, 5153, 5155, 5228, 5248, 5251,
5260, 5317, 5318, 5357, 5372, 5381, 5383, 5401, 5409, 5467, 5484, 5487 ms.

Logs at every freeze: NO NVRM Xid, NO GSP message (no kgspIsHeartbeatTimedOut),
NO kernel fault, in journald this boot and across boots.


HYPOTHESIS (inferred — flagged as such)

The fixed ~5.2 s duration has no corresponding constant anywhere in Wine, Proton,
vkd3d, or UE5, but it matches the 5200 ms GSP firmware heartbeat timeout reported
in NVIDIA/open-gpu-kernel-modules issue #1080 (RTX 5090, GSP heartbeat timeout
→ Xid 109 under heavy Vulkan load). Proposed single mechanism, two severities:

  • A Blackwell GSP firmware stall in GPU-completion / fence signalling.
  • Under sustained heavy load (#1080, RTX 5090): the stall exceeds 5200 ms,
    the heartbeat is declared dead, and it escalates to Xid 8 → Xid 109 →
    GPU lockup (terminal, requires bus reset).
  • Under light load at a scene transition (this RTX 5080): the stall reaches the
    ~5200 ms heartbeat boundary, the GSP recovers WITHOUT escalating, the pending
    drm_syncobj completion is finally delivered, and the application resumes —
    recoverable, and nothing is logged.

This is a strong circumstantial match (5200 ms duration + same GPU family + same
GSP/fence-completion path), but NOT confirmed from this side — there is no GSP
message in this system’s logs to tie it directly (consistent with a sub-threshold
recovery, but not proof).


RULED OUT (each tested with primary evidence)

Confirmed independent of: the specific game (3 UE5 titles); Proton sync backend
(reproduces with ntsync and with fsync); vkd3d-proton descriptor model
(reproduces with descriptor_heap and with virtual_heaps); Bluetooth / controller;
storage / NVMe APST; CPU C-states; VSync / frame pacing; the Wayland compositor
(KDE Plasma 6.7.1). The only host-side factor that changes FREQUENCY (not
presence) is the CPU power profile (EPP=performance pins the clock and the freeze
becomes more frequent) — consistent with a timing-sensitive completion/wakeup.


REQUEST

Please investigate the Blackwell GSP firmware GPU-completion / fence-signalling
path (DRM sync-object signalling) under D3D12/Vulkan workloads at resource-
transition points. We believe this is the recoverable counterpart of the GSP
heartbeat failure in #1080, and a fix to the GSP heartbeat / fence-completion
handling on Blackwell should resolve both.

Attached: nvidia-bug-report.sh output, the kernel-stack capture log (the
drm_syncobj_array_wait_timeout evidence), and the MangoHud frametime CSV
correlating each freeze. Glad to run any driver build, registry-key (e.g.
RmEngineContextSwitchTimeoutUs), or GSP logging request that would help localise
the missing completion.

nvidia-bug-report.log.gz (492.6 KB)

freeze-autopsy.txt (14.0 KB)

freeze-probe4-kstack.log (2.5 MB)