I am experiencing recurring hard system locks when running large MoE models with llama.cpp on my RTX PRO 4000 Blackwell cards. The system becomes completely unresponsive with the following symptoms:
- Monitor goes completely blank with no error displayed
- Keyboard becomes unresponsive (Caps Lock does not respond)
- Front panel power button does nothing
- The only way to recover is a full physical power cycle (unplug PSU)
This happens even when running on a single GPU.
System Configuration:
- GPUs: 2× NVIDIA RTX PRO 4000 Blackwell (10de:2c34)
- Driver: 580.126.09 (open kernel modules); also tested 595.58.03
- Platform: Proxmox VE 9.1 (Debian 13 trixie)
- CPU: AMD Ryzen (X470 chipset)
- RAM: 128 GB
- Workload: llama.cpp server with NVIDIA Nemotron-3-Super-120B-A12B-Q4_K_M
What I have tested:
- Both driver versions 595.58.03 and 580.126.09
- Very conservative settings (single GPU, only 10 GPU layers, high CPU offload, small context, low batch size)
- Hardware validation: gpu-burn runs clean for extended periods with no errors, PCIe negotiates properly under load, all voltages and temperatures normal
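For anyone trying to reproduce, the conservative configuration above corresponds roughly to an invocation like the one this small Python helper builds. This is only a sketch: the 10 GPU layers figure is from my testing, while the model path, context size, batch size, and the helper name `conservative_server_cmd` are illustrative placeholders, not the exact values I ran.

```python
import shlex


def conservative_server_cmd(model_path, gpu_layers=10, ctx_size=2048,
                            batch_size=64, main_gpu=0):
    """Build a llama-server argument list for a single-GPU, low-load run.

    Only gpu_layers=10 reflects the report; ctx_size and batch_size are
    assumed placeholder values standing in for "small" and "low".
    """
    return [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", str(gpu_layers),  # remaining layers stay on the CPU
        "--split-mode", "none",             # do not spread the model across GPUs
        "--main-gpu", str(main_gpu),        # the one GPU that is used
        "--ctx-size", str(ctx_size),
        "--batch-size", str(batch_size),
    ]


if __name__ == "__main__":
    # Hypothetical model path; substitute the real GGUF file.
    print(shlex.join(conservative_server_cmd("/models/model.gguf")))
```

Printing the command rather than launching it makes it easy to paste into a shell and adjust one parameter at a time between lockups.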
Logs / Diagnostics:
Unfortunately I am unable to provide debug logs or nvidia-bug-report.sh output because the system locks up so quickly and completely that I cannot capture any data before it dies. IPMI SEL shows no power or thermal events, and dmesg/journalctl from the previous boot contain no relevant errors.
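One thing that might still capture the last GPU state before a lock is a small poller that appends `nvidia-smi` readings to a file and forces each write to disk. The following is an untested sketch under assumptions: the chosen query fields, the file name `gpu_trace.log`, the polling interval, and the function names are all my own placeholders, and a sufficiently abrupt lock could still lose the final sample.

```python
import os
import subprocess
import time

# Standard nvidia-smi --query-gpu properties; which ones to log is an assumption.
QUERY = "timestamp,index,pstate,temperature.gpu,power.draw,clocks.sm,memory.used"


def sample_gpu(cmd=None):
    """Return one CSV snapshot from nvidia-smi, or an error string on failure."""
    cmd = cmd or ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=5)
        return out.stdout.strip() or out.stderr.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"sample failed: {exc!r}"


def append_durable(path, text):
    """Append a line and fsync it so it is on disk before the next sample."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(text + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def run(path="gpu_trace.log", interval=0.5, iterations=None, sampler=sample_gpu):
    """Poll until interrupted, or for a fixed number of iterations if given."""
    count = 0
    while iterations is None or count < iterations:
        append_durable(path, sampler())
        time.sleep(interval)
        count += 1


if __name__ == "__main__":
    run()
```

Started before the llama.cpp server, the last lines of the trace file after the power cycle would at least show the power state, temperature, and memory use shortly before the hang.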
Grok suggests that this could be related to GSP firmware, but I have no idea how reliable that suggestion is.
I would appreciate any guidance or a firmware/driver fix for this issue.
