RTX PRO 4000 Blackwell - Hard system lock / full chip reset during LLM inference

I am experiencing recurring hard system locks when running large MoE models with llama.cpp on my RTX PRO 4000 Blackwell cards. The system becomes completely unresponsive with the following symptoms:

  • Monitor goes completely blank with no error displayed

  • Keyboard becomes unresponsive (Caps Lock does not respond)

  • Front panel power button does nothing

  • The only way to recover is a full physical power cycle (unplug PSU)

This happens even when running on a single GPU.

System Configuration:

  • GPUs: 2× NVIDIA RTX PRO 4000 Blackwell (10de:2c34)

  • Driver: 580.126.09 (open kernel modules) — also tested 595.58.03

  • Platform: Proxmox VE 9.1 (Debian 13 trixie)

  • CPU: AMD Ryzen (X470 chipset)

  • RAM: 128 GB

  • Workload: llama.cpp server with NVIDIA Nemotron-3-Super-120B-A12B-Q4_K_M

What I have tested:

  • Both driver versions 595.58.03 and 580.126.09

  • Very conservative settings (single GPU, only 10 GPU layers, high CPU offload, small context, low batch size)

  • Hardware validation: gpu-burn runs clean for extended periods with no errors, PCIe negotiates properly under load, all voltages and temperatures normal
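For anyone who wants to repeat the PCIe link check mentioned above, here is a minimal sketch in Python. It assumes the `LnkCap:` / `LnkSta:` line format that `lspci -vv` prints. The bus address `01:00.0` and the function names are hypothetical, not something from the original testing.

```python
import re
import subprocess

# Matches the speed and width fields on "LnkCap:" / "LnkSta:" lines.
# The colon directly after Cap/Sta keeps LnkCap2 / LnkSta2 lines out.
_LINK_RE = re.compile(r"Lnk(Cap|Sta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_link(lspci_text):
    """Return {'Cap': (speed_gts, width), 'Sta': (speed_gts, width)}."""
    result = {}
    for line in lspci_text.splitlines():
        m = _LINK_RE.search(line)
        if m:
            kind, speed, width = m.groups()
            # Keep only the first Cap and first Sta line seen.
            result.setdefault(kind, (float(speed), int(width)))
    return result


def link_is_degraded(lspci_text):
    """True if the negotiated link is slower or narrower than the capability."""
    link = parse_link(lspci_text)
    if "Cap" not in link or "Sta" not in link:
        raise ValueError("no LnkCap/LnkSta lines found")
    cap_speed, cap_width = link["Cap"]
    sta_speed, sta_width = link["Sta"]
    return sta_speed < cap_speed or sta_width < cap_width


def read_lspci(bus_addr="01:00.0"):
    """Run lspci for one device (hypothetical address; usually needs root)."""
    out = subprocess.run(
        ["lspci", "-vv", "-s", bus_addr],
        capture_output=True, text=True, check=True,
    )
    return out.stdout
```

Calling `link_is_degraded(read_lspci("01:00.0"))` while the GPU is under load gives a quick yes/no. Two caveats: GPUs commonly drop link speed at idle, so an idle reading can look degraded, and a card whose capability exceeds what the slot offers will always report a lower status than its own `LnkCap`.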

Logs / Diagnostics:
Unfortunately I am unable to provide debug logs or nvidia-bug-report.sh output because the system locks up so quickly and completely that I cannot capture any data before it dies. IPMI SEL shows no power or thermal events, and dmesg/journalctl from the previous boot contain no relevant errors.
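One possible way to get at least some data out of a lockup this abrupt is to sample the GPU state continuously and force every sample to disk, so that the last lines written before the hang survive the power cycle. The following is only a sketch under assumptions: the log path, the sampling interval, and the function names are made up for illustration, and it may well capture nothing useful if the lock happens faster than one sample.

```python
import os
import subprocess
import time

# Standard nvidia-smi --query-gpu fields; adjust to taste.
QUERY = "timestamp,temperature.gpu,power.draw,pstate,clocks.current.sm"


def append_durable(path, line):
    """Append one line and fsync it so it is on disk before we continue."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()
        os.fsync(f.fileno())


def sample_gpu():
    """Return one line of GPU state per call, joining multiple GPUs with ' | '."""
    try:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        # A hanging nvidia-smi is itself worth recording.
        return f"{time.time():.3f} nvidia-smi timed out"
    return out.stdout.strip().replace("\n", " | ")


def run(path="gpu_trace.log", interval_s=1.0):
    """Hypothetical main loop: sample until the machine dies or Ctrl+C."""
    while True:
        append_durable(path, sample_gpu())
        time.sleep(interval_s)
```

Start `run()` in a separate terminal before launching the llama.cpp server, then read the tail of the log after the power cycle. Note that `fsync` only asks the kernel and drive to persist the data; a drive with a volatile write cache can still lose the final lines.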

Grok suggests that this could be related to GSP firmware but I have no idea how reliable that is.

I would appreciate any guidance or a firmware/driver fix for this issue.

I tried to submit a ticket for this and was told by the techs that my only support channel was this forum and that the mods here would help me. Is that incorrect? Should I be reaching out in some other way?

I’m sorry, but there’s no other way to put this: “post on the forum and hope some stranger is kind enough to help you” is the level of support I expect when buying used mining cards on eBay, not when dropping half a used car’s worth on new products from an authorized dealer.

Sorry to say this, but I’m not sure what else you expected when buying Nvidia hardware…

I was hoping they would at least have the courtesy to patronize me a little bit. Be there when I wake up before telling me you’re going out for milk never to be seen again.

Just in case this issue ever comes up for anybody else: I was able to resolve it. My best guess is that it was an issue with the BIOS on my motherboard. I had upgraded to BIOS version 4.29A to get Resizable BAR (ReBAR) support, and I suspect that a very new BIOS for a very old board didn’t get full quality control from ASRock. After I upgraded to an X570 board (to get PCIe 4.0), the problem disappeared.

I’ll grant you that this wasn’t Nvidia’s fault, but we had no way to know that a week ago, and it easily could have been. I’m still upset at being told to go figure it out for myself when dealing with a brand-new product.

Glad to hear you’ve resolved it! :)

Source: NVLink vs PCIe - Blackwell GPU Wiki