Meaning, whole system locks, then reboots itself? I’ve done it a few times now.
The latest is with some simple SFT…
(summarized via AI)
Linux 6.14.0-1015-nvidia
Workload Description
Task: Supervised Fine-Tuning (SFT) of Qwen3-4B LLM using LLaMA Factory
Training Framework: LLaMA Factory (llamafactory-cli) with PyTorch
Precision: FP16
LoRA Configuration: rank=16, alpha=32, target=q_proj,v_proj
Batch Size: per_device=2, gradient_accumulation=16 (effective batch=32)
Context Length: 1024 tokens
Command: llamafactory-cli train configs/sft_qwen3_4b.yaml
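For reference, here is a sketch of what a LLaMA Factory config matching these settings could look like (key names follow LLaMA Factory's YAML conventions; the dataset name and output path are placeholders and may differ from the actual configs/sft_qwen3_4b.yaml):

```yaml
model_name_or_path: Qwen/Qwen3-4B
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
lora_target: q_proj,v_proj
dataset: my_sft_dataset          # placeholder
template: qwen
cutoff_len: 1024
per_device_train_batch_size: 2
gradient_accumulation_steps: 16  # effective batch = 2 x 16 = 32
fp16: true
output_dir: saves/qwen3-4b-sft   # placeholder
```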
Timeline
Time Event
Jan 10, ~09:38 System boot - GPU DOE mailbox errors logged (see below)
Jan 10, ~10:00 DAPT training started (~140 hours) - completed successfully
Jan 11, ~10:00 SFT training started
Jan 11, ~12:00 Training reached step 4550 of 5520 (~82% complete, ~26 hours cumulative GPU load)
Jan 11, 12:18 System crash - immediate reboot, all SSH sessions and screen sessions lost
Jan 11, 12:18 System rebooted with BERT hardware error recorded
Error Evidence
BERT (Boot Error Record Table) Entry
[ 1.433853] BERT: [Hardware Error]: Skipped 1 error records
[ 1.433855] BERT: Total records found: 1
The kernel could not decode the error record format (likely NVIDIA-proprietary).
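If you want to capture the raw record for a support ticket, something like this should work (paths follow the standard Linux ACPI sysfs layout; whether the `data/BERT` node exists depends on the firmware):

```shell
# Kernel's view of the boot error record:
sudo dmesg | grep -iE 'BERT|Hardware Error'

# Raw BERT table header, and (if the firmware exposes it) the error
# record region the table points to:
sudo hexdump -C /sys/firmware/acpi/tables/BERT
sudo hexdump -C /sys/firmware/acpi/tables/data/BERT 2>/dev/null | head
```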
Ran DAPT training on Qwen3-4B using local Python venv - ran for ~140 hours, completed
Started SFT training immediately after (same local venv setup)
~2 hours into SFT (step 4550 of 5520), system hard locked and rebooted
Tried resuming from checkpoint - crashed again within minutes
Workaround that fixed it:
Switched to using nvcr.io/nvidia/pytorch:25.11-py3 container instead of local venv. Training completed successfully AND was about 15% faster.
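For anyone else hitting this, a sketch of the container invocation I mean (assumes Docker plus the NVIDIA Container Toolkit are installed; the mounts and the pip step are illustrative, adapt them to your setup):

```shell
IMAGE="nvcr.io/nvidia/pytorch:25.11-py3"

# --ipc=host and the ulimits follow NVIDIA's usual advice for the
# PyTorch containers (shared memory for dataloader workers, etc.).
docker run --rm -it --gpus all \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$PWD":/workspace -w /workspace \
  "$IMAGE" \
  bash -c "pip install llamafactory && llamafactory-cli train configs/sft_qwen3_4b.yaml"
```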
What I think happened:
The pip-installed PyTorch probably doesn’t have the same GPU optimizations as your container. Maybe it’s also doing something that triggers the PCIe issues I see in logs:
pci 000f:01:00.0: DOE: [2c8] ABORT timed out
pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
These errors appear at every boot, even before training starts.
Questions:
Is using the container the recommended approach for GB10?
Should pip-installed PyTorch work on GB10, or is there something special about the hardware that requires the container?
Any idea what the DOE mailbox errors mean? They seem harmless but make me nervous.
I get hard crashes all the time. It’s a bit of a pain if I’m not near the machine to unplug and replug it.
For me it happens if I have a large memory consuming process (for example, vllm with a model loaded) and I start compiling (for example, testing improvements to flashinfer that require compiling).
Unfortunately I sometimes forget to stop vLLM before starting the compile.
I was blaming my own code (a memory leak or something), so I was monitoring CPU, GPU, RAM, swap, disk space, and temperatures.
When it crashed, the machine wasn't hot: only about 48 GB of RAM in use, no swapping, more than half the disk free, and light CPU load. GPU utilization was high, but that's expected during training, and GPU temperature was normal.
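Since the interesting numbers vanish with the crash, I'd log them to disk rather than watch them live; a minimal sketch (the nvidia-smi query fields are standard, though memory reporting on GB10's unified memory may be approximate):

```shell
# Append one snapshot per minute so the pre-crash state survives a reboot.
while true; do
  {
    date
    nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu \
               --format=csv,noheader
    free -m | awk '/^Mem|^Swap/'
  } >> "$HOME/crash_monitor.log"
  sleep 60
done
```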
If those crashes happen when you hit out-of-memory conditions, others have found that disabling swap is a temporary workaround. Not ideal, but I'm not sure there's a better way to handle the OOM hard locks right now.
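Concretely, the workaround looks like this (the sed line is a sketch; double-check your /etc/fstab and any systemd swap units before making it permanent):

```shell
# Disable swap until the next reboot:
sudo swapoff -a

# Confirm (the Swap row should show 0):
free -h

# Make it persistent by commenting out swap entries in /etc/fstab, e.g.:
sudo sed -i.bak '/\sswap\s/s/^/#/' /etc/fstab

# Ubuntu may also activate swap via a systemd unit; check with:
systemctl list-units --type=swap --all
```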
I tried tuning the number of build processes down to stop it from happening in the general case (compiling cubins can use a huge amount of memory on its own).
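In case it helps anyone: for torch.utils.cpp_extension-style builds (flashinfer included), the MAX_JOBS environment variable caps the number of parallel compile jobs; the value 4 here is just an example, tune it to your RAM:

```shell
# Each parallel nvcc job can use several GB while compiling cubins;
# capping the job count trades build time for lower peak memory.
export MAX_JOBS=4
pip install -v --no-build-isolation -e .   # run from the flashinfer checkout
```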
I think I should get one of those fancy remote KVM units you see on YouTube all the time so I can remote power cycle.
I’m hopeful that NVIDIA is still paying Canonical to improve Ubuntu for the Spark and that we’ll see better edge-case handling in future updates.
If you think you’re hitting OOM hard locks, I’d definitely try it. Something seems very wrong with how swap is used on GB10. When I disabled swap and forced OOM situations, it would fairly gracefully kill processes as best it could and I couldn’t get it to hard lock. With swap enabled, it’s a hard lock requiring holding the power button every time.