Anyone have hard crashes on the DGX?

Has anyone been able to crash their DGX10?

Meaning, whole system locks, then reboots itself? I’ve done it a few times now.

The latest is with some simple SFT…

(summarized via AI)

Linux 6.14.0-1015-nvidia

Workload Description
Task: Supervised Fine-Tuning (SFT) of Qwen3-4B LLM using LLaMA Factory
Training Framework: LLaMA Factory (llamafactory-cli) with PyTorch
Precision: FP16
LoRA Configuration: rank=16, alpha=32, target=q_proj,v_proj
Batch Size: per_device=2, gradient_accumulation=16 (effective batch=32)
Context Length: 1024 tokens
Command: llamafactory-cli train configs/sft_qwen3_4b.yaml
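As a quick sanity check, the batch and progress figures above are internally consistent (a small sketch; `num_gpus = 1` is my assumption for a single GB10, not something stated in the logs):

```python
# Consistency check of the batch/step numbers reported above.
per_device_batch = 2
grad_accum = 16
num_gpus = 1  # assumption: single GB10, not stated in the logs

effective_batch = per_device_batch * grad_accum * num_gpus
print(effective_batch)  # 32, matching "effective batch=32"

# Step 4550 of 5520 at the time of the crash:
print(f"{4550 / 5520:.0%}")  # 82%, matching "~82% complete"
```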
Timeline

  - Jan 10, ~09:38: System boot; GPU DOE mailbox errors logged (see below)
  - Jan 10, ~10:00: DAPT training started (~140 hours); completed successfully
  - Jan 11, ~10:00: SFT training started
  - Jan 11, ~12:00: Training reached step 4550 of 5520 (~82% complete, ~26 hours cumulative GPU load)
  - Jan 11, 12:18: System crash; immediate reboot, all SSH sessions and screen sessions lost
  - Jan 11, 12:18: System rebooted with BERT hardware error recorded

Error Evidence

  1. BERT (Boot Error Record Table) Entry
    [ 1.433853] BERT: [Hardware Error]: Skipped 1 error records
    [ 1.433855] BERT: Total records found: 1
    The kernel could not decode the error record format (likely NVIDIA-proprietary).

BERT raw header:
00000000: 4245 5254 3000 0000 01df 4d54 4b49 4400 BERT0…MTKID.
00000010: 4d54 4b54 4142 4c45 0100 0000 4352 4541 MTKTABLE…CREA

  2. GPU PCIe DOE Errors (Present Since Boot)
    Jan 10 09:38:52 gx10-01 kernel: pci 000f:01:00.0: DOE: [2c8] ABORT timed out
    Jan 10 09:38:52 gx10-01 kernel: pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5
    Jan 10 09:38:52 gx10-01 kernel: pci 000f:01:00.0: DOE: [2c8] failed to create mailbox: -5

  3. GPU AER Status (Post-Reboot - Error Still Set)
    lspci -vvv -s 000f:01:00.0 output:
    DOESta: Busy+ IntSta+ Error+ ObjectReady-
    Error+ = Error flag is SET
    Busy+ = Interface stuck busy
    ObjectReady- = Not ready

  4. Platform Device Errors
    Jan 10 09:38:52 gx10-01 kernel: platform NVDA8800:00: failed to claim resource 0: [mem 0x05170000-0x051cffff]
    Jan 10 09:38:52 gx10-01 kernel: acpi NVDA8800:00: platform device creation failed: -16
    Jan 10 09:38:52 gx10-01 kernel: platform NVDA8900:00: failed to claim resource 0: [mem 0xc8000000-0xd7ffffff]
    Jan 10 09:38:52 gx10-01 kernel: acpi NVDA8900:00: platform device creation failed: -16
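To keep an eye on whether the DOE error flag ever clears after a reboot, the `DOESta:` line from `lspci -vvv` can be parsed into booleans (a small sketch; the flag names are taken verbatim from the lspci output above, where `+` means set and `-` means clear):

```python
def parse_lspci_flags(line: str) -> dict[str, bool]:
    """Parse an lspci status line like 'DOESta: Busy+ IntSta+ Error+ ObjectReady-'.

    lspci suffixes set flags with '+' and clear flags with '-'.
    """
    _, _, flags = line.partition(":")
    return {tok[:-1]: tok.endswith("+") for tok in flags.split()}

status = parse_lspci_flags("DOESta: Busy+ IntSta+ Error+ ObjectReady-")
print(status)  # {'Busy': True, 'IntSta': True, 'Error': True, 'ObjectReady': False}
```

Feeding this the output of `lspci -vvv -s 000f:01:00.0` after each boot would make it easy to log whether `Error+` persists.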

Can you give a detailed set of steps so we can attempt to reproduce?

Yup!

What happened:

  1. Ran DAPT training on Qwen3-4B using local Python venv - ran for ~140 hours, completed

  2. Started SFT training immediately after (same local venv setup)

  3. ~2 hours into SFT (step 4550 of 5520), system hard locked and rebooted

  4. Tried resuming from checkpoint - crashed again within minutes

Workaround that fixed it:

Switched to using nvcr.io/nvidia/pytorch:25.11-py3 container instead of local venv. Training completed successfully AND was about 15% faster.

What I think happened:

The pip-installed PyTorch probably doesn’t have the same GPU optimizations as your container. Maybe it’s also doing something that triggers the PCIe issues I see in logs:

pci 000f:01:00.0: DOE: [2c8] ABORT timed out

pci 000f:01:00.0: DOE: [2c8] failed to reset mailbox with abort command : -5

These errors appear at every boot, even before training starts.

Questions:

Is using the container the recommended approach for GB10?

Should pip-installed PyTorch work on GB10, or is there something special about the hardware that requires the container?

Any idea what the DOE mailbox errors mean? They seem harmless but make me nervous.

Happy to provide more details if needed.

I think these were the settings:

model_name_or_path: Qwen/Qwen3-4B-Instruct-2507
adapter_name_or_path: runs/dapt_qwen3_4b

stage: sft
template: alpaca
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target: q_proj,v_proj

num_train_epochs: 1.0
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1

fp16: true
cutoff_len: 1024
packing: true

save_strategy: steps
save_steps: 500

I get hard crashes all the time. It’s a bit of a pain if I’m not near the machine to unplug and replug it.

For me it happens if I have a large memory consuming process (for example, vllm with a model loaded) and I start compiling (for example, testing improvements to flashinfer that require compiling).

Unfortunately, I sometimes forget to stop vllm before starting the compile.


I was blaming myself. A memory leak or something. So I was monitoring cpu, gpu, ram, swap, hd space, temps.

When it crashed it wasn’t hot, only using like 48GB total, no swapping, more than 1/2 the HD empty, light CPU. GPU was high but that’s what happens when you train. GPU temp was not high.

If those are happening when hitting out of memory conditions, others have found that a temporary workaround seems to be disabling swap. Not ideal, but not sure there’s a better way to handle OOM hard locks right now?

I’ve not tried this approach. Maybe I should?

I tried tuning the number of build processes down to stop it from happening in the general case (compiling cubins can be a huge memory use on their own).
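The tuning above can be sketched as a small helper that picks a parallel build width from available memory (a sketch; the ~4 GB-per-job figure is my rough assumption for nvcc on large cubins, not a measured number, and `MAX_JOBS` is the env var PyTorch's extension builder honors):

```python
import os

def safe_build_jobs(mem_available_kb: int, gb_per_job: float = 4.0) -> int:
    """Pick a parallelism level so concurrent nvcc jobs fit in available RAM.

    gb_per_job is an assumed per-job peak; tune it for your build.
    """
    by_memory = int(mem_available_kb / (gb_per_job * 1024 * 1024))
    return max(1, min(by_memory, os.cpu_count() or 1))

def mem_available_kb() -> int:
    """Read MemAvailable (in kB) from /proc/meminfo (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found")

# Usage idea: print safe_build_jobs(mem_available_kb()) and pass it as
# MAX_JOBS=<n> to the flashinfer build so compiles can't eat all the RAM.
```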

I think I should get one of those fancy remote KVM units you see on YouTube all the time so I can remote power cycle.

I’m hopeful that NVIDIA is still paying Canonical to improve Ubuntu for the Spark and that we’ll see better edge-case handling in future updates.

If you think you’re hitting OOM hard locks, I’d definitely try it. Something seems very wrong with how swap is used on GB10. When I disabled swap and forced OOM situations, it would fairly gracefully kill processes as best it could and I couldn’t get it to hard lock. With swap enabled, it’s a hard lock requiring holding the power button every time.

It used to be that it would hang immediately after hitting the swap. In the past month it happens only when swap is >90% full.
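That >90% mark is easy to watch for. A small sketch that reads swap usage from /proc/meminfo (the 90% cutoff just mirrors the observation above; it's not an official threshold):

```python
def swap_used_fraction(meminfo: str) -> float:
    """Return swap usage as a fraction, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            fields[key] = int(rest.split()[0])  # values are in kB
    total, free = fields["SwapTotal"], fields["SwapFree"]
    return 0.0 if total == 0 else (total - free) / total

# Example with made-up numbers (on Linux, pass open("/proc/meminfo").read()):
sample = "SwapTotal:  16777216 kB\nSwapFree:   1000000 kB"
print(f"{swap_used_fraction(sample):.0%}")  # 94% -> already in the danger zone
```

Polling this once a minute and alerting (or pausing the job) above ~90% would at least give warning before the hard lock.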
