Your setup doesn't support bf16/gpu

On my DGX Spark, when trying to run:

docker run --gpus all --rm -it -v /home/rob/src/llm-training:/workspace -w /workspace nvcr.io/nvidia/pytorch:25.11-py3 bash -c "pip install llamafactory && llamafactory-cli train configs/dapt_qwen3_4b.yaml"

I get error:

File "/usr/local/lib/python3.12/dist-packages/llamafactory/hparams/training_args.py", line 105, in __post_init__
    BaseTrainingArguments.__post_init__(self)
File "/usr/local/lib/python3.12/dist-packages/transformers/training_args.py", line 1747, in __post_init__
    raise ValueError(error_message)
ValueError: Your setup doesn't support bf16/gpu.

But the whole point of the Spark is bf16, right?

Any help is appreciated. Thanks!

Hi robs1978,

I’m seeing similar behavior with llama-factory and other models. Is your dapt_qwen3_4b.yaml a custom configuration, or did you get it from the official llama-factory repository?

pip install llamafactory is installing a CPU-only build of PyTorch 2.9.1, which is what causes the error. Why do you need llamafactory?

Downloading torch-2.9.1-cp312-cp312-manylinux_2_28_aarch64.whl (104.1 MB)

Run this command to diagnose the error:

docker run --gpus all --rm -it nvcr.io/nvidia/pytorch:25.11-py3 bash -c "pip install llamafactory && nvidia-smi && python -c 'import torch; print(f\"Torch version: {torch.__version__}\"); print(f\"BF16 supported: {torch.cuda.is_bf16_supported()}\")'"

I got these results,

Fri Jan  9 15:48:54 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   44C    P8              5W / N/A   |          Not Supported |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Torch version: 2.9.1+cpu
BF16 supported: False

The torch included in the pytorch:25.11-py3 container will not have this issue; it supports bf16. Llamafactory is overwriting the torch installation.

Aha, this makes sense. Because I had such a hard time getting the GPU running without Docker, I wasn't thinking and didn't realize PyTorch was already installed in the NVIDIA container… I'll give it a try soon.

Here is what finally worked for me:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --rm -it \
  -e DISABLE_VERSION_CHECK=1 \
  -v /path/to/your/project:/workspace \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -w /workspace \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "
    pip install llamafactory --no-deps &&
    pip install torchaudio --no-deps &&
    pip install datasets transformers accelerate peft trl bitsandbytes \
      sentencepiece protobuf pyyaml packaging fire tyro omegaconf \
      gradio uvicorn sse-starlette matplotlib scipy &&
    llamafactory-cli train /workspace/your-config.yaml
  "

The difference is --no-deps on llamafactory and torchaudio. This prevents pip from pulling in its own torch packages and breaking what’s already in the container.
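If you want to verify that torch survived, a minimal sketch of the check (the "+cpu" local-version tag is taken from the pip output earlier in this thread; the container's CUDA build carries a "+cuNNN"-style tag instead):

```python
def looks_cpu_only(torch_version: str) -> bool:
    """Return True when a torch version string carries the '+cpu'
    local-version tag used by the CPU-only wheels on PyPI."""
    _, _, local = torch_version.partition("+")
    return local.startswith("cpu")

# The wheel pip pulled in earlier reported "2.9.1+cpu":
print(looks_cpu_only("2.9.1+cpu"))     # True  -> torch was overwritten
print(looks_cpu_only("2.10.0+cu130"))  # False -> container build intact
```

Inside the container you can feed it torch.__version__ directly.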
Also had to manually install the other dependencies since --no-deps skips them, and set DISABLE_VERSION_CHECK=1 because transformers 4.57.3 in the container is slightly newer than what LLaMA Factory’s version check allows (<=4.57.1). Works fine (so far).
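The version check itself amounts to a tuple comparison; a rough sketch of why 4.57.3 trips a <=4.57.1 cap (LLaMA Factory's actual check logic is more elaborate, this is just the idea):

```python
def version_tuple(v: str) -> tuple:
    """Parse a simple dotted version string like '4.57.3' into a
    comparable tuple. (Sketch only; real version parsers handle
    pre-releases, local tags, etc.)"""
    return tuple(int(part) for part in v.split("."))

# The container ships transformers 4.57.3, but the check caps at 4.57.1:
print(version_tuple("4.57.3") > version_tuple("4.57.1"))  # True -> check fails
```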
Hope this saves someone else a few hours.

I have not completed a full SFT with this method yet. But at least it’s up and running right now.

How long does the SFT take? Are you using NVFP4 weights and activations? That's the fastest on Spark.

Thanks again for the reply.

SFT duration: The SFT that crashed was at step 4550 of 5520 (~82% complete), roughly 2 hours in. It was training on ~165K Q&A pairs. Crashed when running in local venv, worked fine after switching to the NVIDIA container.

NVFP4: Not using it - I’m doing LoRA fine-tuning with FP16. My understanding is NVFP4 is mainly for inference, not training? Happy to be corrected if there’s a training workflow that uses it.

BF16: Interesting - I assumed GB10 supports BF16 (Blackwell arch), but I got that error when using pip-installed PyTorch outside the container. Inside the NVIDIA container, I stuck with FP16 since it was working. Is BF16 actually supported on GB10, and is there a benefit over FP16 for training?

Only torch 2.10.0+cu130 and beyond are aware of the GB10 microarchitecture (SM 121). I would recommend training only with an SM 121-aware build.

I have heard some video applications get better-quality inference at MXFP8, which is a no-brainer. But I think BF16 is the standard for training.

When to Use Each

  • BF16 for Training Stability: BF16 is the standard for training large models (LLMs) because its wide range prevents numerical “overflow” or “underflow” (where numbers become too large or too small to represent). This eliminates the need for complex “loss scaling” techniques required by FP16.
  • FP16 for Inference & Efficiency: FP16 is often preferred for inference tasks where speed is critical and the range of values is already well-defined. On specific 2026-era benchmarks, FP16 has shown up to 3x faster evaluation speeds compared to BF16 during certain inference-heavy cycles.
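The range difference is easy to demonstrate in pure Python: the struct module's 'e' format is real IEEE half precision, while the bfloat16 conversion below is a simple truncation used for illustration (real hardware rounds):

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision (struct format 'e').
    Values beyond ~65504 cannot be packed; treat that as overflow."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by truncating float32 to its top 16 bits
    (1 sign + 8 exponent + 7 mantissa), which keeps FP32's range."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_fp16(70000.0))  # inf     : FP16 overflows, hence loss scaling
print(to_bf16(70000.0))  # 69632.0 : BF16 keeps the magnitude, just coarser
```

FP16 spends its 16 bits on precision (10 mantissa bits) but tops out near 65504; BF16 spends them on range (8 exponent bits, like FP32), which is why losses and gradients rarely overflow when training in BF16.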

Thanks @bfurtaw! That explains it - I was hitting the BF16 error with pip-installed PyTorch. Switching to the NVIDIA container fixed it (likely has torch 2.10+ cu130).

For SFT timing: earlier test run was ~2hrs at step 4550/5520 when it crashed (venv issue, not container). Current DAPT run is processing ~1.7M chunks with LoRA on Qwen3-4B. Will report back with full timings.

I wish NVIDIA would update DGX OS with versions of PyTorch, nvtop, etc. that are compatible with SM 121.

I don’t know if I’m in the minority. For my specific project I bought the spark for, docker is just an extra layer.

Thanks again!