NVFP4 quantization of a 100B-class Llama on 2× DGX Spark — lessons + open questions

Hi all — first post here, sharing some Spark-cluster work and asking
the team a couple of things.

I just published an NVFP4 quantization of TheDrummer’s Anubis-Pro-105B-v1
(Llama-3.3-derived, 105B) on Hugging Face — afaict the first RP/storytelling
model at the 100B+ class in NVFP4 for Spark:

It was produced on a personal 2× DGX Spark cluster (256 GB combined UMA
over ConnectX-7), since the standard single-node modelopt hf_ptq.py path
is silently OOM-killed on Spark for any model in this size class. The
distributed pipeline (Ray, layer-sharded across both Sparks, BF16
calibration → NVFP4 export via modelopt 0.43) is documented in the
model card, including the bugs I had to work around before vLLM would
serve the output correctly:

  1. accelerate.infer_auto_device_map misdetects GB10 unified memory
    as a ~5.2 TB GPU and triggers silent OOM in single-node flows.
  2. modelopt 0.43’s NVFP4 export writes input_activations.dynamic=false
    but does NOT emit any input_scale keys. vLLM then registers an
    uninitialized Parameter for them and produces garbage output until
    you inject input_scale=1.0 sidecar keys for every quantized Linear
    (840 in this case).
  3. modelopt’s per-layer export-via-1-layer-template trick needs
    vocab_size=2 (not 1) because the internal llm_dummy_forward feeds
    torch.ones([1,2]) into the embedding, i.e. token id 1, which is
    out of range for a one-row embedding table.
  4. Phase-6 OOM mitigation by clearing _calibrator was incompatible
    with set_quantizer_by_cfg_context in modelopt 0.43 — had to keep
    the calibrator object around.

The full six-fix list with rationale is in the model card under “Recent
Fixes”. The pipeline auto-applies these now.
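
For anyone hitting bug 2 on their own export, here is a minimal sketch of how the missing keys could be enumerated. It assumes a quantized Linear can be recognised by a `<prefix>.weight_scale` tensor and that the key vLLM wants is `<prefix>.input_scale`; both suffixes and the helper names are my assumptions for illustration, not taken from the pipeline or confirmed against modelopt 0.43.

```python
# Sketch only: the key suffixes below are assumptions about the export layout,
# not confirmed against modelopt 0.43 or the distrib_quant.py pipeline.
WEIGHT_SCALE_SUFFIX = ".weight_scale"  # assumed marker of a quantized Linear
INPUT_SCALE_SUFFIX = ".input_scale"    # assumed key for the static activation scale


def missing_input_scale_keys(tensor_keys):
    """Return the input_scale keys that should exist but do not."""
    keys = set(tensor_keys)
    missing = []
    for key in keys:
        if not key.endswith(WEIGHT_SCALE_SUFFIX):
            continue
        prefix = key[: -len(WEIGHT_SCALE_SUFFIX)]
        wanted = prefix + INPUT_SCALE_SUFFIX
        if wanted not in keys:
            missing.append(wanted)
    return sorted(missing)


def build_sidecar(tensor_keys, value=1.0):
    """Map every missing input_scale key to a neutral scale value."""
    return {key: value for key in missing_input_scale_keys(tensor_keys)}


example_keys = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.weight_scale",
    "model.layers.0.self_attn.q_proj.weight_scale_2",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.mlp.down_proj.weight_scale",
    "model.layers.0.mlp.down_proj.input_scale",  # already present, left alone
    "lm_head.weight",                            # unquantized, left alone
]
sidecar = build_sidecar(example_keys)
print(sidecar)
```

The mapping would still have to be written out as real tensors next to the existing shards and registered in the checkpoint index; that part is left out here because it depends on how the export is laid out on disk.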

End result: model loads in vLLM, output is coherent (literary
continuation, correct arithmetic), calibration came out clean
(good=420 zero=0 nan=0 per shard).
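
A sketch of the kind of health check that produces a good/zero/nan summary, assuming one calibrated amax value per quantized Linear (840 Linears over two shards would give 420 each). The function names and the choice to count infinities in the nan bucket are mine, not the pipeline's.

```python
import math


def classify_amax(amax_values):
    """Bucket calibrated amax values into good / zero / nan counts."""
    counts = {"good": 0, "zero": 0, "nan": 0}
    for value in amax_values:
        if math.isnan(value) or math.isinf(value):
            counts["nan"] += 1   # assumption: any non-finite value counts as nan
        elif value == 0.0:
            counts["zero"] += 1  # quantizer never saw a non-zero activation
        else:
            counts["good"] += 1
    return counts


def format_summary(counts):
    """Render the counts in the same style as the per-shard log line."""
    return "good={good} zero={zero} nan={nan}".format(**counts)


demo = classify_amax([0.5, 0.0, float("nan"), 2.0])
print(format_summary(demo))
```

A non-zero `zero` or `nan` count would be the signal to stop before export, since a bad amax would presumably turn into a bad scale in the checkpoint.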

A few honest disclaimers up front:

  • The model uses NVFP4_DEFAULT_CFG with only lm_head in the ignore
    list — NOT a GB10-bandwidth-tuned recipe like @saricles’ -GB10
    releases. I didn’t earn the -GB10 suffix and don’t claim it. A
    future v2 might apply that recipe.

  • My benchmark numbers are stock vLLM 0.20.2rc1.dev53+g01b9b5af6
    without VLLM_NVFP4_GEMM_BACKEND=marlin or the other env vars
    @tbraun96 documented in “We unlocked NVFP4 on the DGX Spark”. They
    likely undersell what’s achievable on the proper Spark runtime
    (Avarok’s avarok/dgx-vllm-nvfp4-kernel image). Re-bench on that
    stack to follow as a separate update.

  • For “is the fast path active?”: on Spark the right startup-log
    signals are “Using AttentionBackendEnum.FLASHINFER backend.” and,
    for MoE models, “Using ‘MARLIN’ NvFp4 MoE backend”. Our model is
    dense Llama so the MoE-marlin doesn’t apply — for the dense
    matmuls, what matters is that vLLM doesn’t fall back to W4A16 or
    generic marlin instead of the NVFP4-specific kernels.
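
If you want to check those startup signals mechanically, something like the following would do. The two marker strings are the ones quoted above; the sample log lines, their prefix, and the function names are hypothetical. Typographic quotes are folded to ASCII so that text pasted from a forum still matches.

```python
# Marker strings are the two quoted in the post; everything else is illustrative.
ATTENTION_MARKER = "Using AttentionBackendEnum.FLASHINFER backend."
MOE_MARKER = "Using 'MARLIN' NvFp4 MoE backend"


def _plain_quotes(text):
    """Fold typographic single quotes to ASCII so pasted logs still match."""
    return text.replace("\u2018", "'").replace("\u2019", "'")


def fast_path_signals(log_lines, is_moe):
    """Report which of the expected startup markers appear in the log."""
    lines = [_plain_quotes(line) for line in log_lines]
    found = {
        "flashinfer_attention": any(ATTENTION_MARKER in line for line in lines)
    }
    if is_moe:  # the MoE marker is irrelevant for dense models
        found["marlin_moe"] = any(MOE_MARKER in line for line in lines)
    return found


sample_log = [
    "INFO hypothetical prefix: Using AttentionBackendEnum.FLASHINFER backend.",
    "INFO hypothetical prefix: model loaded",
]
print(fast_path_signals(sample_log, is_moe=False))
```

This only confirms the attention backend and, for MoE models, the MoE backend. It does not tell you which kernel the dense NVFP4 matmuls ended up on.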

Performance (stock vLLM, single Spark, no runtime env-var tuning,
no warmup besides what vLLM does automatically — re-bench on
Avarok’s image is coming):

Context (tokens)   Prompt-proc   Decode (per stream)   Memory
 4 096             ~340 tok/s    ~3.1 tok/s            ~109 GB
16 384             ~650 tok/s    ~2.9 tok/s            ~109 GB
32 768             ~850 tok/s    ~2.9 tok/s            ~109 GB

Aggregate at concurrency 4 (4K ctx): ~10.4 tok/s output, ~167 tok/s
total throughput.
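
To put the concurrency figure next to the single-stream one, here is the arithmetic as a small sketch. It uses only the rounded numbers from this post (4K context), so treat the results as approximate; the function name is mine.

```python
def decode_scaling(single_stream_tps, aggregate_tps, concurrency):
    """Compare aggregate decode throughput with the single-stream rate."""
    per_stream = aggregate_tps / concurrency
    speedup = aggregate_tps / single_stream_tps
    return {
        "per_stream_tps": round(per_stream, 2),
        "speedup_vs_single": round(speedup, 2),
        "scaling_efficiency": round(speedup / concurrency, 2),
    }


# ~3.1 tok/s single stream, ~10.4 tok/s aggregate output at concurrency 4
print(decode_scaling(3.1, 10.4, 4))
```

So each of the four streams decodes at roughly 2.6 tok/s, and batching four requests buys about 3.35x the single-stream output rate.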

Open questions for the team:

  • Is the accelerate UMA-misdetection on Spark/GB10 tracked anywhere
    internally? Known patch in flight?
  • Is there an official NGC modelopt build newer than 0.43 that fixes
    the input_scale omission? Happy to validate with my AI Enterprise
    eval against the same pipeline.
  • vLLM PR #41925 (Cohere2 NVFP4 loading) — on anyone’s radar? I
    have a Fallen-Command-111B NVFP4 sitting on disk waiting for it.

Big thanks to @tbraun96 / Avarok-Cybersecurity for the MARLIN backend
port that made NVFP4 actually competitive on Spark, and to @saricles
for setting the bar on what GB10-targeted quantization recipes look
like. Both have been doing the hard runtime/recipe work this release
benefits from.

— Kai

Very cool! I was looking for more mature tooling to quantize larger models with my dual node setup. Would love to give it a spin as soon as you’re able to release more on the actual workflow you had to apply.

Thanks @serapis — that’s exactly the use case the pipeline targets.
Dual-node Spark is the only way I found to get clean NVFP4 for the
100B+ class; single-node hf_ptq.py is genuinely broken on GB10 for
anything near or above the 128 GB unified-memory threshold.

I’ll bundle the pipeline (distrib_quant.py + smoke test + ~600 lines
of documentation covering the six modelopt-0.43 quirks you have to
dodge) into a public GitHub repo and link it here in the next few days. Right
now I’m running the same scripts against Behemoth-X-123B
(Mistral-Large base, 88 layers) as a generalization test — if that
converges, the release doubles as proof for the 123B class.

If you have something specific on your list and want early access
before I publish, drop a note here with the model + rough timeline
and I’ll share what I have now.
Kai

@serapis — repo is up, Apache 2.0:

-> KaletoAI/distrib-nvfp4 on GitHub: distributed NVFP4 quantization pipeline for 100B+ LLMs on a 2-node NVIDIA DGX Spark cluster

Quick map of what’s in there:

  • scripts/distrib_quant.py       full distributed driver (Ray, 2 actors)
  • scripts/export_smoke_test.py   1-layer dry-run for ~1 min validation
  • scripts/run_quant.sh           tmux wrapper with env-overridable paths
  • docs/debugging-notes.md        long-form on the six modelopt-0.43 bugs
                                   we hit while getting clean output from
                                   vLLM (the input_scale=1.0 sidecar fix
                                   is the load-bearing one)

It’s model-agnostic — Llama 3.x and Mistral-Large work today (class
introspection on rms_norm_cls + rotary_emb_cls, not name-matching).
Has --resume-from-checkpoint and a Phase-5.5 disk-eviction pass so the
same pipeline handles 120-130B class without the UMA going past ~96%
during the export phase.

If you (or anyone here) run it and hit something the smoke test
didn’t catch, issues / PRs welcome.

Kai

Very cool, thanks! Now I’ll have to pick the right model to give it a spin 😅

Have you tried creating a quant of one of the bigger MoE models? What is the upper limit you were able to push? I see mentions of 105B per shard but wanted to confirm.

Glad it’s useful!

No MoE yet — that space is being driven by @entrpi (and antirez before
him); their DeepSeek-V4-Flash thread covers the right approach (NVFP4
only on the sparse-activated experts, FP8 on dense + attention). My
pipeline is dense-model-focused; for Mixtral / DeepSeek-V3 / Qwen3-MoE
you’d want their hybrid path.

On the per-shard limit: binding constraint is BF16 footprint during
Phase-3b calibration. ~115-120 GB resident weights per Spark before
Ray overhead + activation buffers trip kernel-OOM. Tested envelopes:

2-shard (2× Spark, IB):
  DeepSeek-R1-Distill-70B, Anubis-Pro-105B → comfortable
  Fallen-Command-A-111B → tight, quant done, serving blocked
                           on vLLM PR #41925
  Behemoth-X-123B → OOM'd; that's what motivated 3-shard

3-shard (+ RTX 3090 over 2.5 GbE LAN):
  Behemoth-X-123B with 41/41/6 split works.
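
A back-of-envelope sketch of why those envelopes land where they do. It assumes nominal parameter counts, 2 bytes per BF16 parameter, decimal GB, and weights spread evenly over the decoder layers (embeddings and lm_head are ignored), so it is an estimate rather than a measurement. The function name is mine.

```python
BYTES_PER_BF16_PARAM = 2
RESIDENT_LIMIT_GB = (115, 120)  # the ~115-120 GB per-Spark range quoted above


def shard_footprints_gb(params_billion, layer_split):
    """Estimate BF16 weight GB per shard for a given layer split."""
    total_gb = params_billion * BYTES_PER_BF16_PARAM  # 1e9 params * 2 B = 2 GB
    total_layers = sum(layer_split)
    return [round(total_gb * n / total_layers, 1) for n in layer_split]


# Two equal shards: only the ratio matters, so (1, 1) stands in for half/half.
print("105B, 2-shard:", shard_footprints_gb(105, (1, 1)))
print("111B, 2-shard:", shard_footprints_gb(111, (1, 1)))
print("123B, 2-shard:", shard_footprints_gb(123, (1, 1)))
# The 41/41/6 split over the 88 layers of the 123B model.
print("123B, 3-shard:", shard_footprints_gb(123, (41, 41, 6)))
```

Under these assumptions 105B sits at about 105 GB per Spark, 111B at about 111 GB (close to the limit), and 123B at about 123 GB, which is past it. With the 41/41/6 split each Spark drops to roughly 114.6 GB and the third shard holds about 16.8 GB, which would fit within the 24 GB of an RTX 3090.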

Network was a non-issue — 2.5 GbE added only ~50 sec to a 30-min
calibration. Compute always dominated. Let me know which model you
pick 🙂