Hi all — first post here, sharing some Spark-cluster work and asking
the team a couple of things.
I just published an NVFP4 quantization of TheDrummer’s Anubis-Pro-105B-v1
(Llama-3.3-derived, 105B) on Hugging Face — afaict the first RP/storytelling
model at the 100B+ class in NVFP4 for Spark:
It was produced on a personal 2× DGX Spark cluster (256 GB combined UMA
over ConnectX-7), since the standard single-node modelopt hf_ptq.py path
silently OOM-kills on Spark for any model in this size class. The
distributed pipeline (Ray, layer-sharded across both Sparks, BF16
calibration → NVFP4 export via modelopt 0.43) is documented in the
model card, including the bugs I had to work around before vLLM would
serve the output correctly:
- accelerate.infer_auto_device_map misdetects GB10 unified memory
as a ~5.2 TB GPU and triggers silent OOM in single-node flows.
- modelopt 0.43’s NVFP4 export writes input_activations.dynamic=false
but does NOT emit any input_scale keys. vLLM then registers an
uninitialized Parameter for them and produces garbage output until
you inject input_scale=1.0 sidecar keys for every quantized Linear
(840 in this case).
- modelopt’s per-layer export-via-1-layer-template trick needs
vocab_size=2 (not 1) because the internal llm_dummy_forward feeds
torch.ones([1,2]) into the embedding.
- Phase-6 OOM mitigation by clearing _calibrator was incompatible
with set_quantizer_by_cfg_context in modelopt 0.43, so I had to keep
the calibrator object around.
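
For the first item, one way to sidestep the misdetection is to hand accelerate an explicit memory budget instead of letting it probe the device. This is a minimal sketch, not the pipeline's actual fix: `build_max_memory` is a hypothetical helper, and the 128 GiB / 16 GiB figures are illustrative assumptions for one Spark. The only real API referenced is the `max_memory` argument of `accelerate.infer_auto_device_map`.

```python
GIB = 1024 ** 3

def build_max_memory(physical_gib, reserve_gib, cpu_gib=0):
    """Build an explicit max_memory mapping for a unified-memory node.

    Hypothetical helper: GPU and CPU share one pool on UMA, so the CPU
    budget is carved out of the same physical total rather than added.
    """
    if reserve_gib < 0 or cpu_gib < 0:
        raise ValueError("budgets must be non-negative")
    if reserve_gib + cpu_gib >= physical_gib:
        raise ValueError("nothing left for the GPU after reserve and CPU budget")
    budget = {0: (physical_gib - reserve_gib - cpu_gib) * GIB}
    if cpu_gib:
        budget["cpu"] = cpu_gib * GIB
    return budget

# Assumed figures: 128 GiB per node, 16 GiB held back for the OS and runtime.
budget = build_max_memory(physical_gib=128, reserve_gib=16)
print(budget)

# With accelerate installed and a model object in scope, the budget would be
# passed like this (left commented so the sketch runs standalone):
# from accelerate import infer_auto_device_map
# device_map = infer_auto_device_map(model, max_memory=budget)
```

Using integer byte counts avoids any ambiguity between GB and GiB string suffixes.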
The full six-fix list with rationale is in the model card under “Recent
Fixes”. The pipeline now applies all of them automatically.
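
To illustrate the input_scale workaround: the sketch below lists which `input_scale` keys a checkpoint is missing and builds the 1.0-valued sidecar entries for them. It rests on an assumption about naming, namely that each quantized Linear can be recognised by a `<prefix>.weight_scale` tensor; check that against your own export. `missing_input_scale_keys` and the example tensor names are hypothetical.

```python
def missing_input_scale_keys(tensor_names):
    """Return the input_scale keys absent for every quantized Linear.

    Assumption: a quantized Linear is any prefix that has a
    '<prefix>.weight_scale' tensor in the checkpoint.
    """
    names = set(tensor_names)
    suffix = ".weight_scale"
    missing = []
    for name in sorted(names):
        if name.endswith(suffix):
            key = name[: -len(suffix)] + ".input_scale"
            if key not in names:
                missing.append(key)
    return missing

# Hypothetical tensor names for one projection plus the unquantized lm_head.
example_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.weight_scale",
    "model.layers.0.self_attn.q_proj.weight_scale_2",
    "lm_head.weight",
]
missing = missing_input_scale_keys(example_names)
sidecar = {key: 1.0 for key in missing}  # values to write out as scalar tensors
print(sidecar)
```

In a real run the names would come from the exported safetensors shards, and the sidecar values would be written as scalar tensors (for example with `safetensors.torch.save_file`) and registered in the checkpoint index. The length of `missing` should match the number of quantized Linears, 840 for this model.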
End result: model loads in vLLM, output is coherent (literary
continuation, correct arithmetic), calibration came out clean
(good=420 zero=0 nan=0 per shard).
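
A tally like good/zero/nan can be produced with a check along these lines. This is only a sketch of the idea: it assumes the counts classify one calibrated amax value per quantizer, and it files infinities under "nan". The function name and the input mapping are hypothetical, not the pipeline's real code.

```python
import math

def classify_amax(amax_by_quantizer):
    """Count calibrated amax values as good, zero, or nan.

    Assumption: non-finite values (NaN or inf) are both reported as 'nan'.
    """
    counts = {"good": 0, "zero": 0, "nan": 0}
    for value in amax_by_quantizer.values():
        if math.isnan(value) or math.isinf(value):
            counts["nan"] += 1
        elif value == 0.0:
            counts["zero"] += 1
        else:
            counts["good"] += 1
    return counts

# Hypothetical shard with three quantizers, one of which never saw data.
example = {
    "layers.0.q_proj.input_quantizer": 3.25,
    "layers.0.k_proj.input_quantizer": 0.0,
    "layers.0.v_proj.input_quantizer": float("nan"),
}
print(classify_amax(example))
```

A zero or non-finite amax usually means the quantizer saw no calibration data or overflowed, so any non-zero count in those buckets is worth investigating before export.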
A few honest disclaimers up front:
- The model uses NVFP4_DEFAULT_CFG with only lm_head in the ignore
list, NOT a GB10-bandwidth-tuned recipe like @saricles’ -GB10
releases. I didn’t earn the -GB10 suffix and don’t claim it. A
future v2 might apply that recipe.
- My benchmark numbers are stock vLLM 0.20.2rc1.dev53+g01b9b5af6
without VLLM_NVFP4_GEMM_BACKEND=marlin or the other env vars
@tbraun96 documented in “We unlocked NVFP4 on the DGX Spark”. They
likely undersell what’s achievable on the proper Spark runtime
(Avarok’s avarok/dgx-vllm-nvfp4-kernel image). Re-bench on that
stack to follow as a separate update.
- For “is the fast path active?”: on Spark the right startup-log
signals are “Using AttentionBackendEnum.FLASHINFER backend.” and,
for MoE models, “Using ‘MARLIN’ NvFp4 MoE backend”. Our model is
dense Llama, so the MoE Marlin line doesn’t apply. For the dense
matmuls, what matters is that vLLM doesn’t fall back to W4A16 or
generic marlin instead of the NVFP4-specific kernels.
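
The two startup-log signals quoted above can be checked mechanically. The sketch below scans captured log text for them. The patterns are copied from the strings in this post, and the exact quote characters around MARLIN in real logs are an assumption, so the regex accepts straight or curly quotes, or none. `fast_path_signals` is a hypothetical helper.

```python
import re

# Patterns taken from the log lines quoted in this post.
ATTENTION_SIGNAL = re.compile(r"Using AttentionBackendEnum\.FLASHINFER backend\.")
MOE_SIGNAL = re.compile(r"Using ['‘’]?MARLIN['‘’]? NvFp4 MoE backend")

def fast_path_signals(log_text):
    """Report which of the two startup signals appear in the log text."""
    return {
        "flashinfer_attention": bool(ATTENTION_SIGNAL.search(log_text)),
        "marlin_nvfp4_moe": bool(MOE_SIGNAL.search(log_text)),
    }

# Hypothetical log excerpt for a dense model: only the attention line shows up.
sample_log = "INFO Using AttentionBackendEnum.FLASHINFER backend.\nINFO Loading weights\n"
print(fast_path_signals(sample_log))
```

For a dense model only the first flag is expected to be true. Neither flag says anything about which kernel handles the dense NVFP4 matmuls, so that still needs a separate look at the log.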
Performance (stock vLLM, single Spark, no runtime env-var tuning,
no warmup besides what vLLM does automatically — re-bench on
Avarok’s image is coming):
| Context | Prompt-proc | Decode (per stream) | Memory |
|---|---|---|---|
| 4 096 | ~340 tok/s | ~3.1 tok/s | ~109 GB |
| 16 384 | ~650 tok/s | ~2.9 tok/s | ~109 GB |
| 32 768 | ~850 tok/s | ~2.9 tok/s | ~109 GB |
Aggregate at concurrency 4 (4K ctx): ~10.4 tok/s output, ~167 tok/s
total throughput.
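
To put these numbers in more practical terms, the short calculation below derives the approximate time to process a full-length prompt at each context size, plus the per-stream decode rate at concurrency 4. It uses only the figures from the table and the aggregate line. The variable and function names are illustrative, and it assumes the prompt-processing rate holds across the whole prompt.

```python
# Context length (tokens) -> measured prompt-processing rate (tok/s), from the table.
prompt_rates = {4096: 340, 16384: 650, 32768: 850}

def prefill_seconds(context_tokens, tok_per_s):
    """Approximate seconds to ingest a prompt that fills the context."""
    return context_tokens / tok_per_s

prefill = {ctx: round(prefill_seconds(ctx, rate), 1) for ctx, rate in prompt_rates.items()}

aggregate_output_tok_s = 10.4   # concurrency 4, 4K context
single_stream_tok_s = 3.1       # single-stream decode at 4K context
concurrency = 4

per_stream = aggregate_output_tok_s / concurrency
scaling = aggregate_output_tok_s / single_stream_tok_s

print(prefill)
print(round(per_stream, 2), round(scaling, 2))
```

So a full 4K prompt takes roughly 12 s to ingest, 16K roughly 25 s, and 32K roughly 39 s. At concurrency 4 each stream decodes at about 2.6 tok/s, which puts the aggregate at about 3.35× the single-stream rate.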
Open questions for the team:
- Is the accelerate UMA-misdetection on Spark/GB10 tracked anywhere
internally? Known patch in flight?
- Is there an official NGC modelopt build newer than 0.43 that fixes
the input_scale omission? Happy to validate with my AI Enterprise
eval against the same pipeline.
- Is vLLM PR #41925 (Cohere2 NVFP4 loading) on anyone’s radar? I
have a Fallen-Command-111B NVFP4 sitting on disk waiting for it.
Big thanks to @tbraun96 / Avarok-Cybersecurity for the MARLIN backend
port that made NVFP4 actually competitive on Spark, and to @saricles
for setting the bar on what GB10-targeted quantization recipes look
like. Both have been doing the hard runtime/recipe work this release
benefits from.
— Kai