I am posting this as a paying customer with 9× DGX Spark / GB10 nodes in production (~$38k invested), asking NVIDIA for an on-record response on the state of NVFP4 on this hardware. I want a reply from someone empowered to speak to the DGX Spark product roadmap, not another community comment, please.
I bought this hardware specifically for NVFP4. The software to make that usable is not there. This post documents, with primary-source citations only, what NVIDIA promised, what the community has measured, what the community has fixed on its own, and the complete absence of any badged NVIDIA-staff response addressing the gap.
My deployment
- 9× GB10 DGX Spark / OEM equivalents, ~$4,000 each
- Cluster of 8 + single node
- 2× Mikrotik CRS804 fabric, ConnectX-7 on every node
- Head node running SGLang serving GLM-5.1 (754B / 40B active) at FP8 — a 24/7 agentic coding workload (FP8 because NVFP4 does not work without workarounds that remain suboptimal)
FP8 serving works. NVFP4 does not. That is the entire premise of this post.
What NVIDIA promised about NVFP4 on GB10 — verbatim quotes from NVIDIA’s own materials
DGX Spark hardware datasheet — Hardware Overview — DGX Spark User Guide
“Up to 1,000 TOPS (trillion operations per second) inference and up to 1 PFLOP (petaFLOP) at FP4 precision with sparsity”
“NVIDIA Blackwell Architecture with 5th Generation Tensor Cores”
DGX Spark product page — NVIDIA DGX Spark: AI Supercomputer on Your Desk
“up to one petaFLOP of FP4 AI performance”
Nemotron-3-Super-120B-A12B-NVFP4 model card — nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · Hugging Face
“Minimum GPU Requirement: 1× B200 OR 1× DGX Spark”
Deployment section includes: “vLLM on DGX Spark: To deploy the NVFP4 checkpoint on NVIDIA DGX Spark…”
Published benchmark hardware on the card: H100, H200, GB200. No GB10 / DGX Spark numbers are published anywhere.
Nemotron-3-Super announcement blog — Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning | NVIDIA Technical Blog
“4x on NVIDIA B200 compared to FP8 on NVIDIA H100”
The headline NVFP4 speedup figure is measured on B200 — not on the GB10 hardware NVIDIA lists as a supported deployment target.
GTC 2026 NemoClaw blog — RTX PCs and DGX Spark Supercomputers Run AI Agents Locally | NVIDIA Blog
“Nemotron 3 Super is optimal for powering agents on the DGX Spark or NVIDIA RTX PRO workstations.”
NVIDIA’s own most recent marketing directs customers to run its flagship NVFP4-native model on DGX Spark while publishing zero GB10 benchmarks and delivering a software stack that does not exercise the FP4 tensor cores the hardware was sold on.
What actually runs on GB10 — NVFP4 measurements posted to NVIDIA’s own developer forum
Llama-3.3-70B-Instruct-NVFP4 on TensorRT-LLM (NVIDIA’s own flagship NVFP4 model, on NVIDIA’s own first-party inference stack):
- 5 tok/s decode, single Spark — TensorRT-LLM + nvidia/Llama-3.3-70B-Instruct-NVFP4 = 5 tok/s
- 2.5 tok/s decode, separate reporter — TRT LLM for Inference with NVFP4 safetensors slower than LM studio GGUF on the Spark
The second thread documents that vanilla GGUF Q4_K_M via LM Studio runs the same 70B model at 4.6–4.9 tok/s on the same Spark — NVIDIA’s NVFP4 model on NVIDIA’s TRT-LLM is slower than a non-NVIDIA quant on non-NVIDIA tooling.
Nemotron-3-Super-120B-A12B-NVFP4 (NVIDIA’s other flagship NVFP4 model, explicitly named on the model card as deployable on DGX Spark):
- 19–22 tok/s decode, single Spark — Nemotron-3-Super-120B at 20-22 tok/s Super Special Recipe
- 19 tok/s decode, 72 hours continuous — Running Nemotron 3 Super 120B on DGX Spark GB10— 72 hours continuous, 19 tok/s
- 24 tok/s decode, 2× Spark with tensor parallel — Nemotron-3-Super NVFP4 via vLLM TP=2 on 2x DGX Spark — 24 tok/s (ABI fix for cu130/cu132 mismatch)
Why 19–22 tok/s is unambiguously bad on this hardware
Nemotron-3-Super has 12B active parameters per forward pass. At NVFP4 (0.5 bytes per parameter), each decoded token reads ~6 GB of active weights from memory. GB10 has 273 GB/s of LPDDR5x bandwidth (per NVIDIA’s own datasheet).
- Theoretical bandwidth-limited ceiling: 273 ÷ 6 = ~45 tok/s for this model on this silicon, even if every other overhead were zero.
- Measured: 19–22 tok/s = 42–48% of the bandwidth ceiling.
- Even at an effective 200 GB/s, the ceiling is still ~33 tok/s — and I sustain almost 200 GB/s in raw tests across the dual Mikrotik CRS804 fabric.
- What a reasonable well-optimized NVFP4 path should deliver on this hardware: ~30–40 tok/s (60–80% bandwidth efficiency is routine on GB10 in other configurations).
- Put plainly by a community member on the HuggingFace model-card discussion: “This is a model with 12B activated parameters per token. It should generate at least 30 t/s.” — nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · All this talk about NVFP4 - why is it dog slow?
The hardware is leaving roughly half its achievable throughput on the floor on NVIDIA’s own NVFP4-native flagship model. This is not a memory-bandwidth limitation. This is a kernel / software-stack limitation. The FP4 tensor cores NVIDIA marketed are not being exercised effectively.
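The bandwidth-ceiling arithmetic above is easy to check. A minimal sketch, using only the figures already cited (12B active parameters, NVFP4 at 0.5 bytes/parameter, 273 GB/s from the datasheet):

```python
# Bandwidth-limited decode ceiling for a memory-bound MoE decode step.
# All inputs are the figures cited above; nothing here is measured by this script.

active_params = 12e9        # Nemotron-3-Super: 12B active parameters per token
bytes_per_param = 0.5       # NVFP4: 4 bits per weight
mem_bw_gb_s = 273           # GB10 LPDDR5x bandwidth, per NVIDIA's datasheet

gb_read_per_token = active_params * bytes_per_param / 1e9   # ~6.0 GB
ceiling_tok_s = mem_bw_gb_s / gb_read_per_token             # ~45.5 tok/s

for measured in (19, 22):
    eff = 100 * measured / ceiling_tok_s
    print(f"{measured} tok/s -> {eff:.0f}% of bandwidth ceiling")
# 19 tok/s -> 42% of bandwidth ceiling
# 22 tok/s -> 48% of bandwidth ceiling
```

Note this ceiling ignores activations, KV-cache traffic, and kernel overhead, all of which only lower the real achievable number — so 42–48% is an upper bound on how efficiently the shipped stack is using the memory system.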
The HuggingFace discussion thread on the model card is titled “All this talk about NVFP4 — why is it dog slow?” — nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · All this talk about NVFP4 - why is it dog slow? NVIDIA has posted no substantive reply.
Direct A/B comparisons on identical models: NVFP4 loses on the hardware it was built for
Same model. Same hardware. Same framework. Different precision. Community-reported numbers with URLs.
Qwen3-Next-80B-A3B-Instruct, single Spark, vLLM — Qwen3-Next AWQ 4bit vs FP8 vs NVFP4 on single spark
| Precision | Decode tok/s |
|---|---|
| FP8 | 44.56 |
| NVFP4 | 39.54 |
| AWQ 4-bit | 32.82 |
FP8 beats NVFP4 by ~13%, even though FP8 reads twice as many weight bytes per token as the 4-bit format it loses to.
Qwen3-VL-235B, 2× Spark, vLLM — PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM
| Precision | Decode tok/s (1 req) | Decode tok/s (10 concurrent) |
|---|---|---|
| AWQ 4-bit | 24.93 | 42.11 |
| NVFP4 | 18.91 | 35.58 |
AWQ beats NVFP4 by 18–32% on the precision NVIDIA has been most aggressively marketing.
MiniMax-M2.7, 2× Spark, vLLM — MiniMax M2.7 NFVP4 Recipe & Benchmarks
| Precision | Decode tok/s |
|---|---|
| AWQ 4-bit | 39.39 |
| NVFP4 (FlashInfer-cutlass, fully optimized) | 25.69 |
AWQ beats NVFP4 by 53%.
Nemotron-3-Nano-30B-A3B-NVFP4, single Spark, vLLM — Marlin Fix: NVFP4 Actually Works on SM121 (DGX Spark)
| NVFP4 backend | Decode tok/s | GPU memory |
|---|---|---|
| Default (FlashInfer — what ships from NVIDIA) | 42.6 | 39 GB |
| Community Marlin patch | 50.0 (+16%) | 32 GB (−7 GB) |
The default NVFP4 path NVIDIA ships costs 16% throughput and wastes 7 GB of GPU memory compared to a community-built patch.
The pattern is unambiguous: on GB10, NVFP4 is currently slower than FP8, slower than AWQ, and slower than community-patched NVFP4 using non-default backends. The headline format of this hardware is the worst practical format option on it.
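The percentages above come straight from the tables; a quick sanity check over the single-request decode numbers:

```python
# Relative decode-throughput gaps from the A/B tables above (single-request tok/s).
ab_tests = [
    ("Qwen3-Next-80B, 1x Spark", "FP8",       44.56, "NVFP4", 39.54),
    ("Qwen3-VL-235B, 2x Spark",  "AWQ 4-bit", 24.93, "NVFP4", 18.91),
    ("MiniMax-M2.7, 2x Spark",   "AWQ 4-bit", 39.39, "NVFP4", 25.69),
]
for model, winner, w_tps, loser, l_tps in ab_tests:
    gap = 100 * (w_tps / l_tps - 1)
    print(f"{model}: {winner} beats {loser} by {gap:.0f}%")
# Qwen3-Next-80B, 1x Spark: FP8 beats NVFP4 by 13%
# Qwen3-VL-235B, 2x Spark: AWQ 4-bit beats NVFP4 by 32%
# MiniMax-M2.7, 2x Spark: AWQ 4-bit beats NVFP4 by 53%
```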
The community has documented the underlying issues exhaustively
Threads on NVIDIA’s own developer forum — all customer-opened, none with a badged-NVIDIA-staff resolution:
- DGX Spark SM121 software support is severely lacking - official roadmap needed — DGX Spark (SM121) Software Support is Severely Lacking - Official Roadmap Needed
- Dearest CUTLASS TEAM, When the hell are you going to properly fix tcgen05 FP4 support for DGX Spark / GB10 (SM121)? — Dearest CUTLASS TEAM, When the hell are you going to properly fix tcgen05 FP4 support for DGX Spark / GB10 (SM121)?
- SM121/GB10 native NVFP4 compute - seeking guidance on software support — SM121 (GB10) native NVFP4 compute — seeking guidance on software support
- PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM (10+ pages) — PSA: State of FP4/NVFP4 Support for DGX Spark in VLLM
- FP4 on DGX Spark — Why It Doesn’t Scale Like You’d Expect — FP4 on DGX Spark — Why It Doesn't Scale Like You'd Expect
- NVFP4 issue root cause? — NVFP4 issue root cause?
- Question on Reproducing DGX Spark (GB10) FP4 1 PFLOPS Performance Using CUTLASS Profiler — Question on Reproducing DGX Spark (GB10) FP4 1 PFLOPS Performance Using CUTLASS Profiler
Community engineering work that NVIDIA has not adopted:
- SM121 CUTLASS Kernel Optimization Results: NVFP4 356 TFLOPS, MoE Grouped GEMM — community-built kernels proving the silicon can do substantially better than the shipped stack delivers — SM121 CUTLASS Kernel Optimization Results: NVFP4 356 TFLOPS, MoE Grouped GEMM on DGX Spark
- Marlin Fix: NVFP4 Actually Works on SM121 (DGX Spark) — Marlin Fix: NVFP4 Actually Works on SM121 (DGX Spark)
- From 20 to 35 tps on Qwen3-Next NVFP4 with FlashInfer 12.1f — From 20 to 35 TPS on Qwen3-Next-NVFP4 w/ FlashInfer 12.1f
- Two multi-node DGX Spark wins: RoCE 2× inference throughput + Qwen3.5-397B-A17B-NVFP4 serving (with SM121 CUTLASS patch) — Two multi-node DGX Spark wins: RoCE 2× inference throughput + Qwen3.5-397B-A17B-NVFP4 serving (with SM121 CUTLASS patch)
GitHub-side issue tracking:
- NVIDIA/cutlass #2947 — CUTLASS FP4 MMA ops are restricted to sm_100a/sm_103a, explicitly excluding SM121 — [BUG] [Blackwell] Enable FP4/tcgen05 support for sm_121 (DGX Spark) in CuTe DSL · Issue #2947 · NVIDIA/cutlass · GitHub (related: #2800, #2802)
- NVIDIA/TensorRT-LLM #11368 — sm_121 FP4 unsupported — [Bug] FP4 CUTLASS GEMM fails on GB10 (SM121) — shared memory overflow from B200-sized tile configs · Issue #11368 · NVIDIA/TensorRT-LLM · GitHub
- flashinfer-ai/flashinfer #2776 — “NVFP4 MoE models crash on GB10 (SM121) during CUDA graph capture” — [Bug] NVFP4 MoE models crash on GB10 (SM121) during CUDA graph capture · Issue #2776 · flashinfer-ai/flashinfer · GitHub
- vLLM PR #37700 — community FLA fix for SM12x, 1.8× decode speedup measured on Spark, still not merged — [Bugfix] Fix FLA Hopper/TMA misclassification on SM12x desktop Blackwell by RobTand · Pull Request #37700 · vllm-project/vllm · GitHub
- eugr/spark-vllm-docker #143 — vLLM’s hardcoded `enable_sm120_only` guard excludes SM121 — vLLM FP8 crash on NVIDIA GB10 / DGX Spark (SM12.1) — "This kernel only supports sm120" · Issue #143 · eugr/spark-vllm-docker · GitHub
- NVIDIA/dgx-spark-playbooks #22 — closed without NVIDIA-affiliated technical resolution — DGX Spark (GB10/sm_121) currently lacks tcgen05, DSMEM, TMEM, and TMA/multicast support · Issue #22 · NVIDIA/dgx-spark-playbooks · GitHub
The customer community has built and shipped the patches. NVIDIA has not adopted them and has not provided official equivalents, let alone published a roadmap or reassured its customers.
The asymmetry
After reading every thread and GitHub issue linked above, I have not found a single response from a verifiable, badged NVIDIA staff member that addresses any of the following:
- A roadmap or timeline for SM121 NVFP4 software support reaching parity with SM100
- An official acknowledgement of the dense (non-sparse) FP4 peak on GB10 — the datasheet headline is “1 PFLOP at FP4 precision with sparsity”; the dense number is not published with equivalent prominence
- A commitment to upstream the community SM121 patches (PR #37700, Marlin NVFP4 backend, SM121 CUTLASS grouped-GEMM work) into NVIDIA’s first-party container images
- Clarification of which FP4 MMA functionality SM121 silicon actually implements vs what is software-disabled — community reads of CUTLASS source suggest hardware-level gaps; NVIDIA documentation does not confirm or deny
- An ETA on `nvcr.io/nvidia/vllm` images with native SM121 FP4 paths enabled by default
If an official response exists that I missed, please reply here with a link.
What I am asking for, specifically
I want, on-record, from someone empowered to speak for the DGX Spark product team or the CUTLASS / TensorRT-LLM engineering groups:
- Is SM121 NVFP4 parity with SM100 on the roadmap? Yes, no, or partial. If no, say so plainly so customers can architect around the limitation.
- Publish the dense FP4 peak on GB10. The sparsity-qualified “1 PFLOP” number is in the datasheet. Publish the dense number with equivalent prominence. Customers deserve to be able to compare like-for-like.
- Commit to upstreaming the community SM121 fixes — PR #37700, the Marlin NVFP4 backend, the SM121 CUTLASS grouped-GEMM work — into NVIDIA’s first-party container images, or ship official equivalents. Customers should not be the ones patching CUTLASS and FlashInfer to get NVFP4 to work on hardware NVIDIA sold for NVFP4. Seriously: I expected to load up models and spend my time doing real work, not to fight stability issues on a box that was marketed to just work.
- Clarify the silicon. One paragraph from NVIDIA on which FP4 MMA functionality SM121 hardware implements versus what is software-disabled would resolve the central technical question of this entire debate.
- Publish Nemotron-3-Super-120B-A12B-NVFP4 numbers on DGX Spark. Your own GTC 2026 blog calls DGX Spark “optimal” for this model. The model card lists DGX Spark as a supported deployment target. On a single Spark it runs at 19–22 tok/s, roughly half the bandwidth-limited ceiling for a 12B-active model on 273 GB/s memory. Either fix the kernels so the measured number approaches the achievable, or correct the marketing.
Why this matters beyond me
NVFP4 was the headline feature of this hardware. The model releases (Nemotron-3-Nano, Nemotron-3-Super) were built around it. The marketing (“5th Generation Tensor Cores”, “1 petaFLOP FP4”, “optimal for DGX Spark”) was built around it. I and many others bought into this platform specifically because NVIDIA positioned NVFP4 on GB10 as a first-class path.
On measured reality, NVFP4 on GB10 is slower than FP8, slower than AWQ, and slower than community-patched NVFP4. The fix appears to exist in community patches but is not in shipping NVIDIA software. The asymmetry between how much customer engineering work is on public record here and how little NVIDIA engagement is on public record here is not sustainable for a product NVIDIA continues to actively market under the Grace Blackwell brand.
NVIDIA built and sold this product with NVFP4 as the headline. Honor it, please.
I can provide additional benchmarks, traces, a repro environment from my fleet if that helps an NVIDIA engineer engage substantively. I would prefer a public reply on this thread so other customers can plan around the answer.
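As a starting point, the single-Spark repro is nothing exotic. Roughly the following (the model name is from the card cited above; the `--quantization` value and other flags are illustrative assumptions and will vary by vLLM version and container):

```shell
# Illustrative repro sketch — flags are assumptions, adjust per vLLM version.
# Serve the NVFP4 checkpoint NVIDIA's model card names for DGX Spark:
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --quantization modelopt \
    --max-model-len 8192

# Then measure steady-state decode throughput with any client that reports
# tokens/second; the 19-22 tok/s figures above are single-request decode rates.
```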
PLEASE, FIX THIS! And please give us some answers and commitments. The hardware is not cheap!
Thank you.