FP4 on DGX Spark — Why It Doesn't Scale Like You'd Expect

As baristankut and eugr have shown in their posts, and after a lot of benchmarking and digging into the architecture, here’s my summary (LLM-formatted, no molt).

FP16 → FP8: scales. Half the bytes, tensor cores handle it natively via the Ampere-style warp-level MMA path, real-world speedup ~1.3–1.7x. Solid.

FP8 → FP4: doesn’t scale. You’d expect another ~2x from halving the data again. But on SM12x three things have been cut compared to datacenter “Blackwell”:

  1. No tcgen05. The 5th gen tensor core instructions that make FP4 fast on B200 (SM100) don’t exist on the Spark. No Tensor Memory, no native FP4 compute path. FP4 gets dequantized to FP8/FP16 before computation, so the memory bandwidth saved is spent again on extra compute.

  2. 99 KB shared memory. SM100 has 228 KB + 256 KB TMEM. The Spark has 99 KB. Smaller buffer means smaller GEMM tiles, more frequent reloads from the already limited 273 GB/s bus. This hits everything — attention, MoE, even speculative decoding — but FP4 hardest.

  3. FP4 + small shared memory is the worst combo. It must be a feature.

And then there’s accuracy. NVIDIA has done impressive work with NVFP4’s micro-block scaling, but 4 bits is 4 bits. Others in the forum have already shown that the quality loss is noticeable. My math tests show it too.
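To make the "4 bits is 4 bits" point concrete, here is a minimal Python sketch of micro-block scaling. It assumes the publicly described NVFP4 layout (blocks of 16 values, each value stored as FP4 E2M1, one scale per block). As a simplification, the block scale is kept as a plain Python float instead of being rounded to FP8, and the per-tensor scale is left out. The function names are made up for illustration.

```python
# Magnitudes representable in FP4 E2M1 (sign is stored separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # values per micro-block


def quantize_block(block):
    """Quantize one 16-value block to (scale, list of E2M1 values)."""
    if len(block) != BLOCK:
        raise ValueError("expected a block of 16 values")
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    # Simplification: a real implementation rounds this scale to FP8.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the block scale and E2M1 codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    block = [6.0, 3.0, 1.0, 0.4] + [0.0] * 12
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("scale:", scale, "worst absolute error:", worst)
```

Only eight magnitudes exist per block, so every value that does not sit close to one of them picks up a rounding error no matter how good the scale is.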

Use FP8 when the model fits. Use FP4 when it doesn’t.

For me it is no surprise anymore that you read “minimum” so often when it comes to DGX Spark and performance. It is the best minimum you can have.

Has anyone seen different NVFP4 speedups?

Some results without the CUTLASS fixes (590 driver, 26.01 image, vLLM 0.15, SGLang 0.5.8):

DGX Spark (GB10, 273 GB/s LPDDR5x)

BF16:  273 / (3B * 2) = 273 / 6  =  45.5 tok/s
FP8:   273 / (3B * 1) = 273 / 3  =  91.0 tok/s
NVFP4: 273 / (3B * 0.5) = 273 / 1.5 = 182.0 tok/s   (won't work)
| Quant | MM kernel | Theory | vLLM | vLLM+EAGLE3 | SGLang | +EAGLE3 | +DFlash |
|---|---|---|---|---|---|---|---|
| BF16 (2B) | cuBLAS+Triton MoE | 45.5 | 28.8 [7] | 29.0 [7] | 31.7 | 16.4 | 17.6 |
| FP8 online (1B) | Triton FP8 MoE | 91 | 45.9 [M], 46.6 [C] | | 55.0 | 21.2 | X [5] |
| W8A8 ct (1B) | | 91 | | | X [1] | | |
| FP8 block (1B) | Triton FP8 MoE | 91 | | | 60.1! [2] | | |
| NVFP4 (0.5B) | CUTLASS FP4 [3b] | 182 | | | X [3] | | |

[M] = Marlin FP8 weight-only (no real FP8 compute on SM121)
[C] = CUTLASS FP8 scaled_mm (real FP8 compute, MoE still uses Marlin)

Spiegel 2 (RTX PRO 6000, 1800 GB/s GDDR7)

BF16:  1800 / (3B * 2) = 1800 / 6  =  300 tok/s
FP8:   1800 / (3B * 1) = 1800 / 3  =  600 tok/s
NVFP4: 1800 / (3B * 0.5) = 1800 / 1.5 = 1200 tok/s
| Quant | MM kernel | Theory | vLLM | vLLM+EAGLE3 | SGLang | +EAGLE3 | +DFlash |
|---|---|---|---|---|---|---|---|
| BF16 (2B) | cuBLAS / flashinfer | 300 | 143.2 | 159.0 | 169.8 | 115.3 | |
| W8A8 ct (1B) | | 600 | X [4] | | X [4] | | |
| FP8 block (1B) | Marlin FP8 / flashinfer | 600 | 125.9 | 182.2 | X [8] | X [8] | |
| NVFP4 (0.5B) | CUTLASS FP4 | 1200 | 121.4! [6] | 176.2! [6] | 136.4! [6] | | |

Footnotes

  • [1] SGLang FusedMoE: “Unsupported scheme” for INT8 compressed-tensors
  • [2] Block-FP8 Triton kernel: 60 tok/s but 0% math, garbage output on SM121 (! = quality compromised)
  • [3] NVFP4 on SM121: unpatched → CUDA Graph crash “Unsupported SM version: 121”. With CUTLASS 4.4 + admissible_archs patch → 356 TFLOPS (BTankut)
  • [3b] CUTLASS FP4 on SM121 proven (BTankut), needs CUTLASS 4.4 + sm_121a patch + tile tuning
  • [4] W8A8 INT8: “Int8 not supported on SM120” (vLLM) / “Unsupported FusedMoe scheme” (SGLang). INT8 only on SM<100
  • [5] The DFlash PR image has sgl-kernel 0.3.16 < required 0.3.20. FP8 online needs a newer sgl-kernel than the DFlash image provides
  • [6] NVFP4 on SM120 (RTX PRO 6000): 70-72% math (vs 80% for BF16/FP8). CUTLASS FP4 fallback, FlashInfer MoE kernels not available. Only ~10% of the theoretical 1200 tok/s. Vanilla 121.4, +EAGLE3 176.2 tok/s
  • [7] vLLM on DGX Spark: a unified memory profiler bug forces VLLM_USE_V1=0 --enforce-eager --gpu-memory-utilization 0.12 --max-model-len 4096. Not a fair comparison with SGLang (which has CUDA Graphs + a larger KV cache)
  • [8] SGLang FP8 block on SM120: DeepGEMM “Unknown recipe” + FlashInfer FP8 “capability 120 not supported”. No working FP8 block-GEMM kernel on SM120 in SGLang
  • [9] SGLang NVFP4+EAGLE3: ModelOpt NVFP4 loader is incompatible with the accelerate-based EAGLE3 loader. “state dictionary corrupted” because the ModelOpt FP4 keys are not recognized

I’d correct this to “use AWQ when it doesn’t”. AWQ (INT4) quants are fast, and the quality loss is very small because it keeps activations at FP16 and uses calibration-derived scales for the quantized weights.

Another follow-up: I fixed FlashInfer with CUTLASS 4.3+ (HEAD) to run FP8, so the FP8 → FP4 step can now be compared on the same matrix-multiplication technology.

Triton FP8 MoE is a bit slower, so the FP4 using cutlass looks like a performance gain, but it is not.

As far as I understand, consumer Blackwell (sm120/sm121) has native FP4 compute, just no tcgen05.
Instead of tensor memory, you can use shared memory or cuda registers.

@johnny_nv - feel free to chime in, you’ve spent a lot of time with this.

Which component handles it? It supports NVFP4 of course, but this does not improve memory throughput (everything you save is lost again in compute cycles). And decoding inside the even smaller shared memory is just more compute after all.

Tensor/CUDA cores support FP4 natively, so it should be possible to avoid dequantization. TBF, I’m not an expert in CUDA programming, so I’ll let others chime in :)

loading weights and activations using a smaller dtype is a big win on the Spark. We are often bandwidth limited.

I don’t know the sm121 ISA (I couldn’t find it online), but I did spend a minor amount of time reading SASS in cuda-gdb and quite a lot of time in the flashinfer and cutlass codebases.

There is native support for activations in fp8 and weights in fp4, at least. Not all combinations of activations and weights are available in flashinfer and cutlass. I’m not 100% sure whether it’s a limitation of the hardware or of flashinfer/cutlass, but I assume it’s hardware.

I believe it is true that there is no TMEM (tcgen05) support in the sm121.

After fixing CUTLASS, FlashInfer and sgl_kernel:

Goal

Complete comparison of all combinations of engine, speculation, quantization and platform
for Qwen3-Coder-30B-A3B-Instruct (MoE 30B, 3B active).

Reference

  • The DFlash author tests with RTX PRO 6000 + SGLang + LLaMA-3.1-8B: good throughput results,
    but problems with math accuracy under DFlash (the model no longer calculates correctly).
  • BTankut: GLM-4.7-FP8 on a DGX Spark cluster with EAGLE3 + SGLang, 20-27% speedup.

Platforms

| System | GPU | VRAM | Bandwidth | Arch |
|---|---|---|---|---|
| DGX Spark | GB10 | 128 GB unified | 273 GB/s LPDDR5x | SM121, aarch64 |
| Spiegel 2 | RTX PRO 6000 | 96 GB GDDR7 | 1800 GB/s | SM120, x86_64 |

Test matrix

DGX Spark (SM121, 273 GB/s)

| # | Engine | Speculation | Quant | MM kernel | Status | tok/s | Math |
|---|---|---|---|---|---|---|---|
| 1 | SGLang | none | BF16 | cuBLAS+Triton MoE | done | 31.7 | 78% |
| 2 | SGLang | EAGLE3 | BF16 | cuBLAS+Triton MoE | done | 16.4 | 78% |
| 3 | SGLang | DFlash | BF16 | cuBLAS+Triton MoE | done | 17.6 | 78% |
| 4 | SGLang | none | FP8 (block) | Triton FP8 MoE | done | 60.1 | 0% GARBAGE |
| 4b | SGLang | none | FP8 (online) | Triton FP8 MoE | done | 55.0 | 78% |
| 4c | SGLang | EAGLE3 | FP8 (online) | Triton FP8 MoE | done | 21.2 | 78% |
| 4d | SGLang | DFlash | FP8 (online) | Triton FP8 MoE | X [5] | | |
| 5 | SGLang | none | W8A8 (ct) | | X [1] | | |
| 6 | SGLang | EAGLE3 | W8A8 (ct) | | X [1] | | |
| 7 | SGLang | DFlash | W8A8 (ct) | | X [1] | | |
| 4e | vLLM | none | FP8 (online) | Marlin FP8 w-only | done | 45.9 | 78% |
| 4f | vLLM | none | FP8 (online) | CUTLASS scaled_mm | done | 46.6 | 80% |
| 8 | SGLang | none | NVFP4 | CUTLASS FP4 [3b] | blocked [3] | | |
| 9 | SGLang | EAGLE3 | NVFP4 | CUTLASS FP4 [3b] | blocked [3] | | |
| 10 | SGLang | DFlash | NVFP4 | CUTLASS FP4 [3b] | blocked [3] | | |
| 35 | vLLM-next | none | BF16 | TRITON MoE+FLASHINFER | done | 30.6 | 80% |
| 36 | vLLM-next | EAGLE3 | BF16 | TRITON MoE+FLASHINFER | done | 28.5 | 80% |
| 37 | vLLM-next | none | FP8 (online) | CUTLASS scaled_mm+TRITON FP8 MoE | done | 50.5 | 78% |
| 38 | vLLM-next | EAGLE3 | FP8 (online) | CUTLASS scaled_mm+TRITON FP8 MoE | done | 51.0 | 78% |
| 39 | vLLM-next | none | NVFP4 | FLASHINFER_CUTLASS | done [13] | 65.0 | 74% |
| 40 | vLLM-next | EAGLE3 | NVFP4 | FLASHINFER_CUTLASS | done [13] | 68.1 | 72% |
| 41 | SGLang-next | none | BF16 | cuBLAS+flashinfer | done | 31.1 | 78% |
| 42 | SGLang-next | EAGLE3 | BF16 | cuBLAS+flashinfer | done [14] | 13.1 | 78% |
| 43 | SGLang-next | DFlash | BF16 | cuBLAS+flashinfer | done [14] | 20.5 | 78% |
| 44 | SGLang-next | none | FP8 (online) | Triton FP8 MoE | done | 52.7 | 80% |
| 45 | SGLang-next | EAGLE3 | FP8 (online) | Triton FP8 MoE | done [14] | 24.5 | 78% |
| 46 | SGLang-next | DFlash | FP8 (online) | Triton FP8 MoE | done [14] | 41.4 | 80% |
| 47 | SGLang-next | none | NVFP4 | CUTLASS FP4 SM120 | done [19] | 66.0 | 74% |
| 48 | SGLang-next | EAGLE3 | NVFP4 | CUTLASS FP4 SM120 | done [14][19] | 27.4 | 70% |
| 49 | SGLang-next | DFlash | NVFP4 | CUTLASS FP4 SM120 | done [19] | 54.4 | 74% |

Spiegel 2 (SM120, 1800 GB/s)

| # | Engine | Speculation | Quant | MM kernel | Status | tok/s | Math |
|---|---|---|---|---|---|---|---|
| 11 | vLLM | none | BF16 | cuBLAS | done | 143.2 | 80% |
| 12 | vLLM | EAGLE3 | BF16 | cuBLAS | done | 159.0 | 80% |
| 13 | vLLM | none | FP8 (block) | Marlin FP8 | done | 125.9 | 80% |
| 14 | vLLM | EAGLE3 | FP8 (block) | Marlin FP8 | done | 182.2 | 80% |
| 15 | vLLM | none | W8A8 (ct) | | X [4] | | |
| 16 | vLLM | EAGLE3 | W8A8 (ct) | | X [4] | | |
| 17 | SGLang | none | BF16 | cuBLAS+flashinfer | done | 169.8 | 78% |
| 17b | SGLang-next | none | BF16 | cuBLAS+flashinfer | done | 169.0 | 78% |
| 18 | SGLang | EAGLE3 | BF16 | cuBLAS+flashinfer | done | 115.3 | 78% |
| 18b | SGLang-next | EAGLE3 | BF16 | cuBLAS+flashinfer | done | 80.2 | 78% |
| 19 | SGLang-next | DFlash | BF16 | cuBLAS+flashinfer | done | 117.3 | 78% |
| 20 | SGLang | none | FP8 (block) | flashinfer FP8 | X [8] | | |
| 21 | SGLang | EAGLE3 | FP8 (block) | flashinfer FP8 | X [8] | | |
| 22 | SGLang | DFlash | FP8 (block) | flashinfer FP8 | X [8] | | |
| 23 | SGLang | none | W8A8 (ct) | | X [4] | | |
| 24 | SGLang | EAGLE3 | W8A8 (ct) | | X [4] | | |
| 25 | SGLang | DFlash | W8A8 (ct) | | X [4] | | |
| 26 | SGLang | none | NVFP4 | CUTLASS FP4 SM120 | done | 136.4 | 72% |
| 27 | SGLang | EAGLE3 | NVFP4 | CUTLASS FP4 SM120 | X [9] | | |
| 28 | SGLang | DFlash | NVFP4 | CUTLASS FP4 SM120 | pending | | |
| 29 | vLLM-next | none | BF16 | TRITON MoE+FLASH_ATTN | done | 140.9 | 80% |
| 30 | vLLM-next | EAGLE3 | BF16 | TRITON MoE+FLASH_ATTN | done | 147.4 | 80% |
| 31 | vLLM-next | none | FP8 (block) | TRITON FP8 MoE | done | 135.7 | 78% |
| 32 | vLLM-next | EAGLE3 | FP8 (block) | TRITON FP8 MoE | done | 166.5 | 80% |
| 33 | vLLM-next | none | NVFP4 | FLASHINFER_CUTLASS | done | 157.9 | 80% |
| 34 | vLLM-next | EAGLE3 | NVFP4 | FLASHINFER_CUTLASS | done | 183.4 | 74% |
| 50 | SGLang-next | none | FP8 (block) | flashinfer FP8 | X [8] | | |
| 51 | SGLang-next | EAGLE3 | FP8 (block) | flashinfer FP8 | X [8] | | |
| 52 | SGLang-next | DFlash | FP8 (block) | flashinfer FP8 | X [8] | | |
| 53 | SGLang-next | none | NVFP4 | CUTLASS FP4 SM120 | done [17] | 139.5 | 74% |
| 54 | SGLang-next | EAGLE3 | NVFP4 | CUTLASS FP4 SM120 | done [14][17] | 84.1 | 70% |
| 55 | SGLang-next | DFlash | NVFP4 | CUTLASS FP4 SM120 | done [17] | 166.8 | 70% |
| 56 | SGLang-next | none | FP8 (online) | | X [15] | | |
| 57 | SGLang-next | EAGLE3 | FP8 (online) | | X [15] | | |
| 58 | SGLang-next | DFlash | FP8 (online) | | X [15] | | |

Quantization formats

| Short name | Format | quant_method | Bytes/param | Model path |
|---|---|---|---|---|
| BF16 | BFloat16 | - | 2 | Qwen3-Coder-30B-A3B-Instruct |
| FP8 (block) | Block FP8 [128,128] | fp8 (native HF) | 1 | qwen3-coder-30B-fp8 |
| FP8 (online) | Per-tensor dynamic FP8 | --quantization fp8 | 1 | Qwen3-Coder-30B-A3B-Instruct (BF16 + online quant) |
| W8A8 (ct) | Per-channel INT8 | compressed-tensors | 1 | Qwen3-Coder-30B-A3B-Instruct-W8A8 |
| NVFP4 | NVIDIA FP4 | modelopt | 0.5 | Qwen3-Coder-30B-A3B-Instruct-FP4 |

GEMM kernel libraries by quantization

Which library performs the matrix multiplication (GEMM) for which quantization format?

| Quant | Kernel lib | Description | SM90 (Hopper) | SM100 (Blackwell) | SM121 (GB10) |
|---|---|---|---|---|---|
| BF16 | cuBLAS | Standard NVIDIA BLAS | Yes | Yes | Yes |
| BF16 | Triton MoE | Triton fused MoE kernel | Yes | Yes | Yes |
| FP8 | DeepGEMM | JIT FP8 GEMM (DeepSeek), needs tcgen05.mma | Yes | SM100 DC only | No |
| FP8 | CUTLASS FP8 | NVIDIA CUTLASS | Yes | Yes | Possible (SM120 support) |
| FP8 | Triton FP8 | Triton fallback (slow) | Yes | Yes | Yes |
| FP8 | torch._scaled_mm | PyTorch native fallback | Yes | Yes | Yes |
| W8A8 INT8 | CUTLASS INT8 | NVIDIA CUTLASS | Yes | No | No |
| NVFP4 | FlashInfer FP4 MoE | FlashInfer + CUTLASS FP4 | ? | Should | Yes [13] |
| NVFP4 | CUTLASS FP4 | CUTLASS blockscaled NVFP4 | Yes | Yes | Possible (SM120 support) |
| NVFP4 | CuTe DSL GEMM | Python-based CUTLASS kernels | ? | Yes | Possible (SM120 support) |
| NVFP4 | Marlin FP4 | Marlin kernel | Yes | ? | No |

Current status on our platforms

| Platform | Active FP8 kernel | Active NVFP4 kernel | Consequence |
|---|---|---|---|
| Spiegel 2 (SM120) | Triton FP8 (fallback!) | CUTLASS FP4 (fallback!) | FP8 slower than BF16, NVFP4 only 10% of theory |
| DGX Spark (SM121) | Triton FP8 | FLASHINFER_CUTLASS [13] | FP8 online + NVFP4 CUTLASS works! 65.0 tok/s |

Strategic assessment: CUTLASS vs DeepGEMM

| Criterion | DeepGEMM | CUTLASS |
|---|---|---|
| SM120 support | No (tcgen05.mma missing) | Yes (SM120 family) |
| SM121 support | No (tcgen05.mma missing) | Yes (with patch) |
| FP8 GEMM | Yes (SM90+SM100 DC only) | Yes (SM90+SM100+SM120+SM121) |
| NVFP4 GEMM | No | Yes (blockscaled dense, grouped, sparse) |
| Barrier to entry | Low (JIT) | Lowered by CuTe DSL (Python-based) |
| vLLM integration | Yes | Yes |
| SGLang integration | Yes | Yes (CuteDSL-GEMM replaces FP8-DeepGEMM for FP4) |

Conclusion: CUTLASS is the only viable path for SM120 and SM121. DeepGEMM works ONLY on SM90 (Hopper) and SM100 (datacenter Blackwell).

DeepGEMM: result

DeepGEMM was tested on both platforms and fails on both:

  • SM120 (RTX PRO 6000): tcgen05.mma not supported on .target sm_120a
  • SM121 (DGX Spark GB10): tcgen05.mma not supported on .target sm_121a
  • tcgen05.mma/fence = Tensor Core Gen 5 instructions, exclusive to SM100 datacenter
  • Software patches (arch routing, symbol extraction) work, but the GEMM kernels themselves do not compile

SM120/SM121 architecture detail (corrected after BTankut)

SM120/SM121 have no tcgen05, no multicast, no 2-SM MMA (CTA pairs).
The tensor cores are capable (356 TFLOPS NVFP4, 188 TFLOPS FP8) but architecturally
different from SM100 (datacenter Blackwell).

Differences from SM100:

  • tcgen05.mma PTX instruction: not supported on sm_120a/sm_121a
  • No multicast, no CTA pairs, TN layout only, no PDL/GDC
  • SMEM limits: 101,376 bytes/SM (≈99 KiB, like RTX 4090) vs 228 KB on SM100
  • Cluster fixed at 1x1x1

How does CUTLASS work then?
CUTLASS 3x (CuTe API) has its own SM120 kernel paths (e.g. gemm_grouped_sm120_M128_BS_group1)
that drive the SM120 tensor cores through the CuTe abstraction, without tcgen05.mma PTX.
It needs sm_121a in BlockScaledMmaOp.admissible_archs (CUTLASS Issue #2800).

Consequence:

  • CUTLASS 3x (CuTe API): works on SM120/SM121 (FP8, NVFP4, BF16 grouped GEMM)
  • DeepGEMM (inline PTX tcgen05.mma): fails on SM120/SM121
  • Triton: treats SM121 as SM80 (fallback)
  • Default MoE configs of SGLang/vLLM: crash on GB10 (they need ~147 KB SMEM, only 101 KB is available)

CUTLASS SM120/SM121: prerequisites (BTankut analysis)

Based on BTankut's work (GitHub: BTankut/dgx-spark-sglang-moe-configs) for SM121:

Shared memory limits, the central constraint:

| Platform | SM | SMEM/SM | SMEM/block | SMs | Consequence |
|---|---|---|---|---|---|
| DGX Spark (GB10) | SM121 | 101,376 B (~99 KB) | 48 KB | 48 | Default tiles crash |
| Spiegel 2 (RTX PRO 6000) | SM120 | 101,376 B (~99 KB) | 48 KB | 188 | Same limits! |
| Datacenter (B200) | SM100 | 228 KB | 164 KB | 168 | Default tiles fit |

4 problems that have to be solved:

| # | Problem | Description | Solution |
|---|---|---|---|
| 1 | CUTLASS does not recognize SM121 | BlockScaledMmaOp.admissible_archs lacks SM121 | One-line patch: add sm_121a (CUTLASS Issue #2800) |
| 2 | Default tile sizes exceed SMEM | SGLang/vLLM heuristics assume 128-228 KB, SM120/121 has only about 100 KB | Tile sweep within the 100 KB budget |
| 3 | No MoE configs for GB10 | SGLang has device-specific config files; none existed for NVIDIA_GB10 | 4 config files generated (2 per Triton version) |
| 4 | EAGLE crashes without tuned configs | EAGLE speculative decoding produces larger batches → triggers SMEM overflow | Tuned configs are a prerequisite for EAGLE, not just an optimization |
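
To illustrate what a tile sweep inside the shared memory budget means, here is a rough Python sketch. The memory model is an assumption made for illustration only: each pipeline stage holds one A tile (M×K) and one B tile (K×N), and nothing else (scales, epilogue buffers, barriers) is counted. Real CUTLASS and Triton kernels have additional overhead, so treat the numbers as a lower bound. The candidate tile sizes and function names are hypothetical.

```python
from itertools import product

# Per-SM shared memory on SM120/SM121 as quoted in this thread.
SMEM_BUDGET_BYTES = 101_376


def tile_smem_bytes(block_m, block_n, block_k, bytes_per_elem, stages):
    """Assumed model: every pipeline stage buffers one A tile and one B tile."""
    per_stage = (block_m * block_k + block_k * block_n) * bytes_per_elem
    return int(per_stage * stages)


def sweep(bytes_per_elem, budget=SMEM_BUDGET_BYTES):
    """Return all candidate configs that fit, largest output tile first."""
    fits = []
    for m, n, k, s in product((64, 128, 256), (64, 128, 256), (32, 64, 128), (2, 3, 4)):
        need = tile_smem_bytes(m, n, k, bytes_per_elem, s)
        if need <= budget:
            fits.append((m, n, k, s, need))
    return sorted(fits, key=lambda c: (c[0] * c[1], c[2] * c[3]), reverse=True)


if __name__ == "__main__":
    # FP8 operands: 1 byte per element.
    for m, n, k, stages, need in sweep(bytes_per_elem=1)[:5]:
        print(f"{m}x{n}x{k}, {stages} stages -> {need} bytes")
```

A config tuned for a 228 KB part can easily ask for more than the budget under this model, which is the kind of overflow the default configs run into.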

BTankut results (SM121, GLM-4.7-FP8, 4x DGX Spark TP=4):

| Scenario | MoE configs | EAGLE | Result |
|---|---|---|---|
| A | Optimized | Off | 16.77 tok/s |
| B | Default | Off | 15.77 tok/s (-6.3%) |
| C | Default | On | OutOfResources CRASH |
| D | Optimized | On | 20-27 tok/s |

CUTLASS GEMM peak performance (SM121, BTankut):

  • NVFP4 dense: 356 TFLOPS (71% of the 500 TFLOPS dense FP4 peak, 1 PFLOPS spec incl. sparsity)
  • FP8 dense: 188 TFLOPS
  • MoE grouped GEMM (8 & 64 experts): 120-154 TFLOPS (tile-dependent)
    • Tile 256×128: 154 TFLOPS (optimal for prefill / large batches)
    • Tile 128×128: ~147 TFLOPS (optimal for decode / small batches)

Recipe for SM120/SM121: CUTLASS 4.4 + admissible_archs patch + tile tuning (101 KB SMEM budget)

BTankut Docker image: ghcr.io/btankut/sglang-spark-glm47:latest (MoE configs + patches preinstalled)

What is missing / what would help

| Action | Expected effect |
|---|---|
| DeepGEMM on Spiegel 2 | FAILED: tcgen05.mma not available on SM120a |
| DeepGEMM on DGX Spark | FAILED: tcgen05.mma not available on SM121a |
| vLLM 26.01 on Spiegel 2 | Image pull fails due to NIC corruption (tls: bad record MAC) |
| FP8 CUTLASS per-tensor on SM120 | FAILED: adding DataType.e4m3 to SM120 does not compile. The SM120 kernel templates use the block-scaled collective builder, FP8×FP8 needs the standard builder → CUTLASS template error. The Python gates (5 patches) work and FLASHINFER_CUTLASS is selected, but JIT kernel compilation fails. Needs its own SM120 FP8×FP8 kernel template |
| CUTLASS FP8 block-scale on SM120 | Dropped: CutlassFp8BlockScaleGemmRunner is Hopper-only, not portable without TRT-LLM kernel work |
| Enable CUTLASS FP8/NVFP4 on SM121 | BTankut configs already exist, integration into SGLang/vLLM. BTankut: “CUTLASS 4.4.0 with sm_121a target works” |
| Build sgl_kernel for SM120 | SGLang on Spiegel 2 needs SM120 kernels (the pip version only has SM100) |
| FlashInfer FP4 MoE on SM120 | NVFP4 significantly faster (>300 tok/s instead of 121) |

Theoretical limits (zero context, single request)

| Platform | BF16 | FP8/W8A8 | NVFP4 |
|---|---|---|---|
| DGX Spark (273 GB/s) | 45.5 tok/s | 91 tok/s | 182 tok/s |
| Spiegel 2 (1800 GB/s) | 300 tok/s | 600 tok/s | 1200 tok/s |

Draft models

| Drafter | Path | Size |
|---|---|---|
| EAGLE3 | /data/tensordata/SGLang-EAGLE3-Qwen3-Coder-30B-A3B | ~500 MB |
| DFlash | /data/tensordata/Qwen3-Coder-30B-A3B-DFlash | ~900 MB |

Result tables (goal)

At the end there are two tables, one per platform. All values in tok/s (long, ~400 tokens).

Theory formula (memory-bound, single request, zero context)

tok/s = Bandwidth / (active_parameters * bytes_per_parameter)
  • Model: Qwen3-Coder-30B-A3B-Instruct (MoE, 3B active parameters per token)
  • BF16: 2 bytes/param → 3B * 2B = 6 GB per forward pass
  • FP8/W8A8: 1 byte/param → 3B * 1B = 3 GB per forward pass
  • NVFP4: 0.5 bytes/param → 3B * 0.5B = 1.5 GB per forward pass
  • Context/KV cache overhead is not considered here (zero context)
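
The formula above is easy to reproduce. The following Python sketch recomputes the theoretical limits for both platforms and, as an extra, the fraction of the bound that a measured value reaches. The bandwidths and byte sizes come from this thread; the function names are only illustrative.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


def fraction_of_bound(measured_tok_s, bound_tok_s):
    """How much of the memory-bound limit a measured run reaches (0..1)."""
    return measured_tok_s / bound_tok_s


PLATFORMS = {"DGX Spark": 273.0, "Spiegel 2": 1800.0}      # GB/s
FORMATS = {"BF16": 2.0, "FP8/W8A8": 1.0, "NVFP4": 0.5}     # bytes per parameter


def theory_table(active_params_b=3.0):
    """Theoretical tok/s per platform and format for a model with 3B active params."""
    return {
        platform: {
            fmt: round(bandwidth_bound_tok_s(bw, active_params_b, nbytes), 1)
            for fmt, nbytes in FORMATS.items()
        }
        for platform, bw in PLATFORMS.items()
    }


if __name__ == "__main__":
    for platform, row in theory_table().items():
        print(platform, row)
    # Example: the 65.0 tok/s NVFP4 run on DGX Spark against its 182 tok/s bound.
    print(f"{fraction_of_bound(65.0, 182.0):.1%} of the NVFP4 bound")
```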

Legend:

  • X = broken/unusable (does not start, or produces garbage)
  • ! = math plausibility test failed (the tok/s value comes with quality compromises)
  • Blank = not yet tested

DGX Spark (GB10, 273 GB/s LPDDR5x)

BF16:  273 / (3B * 2) = 273 / 6  =  45.5 tok/s
FP8:   273 / (3B * 1) = 273 / 3  =  91.0 tok/s
NVFP4: 273 / (3B * 0.5) = 273 / 1.5 = 182.0 tok/s
| Quant | MM kernel | Theory | vLLM | vLLM+E3 | vLLM-next | +E3 | SGLang | +EAGLE3 | +DFlash | SG-next [14] | +E3 [14] | +DF [14] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BF16 (2B) | cuBLAS / TRITON MoE | 45.5 | 28.8 [7] | 29.0 [7] | 30.6 | 28.5 | 31.7 | 16.4 | 17.6 | 31.1 | 13.1 | 20.5 |
| FP8 online (1B) | CUTLASS+TRITON FP8 MoE | 91 | 45.9 [M], 46.6 [C] | | 50.5 | 51.0 | 55.0 | 21.2 | X [5] | 52.7 | 24.5 | 41.4 |
| W8A8 ct (1B) | | 91 | | | | | X [1] | | | | | |
| FP8 block (1B) | Triton FP8 MoE | 91 | | | | | 60.1! [2] | | | | | |
| NVFP4 (0.5B) | FLASHINFER_CUTLASS / CUTLASS FP4 | 182 | | | 65.0 [13] | 68.1 [13] | X [3] | | | 66.0! [19] | 27.4! [14][19] | 54.4! [19] |

[M] = Marlin FP8 weight-only (no real FP8 compute on SM121)
[C] = CUTLASS FP8 scaled_mm (real FP8 compute, MoE still uses Marlin)

Spiegel 2 (RTX PRO 6000, 1800 GB/s GDDR7)

BF16:  1800 / (3B * 2) = 1800 / 6  =  300 tok/s
FP8:   1800 / (3B * 1) = 1800 / 3  =  600 tok/s
NVFP4: 1800 / (3B * 0.5) = 1800 / 1.5 = 1200 tok/s
| Quant | MM kernel | Theory | vLLM | vLLM+E3 | vLLM-next | +E3 | SGLang | +EAGLE3 | +DFlash | SG-next [14] | +E3 [14] | +DF [14] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BF16 (2B) | cuBLAS / TRITON MoE [10] / flashinfer | 300 | 143.2 | 159.0 | 140.9 [10] | 147.4 [10] | 169.8 | 115.3 | | 169.0 | 80.2 | 117.3 |
| W8A8 ct (1B) | | 600 | X [4] | | | | X [4] | | | | | |
| FP8 block (1B) | Marlin→Triton [10] | 600 | 125.9 | 182.2 | 135.7 [10] | 166.5 [10] | X [8] | X [8] | | X [8] | X [8] | X [8] |
| FP8 online (1B) | | 600 | | | | | | | | X [15] | X [15] | X [15] |
| NVFP4 (0.5B) | CUTLASS→FI_CUTLASS [10] | 1200 | 121.4! [6] | 176.2! [6] | 157.9 [10] | 183.4! [10] | 136.4! [6] | | | 139.5! [17] | 84.1! [14][17] | 166.8! [17] |

Footnotes

  • [1] SGLang FusedMoE: “Unsupported scheme” for INT8 compressed-tensors
  • [2] Block-FP8 Triton kernel: 60 tok/s but 0% math, garbage output on SM121 (! = quality compromised)
  • [3] NVFP4 on SM121: unpatched → CUDA Graph crash “Unsupported SM version: 121”. With CUTLASS 4.4 + admissible_archs patch → 356 TFLOPS (BTankut)
  • [3b] CUTLASS FP4 on SM121 proven (BTankut), needs CUTLASS 4.4 + sm_121a patch + tile tuning
  • [4] W8A8 INT8: “Int8 not supported on SM120” (vLLM) / “Unsupported FusedMoe scheme” (SGLang). INT8 only on SM<100
  • [5] The DFlash PR image has sgl-kernel 0.3.16 < required 0.3.20. FP8 online needs a newer sgl-kernel than the DFlash image provides
  • [6] NVFP4 on SM120 (RTX PRO 6000): 70-72% math (vs 80% for BF16/FP8). CUTLASS FP4 fallback, FlashInfer MoE kernels not available. Only ~10% of the theoretical 1200 tok/s. Vanilla 121.4, +EAGLE3 176.2 tok/s
  • [7] vLLM on DGX Spark: a unified memory profiler bug forces VLLM_USE_V1=0 --enforce-eager --gpu-memory-utilization 0.12 --max-model-len 4096. Not a fair comparison with SGLang (which has CUDA Graphs + a larger KV cache)
  • [8] SGLang FP8 block on SM120: DeepGEMM “Unknown recipe” + FlashInfer FP8 “capability 120 not supported”. No working FP8 block-GEMM kernel on SM120 in SGLang
  • [9] SGLang NVFP4+EAGLE3: ModelOpt NVFP4 loader is incompatible with the accelerate-based EAGLE3 loader. “state dictionary corrupted” because the ModelOpt FP4 keys are not recognized
  • [10] vLLM-next = localhost/vllm-next (based on nvcr.io/nvidia/vllm:26.01-py3, vLLM v0.1.dev1). Backends according to the container logs: BF16: MoE=TRITON (Unquantized), Attn=FLASH_ATTN. FP8: MoE=TRITON FP8, Attn=FLASH_ATTN. NVFP4: MoE=FLASHINFER_CUTLASS, Dense=FLASHINFER_CUTLASS, Attn=FLASHINFER. NVFP4 +30% vs old (157.9 vs 121.4), BF16 about the same (140.9 vs 143.2)
  • [11] FP8 CUTLASS on SM120, FAILED in 2 ways:
    • Root cause: FlashInfer generate_kernels.py:749 does not instantiate FP8×FP8 kernels for SM120. supported_dtypes = [e2m1, (e4m3, e2m1)], i.e. only FP4×FP4 and FP8×FP4. SM100 has DataType.e4m3 (FP8×FP8), SM120 does not.
    • Path A (FP8 per-tensor, attempted): 5 Python patches in vLLM/FlashInfer succeeded: backend selection, QuantScheme, import, activation scales default (1.0), dtype fix. FLASHINFER_CUTLASS is selected correctly, the model loads (29 GB, 21s). JIT compilation fails: the SM120 block-scaled collective builder is template-incompatible with FP8×FP8. Error: sm90_gemm_tma_warpspecialized_cooperative.hpp: "ProblemShape should be <M,N,K> or <M,N,K,L>". Needs its own SM120 FP8×FP8 kernel template (C++/CUTLASS development).
    • Path B (FP8 block-scale, dropped): CutlassFp8BlockScaleGemmRunner is Hopper-only, not portable without TRT-LLM kernel work.
    • Conclusion: FP8 CUTLASS on SM120 is not possible without a new C++ kernel template. TRITON FP8 remains the only fallback (135.7 tok/s vanilla, 166.5 tok/s +EAGLE3)
  • [12] NVFP4 on SM121 (DGX Spark), ORIGINAL ERROR: FlashInfer CUTLASS JIT compiles with sm_121a (__CUDA_ARCH__=1210), but the SM120 CUTLASS guards check __CUDA_ARCH__==1200. CUTLASS_ARCH_MMA_SM120_ENABLED is NOT set → some kernel templates generate invalid instructions. Symptoms: “Failed to initialize cutlass TMA WS grouped gemm” + “CUDA error: illegal instruction”. SOLVED with FLASHINFER_CUDA_ARCH_LIST="12.0a 12.1a": dual-arch compilation generates both SM120 and SM121 binaries, and the CUDA runtime picks the correct binary per kernel. See [13].
  • [13] NVFP4 CUTLASS on SM121 WORKS with the FLASHINFER_CUDA_ARCH_LIST="12.0a 12.1a" env var. It compiles the FlashInfer CUTLASS JIT for both arch targets at once. SM121-specific kernels (FP4 dense GEMM) use sm_121a, SM120 CUTLASS templates (MoE grouped GEMM) use sm_120a. First start takes ~10 min (JIT compilation), cached afterwards. Result: 65.0 tok/s, 74% math, the fastest configuration on DGX Spark (+29% vs FP8 online at 50.5 tok/s). 74% math is to be expected for NVFP4 (Spiegel 2 has 72-80%).
  • [14] SGLang-next = localhost/sglang-next (DFlash PR #16818, sgl-kernel 0.3.21, vLLM 26.01 base). SGLang speculative decoding is massively slower than vanilla on ALL platforms and quantizations. Spiegel 2 BF16: EAGLE3 -53% (80.2 vs 169.0), DFlash -31% (117.3 vs 169.0). DGX BF16: EAGLE3 -58% (13.1 vs 31.1), DFlash -34% (20.5 vs 31.1). DGX FP8: EAGLE3 -54% (24.5 vs 52.7), DFlash -21% (41.4 vs 52.7). For comparison, vLLM-next EAGLE3 on DGX gives +1% on FP8 (51.0 vs 50.5) and -7% on BF16 (28.5 vs 30.6); on Spiegel 2 it gives +5% on BF16 (147.4 vs 140.9). SGLang's scheduling overhead for speculation is prohibitive.
  • [15] SGLang-next FP8 online on Spiegel 2: OOM. The BF16 model (57 GB) is loaded, and online FP8 quantization uses 85 GB of the 96 GB GPU in total.
  • [16] SGLang-next NVFP4 (old): ModelOpt routing bug. Fix: set quant_method to modelopt_fp4 in config.json. On SM120 (Spiegel 2) NVFP4 works afterwards (see [17]). On SM121 (DGX) sgl_kernel fails (see [18]).
  • [17] SGLang-next NVFP4 on SM120 (Spiegel 2): WORKS after the quant_method: "modelopt_fp4" fix in config.json. CUTLASS FP4 SM120 kernels from sgl_kernel. Vanilla 139.5 tok/s (74%), DFlash 166.8 tok/s (70%), EAGLE3 84.1 tok/s (70%). DFlash is +20% faster than vanilla, the only platform/engine combination where SGLang speculation helps with NVFP4!
  • [18] (obsolete, see [19]) SGLang-next NVFP4 on SM121 failed at sgl_kernel.cutlass_fp4_group_mm: Unsupported SM version: 121. Solved with an LD_PRELOAD shim.
  • [19] SGLang-next NVFP4 on SM121 (DGX): WORKS with LD_PRELOAD sm120_shim.so (~/next/sm120_shim.cpp). The shim intercepts cudaDeviceGetAttribute and reports SM12.0 instead of SM12.1. The CUTLASS FP4 kernels in sgl_kernel are compiled for sm_120a and run correctly on SM121 (same tensor core architecture). Additionally needs quant_method: "modelopt_fp4" in config.json [17]. Podman flags: -v ~/next/sm120_shim.so:/opt/sm120_shim.so:ro -e LD_PRELOAD=/opt/sm120_shim.so. Result: vanilla 66.0 tok/s (74%), DFlash 54.4 tok/s (74%), EAGLE3 27.4 tok/s (70%). Vanilla is on par with vLLM-next (65.0)!

This one? 1. Overview — cuda-binary-utilities 13.1 documentation

It is for Blackwell in general, but it lists both sm100 and sm120. Since sm121 is nearly identical to sm120 (other than unified memory quirks), it should be applicable.

Best proof. Thanks for the link. The tmem[URX] addressing format listed under valid destinations confirms it — Tensor Memory is a first-class address space on SM100. On SM120/SM121, this address space simply does not exist.

FP4/FP8 MMA Instructions

In the regular floating point section, we find:

  • OMMA — “FP4 Matrix Multiply and Accumulate Across a Warp”
  • QMMA — “FP8 Matrix Multiply and Accumulate Across a Warp”
  • HMMA — “Matrix Multiply and Accumulate” (FP16, present since Turing)

OMMA and QMMA are warp-level MMA instructions — these work on SM120/SM121. These are the tensor core instructions that CUTLASS accesses through the CuTe abstraction layer.

The UTC*MMA instructions (UTCHMMA, UTCOMMA, UTCQMMA) are the tensor memory coupled variants — they read/write directly from/to TMEM. These are the instructions that correspond to the tcgen05.mma PTX. They exist only on SM100!!!

What this means for the analysis

The difference:

SM120/SM121 (DGX Spark + RTX PRO 6000):

  • Uses OMMA (FP4) and QMMA (FP8) — warp-level MMA
  • Data comes from registers, loaded from shared memory
  • FP4 must go through: Shared Memory → Registers → unpack → Tensor Core

SM100 (B200 Datacenter):

  • Uses UTCOMMA (FP4) and UTCQMMA (FP8) — tensor memory coupled
  • Data flows: Shared Memory → UTCCP → Tensor Memory → UTCOMMA → computed directly
  • FP4 is expanded inside the tensor core itself, no register/shared memory overhead

This is the ISA-level confirmation of everything discussed. OMMA vs UTCOMMA: that is the difference between “FP4 works but doesn’t scale” and “FP4 scales natively.” NVIDIA documents this publicly, just without the footnote that “SM120 doesn’t have UTCOMMA.”

The best chance of getting this to work properly is NVIDIA’s CUTLASS. They should know best how to handle it…

And yes, my tests show minor performance gains when CUTLASS, tile tuning, speculative decoding, and the fixes and patches (see footnotes) all come together.

And… keep an eye on the math test results. They degrade, on the DGX much more than on the RTX. This could be driver-related, as I reported regarding tooling with larger context. It is a kind of … very good challenge.

From a progressive perspective, it could be cheaper and leave more room for optimization if it’s done in software rather than in separate pipelines in hardware. But then the business moves from selling the product to selling and supporting it. Of course, it’s an even harder challenge if the room that’s left for compensating for the missing hardware in software is also smaller. Let’s teleport 12 months into the future … will it be widely implemented by then, or will it have stayed in its niche? But these questions have been asked in different threads already.

There is some work going on with flashinfer and CuTeDSL, e.g. this one: [Draft][Cute,Fwd,Sm120] FA Cute DSL sm12x by johnnynunez · Pull Request #2222 · Dao-AILab/flash-attention · GitHub

This is awesome. Also, it led me to this: PTX ISA 9.1 documentation (1. Introduction)

further findings…

EAGLE3 itself doesn’t degrade quality (BF16/FP8 stays at 80% with and without EAGLE3). But in NVFP4+EAGLE3 interaction: the drafter runs in BF16, the target in NVFP4 — the hidden state forwarding suffers from FP4 quantization, leading to subtly different verification outcomes. So it degrades.

Qwen3-Coder Speculative Decoding Benchmarks

BF16: stable, no effect from speculation

| Platform | Engine | Spec | tok/s | Math |
|---|---|---|---|---|
| DGX | SGLang | none | 31.7 | 78% |
| DGX | SGLang | EAGLE3 | 16.4 | 78% |
| DGX | SGLang | DFlash | 17.6 | 78% |
| DGX | vLLM | EAGLE3 NST=1 | 33.2 | 80% |
| DGX | vLLM | EAGLE3 NST=2 | 24.2 | 80% |
| DGX | vLLM | EAGLE3 NST=3 | 28.7 | 80% |
| DGX | vLLM | EAGLE3 NST=4 | 17.3 | 80% |
| DGX | vLLM | EAGLE3 NST=5 | 18.6 | 80% |
| Spiegel 2 | vLLM | none | 143.2 | 80% |
| Spiegel 2 | vLLM | EAGLE3 | 159.0 | 80% |

BF16 math is stable at 78–80%, with no effect from EAGLE3 or DFlash.


FP8 + speculation: NO degradation

| Platform | Engine | Spec | tok/s | Math |
|---|---|---|---|---|
| DGX | SGLang | none | 55.0 | 78% |
| DGX | SGLang | EAGLE3 | 21.2 | 78% |
| DGX | SGLang | DFlash | | 80% |
| DGX | vLLM | EAGLE3 NST=1 | 57.3 | 80% |
| DGX | vLLM | EAGLE3 NST=2 | 47.1 | 82% |
| DGX | vLLM | EAGLE3 NST=3 | 52.8 | 78% |
| DGX | vLLM | EAGLE3 NST=4 | 40.5 | 80% |
| DGX | vLLM | EAGLE3 NST=5 | 36.8 | 80% |
| Spiegel 2 | vLLM | none | 125.9 | 80% |
| Spiegel 2 | vLLM | EAGLE3 | 182.2 | 80% |
| Spiegel 2 | vLLM | EAGLE3 NST=1 | 160.2 | 80% |
| Spiegel 2 | vLLM | EAGLE3 NST=2 | 171.4 | 80% |
| Spiegel 2 | vLLM | EAGLE3 NST=3 | 173.4 | 80% |
| Spiegel 2 | vLLM | EAGLE3 NST=4 | 156.3 | 80% |

FP8 math is stable at 78–82%, with no effect from speculation regardless of NST.


NVFP4 + speculation: DEGRADATION of 2–12 percentage points

| Platform | Engine | Spec | tok/s | Math | Delta |
|---|---|---|---|---|---|
| DGX | vLLM | none | 65.0 | 74% | Baseline |
| DGX | vLLM | EAGLE3 NST=1 | 72.5 | 72% | -2pp |
| DGX | vLLM | EAGLE3 NST=3 | 67.0 | 68% | -6pp |
| DGX | vLLM | EAGLE3 NST=4 | 48.1 | 76% | +2pp |
| DGX | SGLang | none | 66.0 | 74% | Baseline |
| DGX | SGLang | EAGLE3 | 27.4 | 70% | -4pp |
| DGX | SGLang | DFlash | 54.4 | 74% | 0 |
| Spiegel 2 | vLLM | none | 157.9 | 80% | Baseline |
| Spiegel 2 | vLLM | EAGLE3 (old) | 183.4 | 74% | -6pp |
| Spiegel 2 | vLLM | EAGLE3 NST=1 | 170.7 | 72% | -8pp |
| Spiegel 2 | vLLM | EAGLE3 NST=3 | 176.4 | 72% | -8pp |
| Spiegel 2 | vLLM | EAGLE3 NST=5 | 157.9 | 68% | -12pp |
| Spiegel 2 | SGLang | none | 139.5 | 74% | Baseline |
| Spiegel 2 | SGLang | EAGLE3 | 84.1 | 70% | -4pp |
| Spiegel 2 | SGLang | DFlash | 166.8 | 70% | -4pp |

Degradation is stronger on Spiegel 2 (up to -12pp). Higher NST values make the problem worse.
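
For anyone who wants to recompute the Delta column or put the speed gain next to the quality loss, a small Python sketch follows. The three sample rows are copied from the NVFP4 table above; the helper names are made up for illustration.

```python
def pp_delta(baseline_pct, with_spec_pct):
    """Change in math score, in percentage points."""
    return with_spec_pct - baseline_pct


def speedup_pct(baseline_tok_s, with_spec_tok_s):
    """Relative throughput change of the speculative run vs its vanilla baseline."""
    return (with_spec_tok_s / baseline_tok_s - 1.0) * 100.0


# (label, baseline tok/s, baseline math %, speculative tok/s, speculative math %)
RUNS = [
    ("DGX vLLM EAGLE3 NST=1", 65.0, 74, 72.5, 72),
    ("Spiegel 2 vLLM EAGLE3 NST=5", 157.9, 80, 157.9, 68),
    ("Spiegel 2 SGLang DFlash", 139.5, 74, 166.8, 70),
]


def summarize(runs):
    """Return (label, speedup in %, math delta in pp) for each run."""
    return [
        (label, round(speedup_pct(base, spec), 1), pp_delta(base_math, spec_math))
        for label, base, base_math, spec, spec_math in runs
    ]


if __name__ == "__main__":
    for label, speed, delta in summarize(RUNS):
        print(f"{label}: {speed:+.1f}% tok/s, {delta:+d}pp math")
```

Putting both numbers side by side makes the trade-off visible: the NST=5 run on Spiegel 2 buys no speed at all and still loses 12 points.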


Summary

| Quant | + Speculation | Math | Conclusion |
|---|---|---|---|
| BF16 | EAGLE3 / DFlash | 78–80% | ✅ Identical to vanilla |
| FP8 | EAGLE3 / DFlash | 78–82% | ✅ Identical to vanilla |
| NVFP4 | EAGLE3 | 68–76% | ⚠️ 2–12pp loss vs 74–80% vanilla |
| NVFP4 | DFlash | 70–74% | ⚠️ 0–4pp loss |

Cause: the EAGLE3 drafter was trained on BF16 hidden states. NVFP4-quantized hidden states deviate from those → draft quality drops → erroneous tokens get accepted. Higher NST values make the problem worse because more tokens are accepted on the basis of faulty drafts.

Also, the RTX PRO 6000 performs better with NST>1 while the DGX is only good with NST=1. I did not mention it again, but it’s the minimum again.

This could help reduce the degradation in NVFP4, and the further degradation when speculative decoding is used on NVFP4. Is this widely adopted? Will any existing NVFP4 model be requantized?

| Criterion | AWQ | NV FP4 | INT4 AutoRound |
|---|---|---|---|
| Core Principle | Activation-aware scaling | Hardware-native FP4 | Optimized rounding via SignSGD |
| Quality | Very good | Decent | Very good to excellent |
| Calibration Effort | Low (few samples) | None | Medium (~200 iterations) |
| Compatibility | Broad (vLLM, TGI, etc.) | Blackwell only | GPTQ/AutoGPTQ compatible |
| Perplexity Retention | Very good | Good | Tends to be best |
| Inference Speed | Good | Fastest (native HW) | Good (GPTQ backend) |
| Ecosystem/Support | Broadest | NVIDIA stack only | Growing |
| Error Minimization | Very good | Weaker | Best |
| Recommendation | Best trade-off | Max throughput | Best quality |

Since the DGX Spark is already running NVFP4 dequantization through a shared memory path rather than dedicated tensor core units, that same compute path could theoretically be repurposed:

  • Store weights as INT4 with AWQ/AutoRound quantization parameters.
  • Use the shared memory dequantization path that’s already there for FP4.
  • Apply the smarter scaling factors from AWQ/AutoRound instead of the naive FP4 format.

This would essentially be a custom CUDA kernel that “hijacks” the existing memory bandwidth path but applies better quantization math. So if someone gets their hands on this, they could turn the DGX into magic.

My German is unfortunately a bit rusty

Numbers speak for themselves. Since I was labeled a “molt poster” for wanting to format, possibly translate, and polish things up — using AI, of course — I’d rather leave some things in their original form.

Source code is the best manual: vLLM can already process AutoRound INT4 out of the box.

The kernel path is:

AutoRoundConfig → dispatches to GPTQMarlinConfig → GPTQMarlinLinearMethod

Marlin loads INT4-packed weights from memory, dequantizes on-the-fly to FP16, computes via FP16 Tensor Cores (mma.sync.aligned.m16n8k16)

W4A16 is the default (INT4 weights → FP16 dequant → FP16 MMA). This works immediately without any patches.
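
As a rough picture of the W4A16 idea (pack two 4-bit weights per byte, expand them back to floating point right before the multiply), here is a small pure-Python sketch. It is not Marlin's implementation: the low-nibble-first packing, the single scale with a fixed zero point of 8, and the function names are all assumptions chosen for readability, and the real kernel uses its own memory layout and runs on the GPU.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    return bytes(
        (values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
        for i in range(0, len(values), 2)
    )


def unpack_int4(packed):
    """Inverse of pack_int4: one byte yields two 4-bit values."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)
        out.append(byte >> 4)
    return out


def dequantize(packed, scale, zero_point=8):
    """Expand packed INT4 weights to floats: w = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in unpack_int4(packed)]


def dot_w4a16(packed_weights, scale, activations):
    """Dequantize on the fly, then accumulate in floating point."""
    weights = dequantize(packed_weights, scale)
    return sum(w * a for w, a in zip(weights, activations))


if __name__ == "__main__":
    q = [0, 15, 8, 3, 12, 7, 9, 1]
    packed = pack_int4(q)
    print(len(q), "weights stored in", len(packed), "bytes")
    print(dequantize(packed, scale=0.5))
```

The memory traffic is that of 4-bit weights while the arithmetic stays in the wider type, which is why this path can do well on a bandwidth-limited machine.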

W4A8 (FP8 MMA) was also already possible on SM120 using VLLM_MARLIN_INPUT_DTYPE=fp8. The code in generate_kernels.py and marlin.cu already explicitly included SM120. Only SM121 (DGX Spark) was blocked due to == 120 instead of >= 120.

new patches:

Python: is_device_capability(120) → is_device_capability_family(120) (matches SM12.x)

C++: major*10+minor == 120 → major == 12
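
The effect of these two patches can be shown with a stand-alone Python sketch. These are not vLLM's functions; the two helpers below are hypothetical and only mirror the before/after conditions quoted above, taking the compute capability as a (major, minor) pair.

```python
def matches_sm120_exact(major, minor):
    """Old check: only compute capability 12.0 passes."""
    return major * 10 + minor == 120


def matches_sm12x_family(major, minor):
    """Patched check: any 12.x device passes, including 12.1 (DGX Spark)."""
    return major == 12


if __name__ == "__main__":
    devices = {"RTX PRO 6000": (12, 0), "DGX Spark GB10": (12, 1), "B200": (10, 0)}
    for name, (major, minor) in devices.items():
        print(
            name,
            "exact:", matches_sm120_exact(major, minor),
            "family:", matches_sm12x_family(major, minor),
        )
```

With the exact comparison the Spark falls through even though the kernels were already built for the SM12x family; the family check lets it through without admitting SM100.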

| Quant | Image | tok/s (medium) | tok/s (long) | Math (50) | Memory |
|---|---|---|---|---|---|
| INT4 W4A16 | vllm-next | 54.9 | 53.2 | 88% | 8 GiB |
| INT4 W4A8 | vllm-next | 55.1 | 54.2 | 88% | 8 GiB |
| NVFP4 | vllm-next | 43.1 | 42.0 | 86% | 8 GiB |
| FP8 | vllm-glm | 42.4 | 41.4 | 94% | 16 GiB |
| BF16 | vllm-glm | 27.3 | 26.7 | 94% | 32 GiB |

Tests with GLM 4.7 Flash

Yeah, Marlin kernel is very efficient on Spark, even though sm12x doesn’t support INT4 natively.