FP4 on DGX Spark — Why It Doesn't Scale Like You'd Expect

flash3 · February 9, 2026, 1:19pm

As baristankut and eugr have shown in their posts, and after a lot of benchmarking and digging into the architecture, here is what I would like to summarize (LLM-formatted, no molt).

FP16 → FP8: scales. Half the bytes, tensor cores handle it natively via the Ampere-style warp-level MMA path, real-world speedup ~1.3–1.7x. Solid.

FP8 → FP4: doesn’t scale. You’d expect another ~2x from halving the data again. But on SM12x three things have been cut compared to datacenter “Blackwell”:

No tcgen05. The 5th gen tensor core instructions that make FP4 fast on B200 (SM100) don’t exist on the Spark. No Tensor Memory, no native FP4 compute path. FP4 gets dequantized to FP8/FP16 before computation, so the memory bandwidth saving is lost.
99 KB shared memory. SM100 has 228 KB + 256 KB TMEM. The Spark has 99 KB. Smaller buffer means smaller GEMM tiles, more frequent reloads from the already limited 273 GB/s bus. This hits everything — attention, MoE, even speculative decoding — but FP4 hardest.
FP4 + small shared memory is the worst combo. It must be a feature.

And then there’s accuracy. NVIDIA has done impressive work with NVFP4’s micro-block scaling, but 4 bits is 4 bits. Others in the forum have already shown the quality loss is noticeable. My math tests show it too.
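To make "4 bits is 4 bits" concrete, here is a minimal sketch of micro-block quantization in the NVFP4 style. It is an illustration under stated assumptions, not NVIDIA's implementation: the block size of 16 and the E2M1 value grid follow the published format, but the block scale is kept as a plain Python float here (the real format stores it as FP8 E4M3 plus a per-tensor FP32 scale), and all function names are made up for this example.

```python
# Sketch of NVFP4-style micro-block quantization (illustrative only).
# Assumptions: block size 16, E2M1 magnitude grid, block scale kept as a float
# instead of FP8 E4M3 + per-tensor FP32 scale. Function names are hypothetical.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes an FP4 E2M1 value can take
BLOCK = 16


def quantize_block(values):
    """Return (scale, dequantized values) for one block of floats."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    out = []
    for v in values:
        # Round the scaled magnitude to the nearest representable grid point.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return scale, out


def quantize(values):
    """Quantize a flat list block by block and return the dequantized list."""
    result = []
    for i in range(0, len(values), BLOCK):
        _, deq = quantize_block(values[i:i + BLOCK])
        result.extend(deq)
    return result


if __name__ == "__main__":
    weights = [0.01 * i for i in range(-16, 16)]
    deq = quantize(weights)
    worst = max(abs(a - b) for a, b in zip(weights, deq))
    print(f"max abs error: {worst:.4f}")
```

Each value can only land on one of eight magnitudes per block, so the per-block scale helps with dynamic range but cannot recover the resolution that is lost inside a block.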

Use FP8 when the model fits. Use FP4 when it doesn’t.

For me it is no surprise anymore that you read “minimum” so often when it comes to DGX Spark and performance. It is the best minimum you can have.

Has anyone seen different NVFP4 speedups?

flash3 · February 9, 2026, 1:55pm

Some results without the CUTLASS fixes: 590 driver, 26.01 image, vLLM 0.15, SGLang 0.5.8.

DGX Spark (GB10, 273 GB/s LPDDR5x)

BF16:  273 / (3B * 2) = 273 / 6  =  45.5 tok/s
FP8:   273 / (3B * 1) = 273 / 3  =  91.0 tok/s
NVFP4: 273 / (3B * 0.5) = 273 / 1.5 = 182.0 tok/s   (won't work)

Quant	MM-Kernel	Theory	vLLM	vLLM+EAGLE3	SGLang	+EAGLE3	+DFlash
BF16 (2B)	cuBLAS+Triton MoE	45.5	28.8 [7]	29.0 [7]	31.7	16.4	17.6
FP8 online (1B)	Triton FP8 MoE	91	45.9 [M], 46.6 [C]		55.0	21.2	X [5]
W8A8 ct (1B)	—	91			X [1]
FP8 block (1B)	Triton FP8 MoE	91			60.1! [2]
NVFP4 (0.5B)	CUTLASS FP4 [3b]	182			X [3]

[M] = Marlin FP8 weight-only (no true FP8 compute on SM121)
[C] = CUTLASS FP8 scaled_mm (true FP8 compute, MoE still on Marlin)

Spiegel 2 (RTX PRO 6000, 1800 GB/s GDDR7)

BF16:  1800 / (3B * 2) = 1800 / 6  =  300 tok/s
FP8:   1800 / (3B * 1) = 1800 / 3  =  600 tok/s
NVFP4: 1800 / (3B * 0.5) = 1800 / 1.5 = 1200 tok/s

Quant	MM-Kernel	Theory	vLLM	vLLM+EAGLE3	SGLang	+EAGLE3
BF16 (2B)	cuBLAS / flashinfer	300	143.2	159.0	169.8	115.3
W8A8 ct (1B)	—	600	X [4]	X [4]
FP8 block (1B)	Marlin FP8 / flashinfer	600	125.9	182.2	X [8]	X [8]
NVFP4 (0.5B)	CUTLASS FP4	1200	121.4! [6]	176.2! [6]	136.4! [6]

Footnotes

[1] SGLang FusedMoE: “Unsupported scheme” for INT8 compressed-tensors
[2] Block-FP8 Triton kernel: 60 tok/s but 0% math; garbage output on SM121 (! = quality compromised)
[3] NVFP4 on SM121: unpatched → CUDA Graph crash “Unsupported SM version: 121”. With CUTLASS 4.4 + admissible_archs patch → 356 TFLOPS (BTankut)
[3b] CUTLASS FP4 on SM121 proven (BTankut); needs CUTLASS 4.4 + sm_121a patch + tile tuning
[4] W8A8 INT8: “Int8 not supported on SM120” (vLLM) / “Unsupported FusedMoe scheme” (SGLang). INT8 only on SM<100
[5] DFlash PR image has sgl-kernel 0.3.16 < required 0.3.20. FP8 online needs a newer sgl-kernel than the DFlash image provides
[6] NVFP4 on SM120 (RTX PRO 6000): 70-72% math (vs 80% for BF16/FP8). CUTLASS FP4 fallback, FlashInfer MoE kernels not available. Only ~10% of the theoretical 1200 tok/s. Vanilla 121.4, +EAGLE3 176.2 tok/s
[7] vLLM on DGX Spark: unified-memory profiler bug forces VLLM_USE_V1=0 --enforce-eager --gpu-memory-utilization 0.12 --max-model-len 4096. Not a fair comparison with SGLang (which has CUDA Graphs + a larger KV cache)
[8] SGLang FP8 block on SM120: DeepGEMM “Unknown recipe” + FlashInfer FP8 “capability 120 not supported”. No working FP8 block-GEMM kernel on SM120 in SGLang
[9] SGLang NVFP4+EAGLE3: ModelOpt NVFP4 loader incompatible with the accelerate-based EAGLE3 loader. “state dictionary corrupted” because ModelOpt FP4 keys are not recognized

eugr · February 9, 2026, 6:41pm

I’d correct this to “use AWQ when it doesn’t”. AWQ (INT4) quants are fast, and quality loss is very small as it keeps activations at FP16 and uses calibration-derived scales for the quantized weights.

flash3 · February 9, 2026, 6:50pm

Another follow-up: I fixed FlashInfer with CUTLASS 4.3+ (HEAD) to run FP8, so the FP8 → FP4 step is now more comparable because both use the same matrix-multiplication technology.

Triton FP8 MoE is a bit slower, so FP4 using CUTLASS looks like a performance gain, but it is not.

eugr · February 9, 2026, 7:04pm

As far as I understand, consumer Blackwell (sm120/sm121) has native FP4 compute, just no tcgen05.
Instead of tensor memory, you can use shared memory or CUDA registers.

@johnny_nv - feel free to chime in, you’ve spent a lot of time with this.

flash3 · February 9, 2026, 7:31pm

Which component handles it? It supports NVFP4, of course, but this does not improve memory throughput (everything you save is lost again in compute cycles). And decoding in the even smaller shared memory is just more compute, after all.

eugr · February 9, 2026, 9:52pm

Tensor/CUDA cores support FP4 natively, so it should be possible to avoid dequantization. TBF, I’m not an expert in CUDA programming, so I’ll let others chime in :)

christopher_owen · February 9, 2026, 11:01pm

Loading weights and activations using a smaller dtype is a big win on the Spark. We are often bandwidth limited.

I don’t know the sm121 ISA (I couldn’t find it online), but I did spend a minor amount of time reading SASS in cuda-gdb and quite a lot of time in the flashinfer and cutlass codebases.

There is native support for activations in fp8 and weights in fp4, at least. Not all combinations of activations and weights are available in flashinfer and cutlass. I’m not 100% sure whether it’s a limitation of the hardware or of flashinfer/cutlass, but I assume it’s hardware.

I believe it is true that there is no TMEM (tcgen05) support in the sm121.

flash3 · February 10, 2026, 7:21am

After fixing CUTLASS, FlashInfer and sgl_kernel:

Goal

Complete comparison of all combinations of engine, speculation, quantization and platform
for Qwen3-Coder-30B-A3B-Instruct (MoE 30B, 3B active).

Reference

The DFlash author tests with RTX PRO 6000 + SGLang + LLaMA-3.1-8B: good throughput results,
but problems with math accuracy under DFlash (the model no longer calculates correctly).
BTankut: GLM-4.7-FP8 on a DGX Spark cluster with EAGLE3 + SGLang, 20-27% speedup.

Platforms

System	GPU	VRAM	Bandwidth	Arch
DGX Spark	GB10	128 GB unified	273 GB/s LPDDR5x	SM121, aarch64
Spiegel 2	RTX PRO 6000	96 GB GDDR7	1800 GB/s	SM120, x86_64

Test matrix

DGX Spark (SM121, 273 GB/s)

#	Engine	Speculation	Quant	MM-Kernel	Status	tok/s	Math
1	SGLang	none	BF16	cuBLAS+Triton MoE	done	31.7	78%
2	SGLang	EAGLE3	BF16	cuBLAS+Triton MoE	done	16.4	78%
3	SGLang	DFlash	BF16	cuBLAS+Triton MoE	done	17.6	78%
4	SGLang	none	FP8 (block)	Triton FP8 MoE	done	60.1	0% GARBAGE
4b	SGLang	none	FP8 (online)	Triton FP8 MoE	done	55.0	78%
4c	SGLang	EAGLE3	FP8 (online)	Triton FP8 MoE	done	21.2	78%
4d	SGLang	DFlash	FP8 (online)	Triton FP8 MoE	X [5]
5	SGLang	none	W8A8 (ct)	-	X [1]
6	SGLang	EAGLE3	W8A8 (ct)	-	X [1]
7	SGLang	DFlash	W8A8 (ct)	-	X [1]
4e	vLLM	none	FP8 (online)	Marlin FP8 w-only	done	45.9	78%
4f	vLLM	none	FP8 (online)	CUTLASS scaled_mm	done	46.6	80%
8	SGLang	none	NVFP4	CUTLASS FP4 [3b]	blocked [3]
9	SGLang	EAGLE3	NVFP4	CUTLASS FP4 [3b]	blocked [3]
10	SGLang	DFlash	NVFP4	CUTLASS FP4 [3b]	blocked [3]
35	vLLM-next	none	BF16	TRITON MoE+FLASHINFER	done	30.6	80%
36	vLLM-next	EAGLE3	BF16	TRITON MoE+FLASHINFER	done	28.5	80%
37	vLLM-next	none	FP8 (online)	CUTLASS scaled_mm+TRITON FP8 MoE	done	50.5	78%
38	vLLM-next	EAGLE3	FP8 (online)	CUTLASS scaled_mm+TRITON FP8 MoE	done	51.0	78%
39	vLLM-next	none	NVFP4	FLASHINFER_CUTLASS	done [13]	65.0	74%
40	vLLM-next	EAGLE3	NVFP4	FLASHINFER_CUTLASS	done [13]	68.1	72%
41	SGLang-next	none	BF16	cuBLAS+flashinfer	done	31.1	78%
42	SGLang-next	EAGLE3	BF16	cuBLAS+flashinfer	done [14]	13.1	78%
43	SGLang-next	DFlash	BF16	cuBLAS+flashinfer	done [14]	20.5	78%
44	SGLang-next	none	FP8 (online)	Triton FP8 MoE	done	52.7	80%
45	SGLang-next	EAGLE3	FP8 (online)	Triton FP8 MoE	done [14]	24.5	78%
46	SGLang-next	DFlash	FP8 (online)	Triton FP8 MoE	done [14]	41.4	80%
47	SGLang-next	none	NVFP4	CUTLASS FP4 SM120	done [19]	66.0	74%
48	SGLang-next	EAGLE3	NVFP4	CUTLASS FP4 SM120	done [14][19]	27.4	70%
49	SGLang-next	DFlash	NVFP4	CUTLASS FP4 SM120	done [19]	54.4	74%

Spiegel 2 (SM120, 1800 GB/s)

#	Engine	Speculation	Quant	MM-Kernel	Status	tok/s	Math
11	vLLM	none	BF16	cuBLAS	done	143.2	80%
12	vLLM	EAGLE3	BF16	cuBLAS	done	159.0	80%
13	vLLM	none	FP8 (block)	Marlin FP8	done	125.9	80%
14	vLLM	EAGLE3	FP8 (block)	Marlin FP8	done	182.2	80%
15	vLLM	none	W8A8 (ct)	-	X [4]
16	vLLM	EAGLE3	W8A8 (ct)	-	X [4]
17	SGLang	none	BF16	cuBLAS+flashinfer	done	169.8	78%
17b	SGLang-next	none	BF16	cuBLAS+flashinfer	done	169.0	78%
18	SGLang	EAGLE3	BF16	cuBLAS+flashinfer	done	115.3	78%
18b	SGLang-next	EAGLE3	BF16	cuBLAS+flashinfer	done	80.2	78%
19	SGLang-next	DFlash	BF16	cuBLAS+flashinfer	done	117.3	78%
20	SGLang	none	FP8 (block)	flashinfer FP8	X [8]
21	SGLang	EAGLE3	FP8 (block)	flashinfer FP8	X [8]
22	SGLang	DFlash	FP8 (block)	flashinfer FP8	X [8]
23	SGLang	none	W8A8 (ct)	-	X [4]
24	SGLang	EAGLE3	W8A8 (ct)	-	X [4]
25	SGLang	DFlash	W8A8 (ct)	-	X [4]
26	SGLang	none	NVFP4	CUTLASS FP4 SM120	done	136.4	72%
27	SGLang	EAGLE3	NVFP4	CUTLASS FP4 SM120	X [9]
28	SGLang	DFlash	NVFP4	CUTLASS FP4 SM120	pending
29	vLLM-next	none	BF16	TRITON MoE+FLASH_ATTN	done	140.9	80%
30	vLLM-next	EAGLE3	BF16	TRITON MoE+FLASH_ATTN	done	147.4	80%
31	vLLM-next	none	FP8 (block)	TRITON FP8 MoE	done	135.7	78%
32	vLLM-next	EAGLE3	FP8 (block)	TRITON FP8 MoE	done	166.5	80%
33	vLLM-next	none	NVFP4	FLASHINFER_CUTLASS	done	157.9	80%
34	vLLM-next	EAGLE3	NVFP4	FLASHINFER_CUTLASS	done	183.4	74%
50	SGLang-next	none	FP8 (block)	flashinfer FP8	X [8]
51	SGLang-next	EAGLE3	FP8 (block)	flashinfer FP8	X [8]
52	SGLang-next	DFlash	FP8 (block)	flashinfer FP8	X [8]
53	SGLang-next	none	NVFP4	CUTLASS FP4 SM120	done [17]	139.5	74%
54	SGLang-next	EAGLE3	NVFP4	CUTLASS FP4 SM120	done [14][17]	84.1	70%
55	SGLang-next	DFlash	NVFP4	CUTLASS FP4 SM120	done [17]	166.8	70%
56	SGLang-next	none	FP8 (online)	-	X [15]
57	SGLang-next	EAGLE3	FP8 (online)	-	X [15]
58	SGLang-next	DFlash	FP8 (online)	-	X [15]

Quantization formats

Abbreviation	Format	quant_method	Bytes/Param	Model path
BF16	BFloat16	-	2	Qwen3-Coder-30B-A3B-Instruct
FP8 (block)	Block FP8 [128,128]	fp8 (native HF)	1	qwen3-coder-30B-fp8
FP8 (online)	Per-tensor dynamic FP8	--quantization fp8	1	Qwen3-Coder-30B-A3B-Instruct (BF16 + online quant)
W8A8 (ct)	Per-channel INT8	compressed-tensors	1	Qwen3-Coder-30B-A3B-Instruct-W8A8
NVFP4	NVIDIA FP4	modelopt	0.5	Qwen3-Coder-30B-A3B-Instruct-FP4

GEMM kernel libraries by quantization

Which library performs the matrix multiplication (GEMM) for which quantization format?

Quant	Kernel-Lib	Description	SM90 (Hopper)	SM100 (Blackwell)	SM121 (GB10)
BF16	cuBLAS	Standard NVIDIA BLAS	Yes	Yes	Yes
BF16	Triton MoE	Triton Fused MoE Kernel	Yes	Yes	Yes
FP8	DeepGEMM	JIT FP8 GEMM (DeepSeek), needs tcgen05.mma	Yes	SM100 DC only	No
FP8	CUTLASS FP8	NVIDIA CUTLASS	Yes	Yes	Possible (SM120 support)
FP8	Triton FP8	Triton fallback (slow)	Yes	Yes	Yes
FP8	torch._scaled_mm	PyTorch native fallback	Yes	Yes	Yes
W8A8 INT8	CUTLASS INT8	NVIDIA CUTLASS	Yes	No	No
NVFP4	FlashInfer FP4 MoE	FlashInfer + CUTLASS FP4	?	Should	Yes [13]
NVFP4	CUTLASS FP4	CUTLASS Blockscaled NVFP4	Yes	Yes	Possible (SM120 support)
NVFP4	CuTe DSL GEMM	Python-based CUTLASS kernels	?	Yes	Possible (SM120 support)
NVFP4	Marlin FP4	Marlin Kernel	Yes	?	No

Current status on our platforms

Platform	Active FP8 kernel	Active NVFP4 kernel	Consequence
Spiegel 2 (SM120)	Triton FP8 (fallback!)	CUTLASS FP4 (fallback!)	FP8 slower than BF16, NVFP4 only 10% of theory
DGX Spark (SM121)	Triton FP8	FLASHINFER_CUTLASS [13]	FP8 online + NVFP4 CUTLASS works! 65.0 tok/s

Strategic assessment: CUTLASS vs DeepGEMM

Criterion	DeepGEMM	CUTLASS
SM120 support	No (`tcgen05.mma` missing)	Yes (SM120 family)
SM121 support	No (`tcgen05.mma` missing)	Yes (with patch)
FP8 GEMM	Yes (SM90+SM100 DC only)	Yes (SM90+SM100+SM120+SM121)
NVFP4 GEMM	No	Yes (blockscaled dense, grouped, sparse)
Barrier to entry	Low (JIT)	Lowered by CuTe DSL (Python-based)
vLLM integration	Yes	Yes
SGLang integration	Yes	Yes (CuteDSL GEMM replaces FP8 DeepGEMM for FP4)

Conclusion: CUTLASS is the only viable path for SM120 and SM121. DeepGEMM works ONLY on SM90 (Hopper) and SM100 (datacenter Blackwell).
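The assessment above boils down to a small routing rule. The sketch below encodes only that table; the function name and the returned strings are hypothetical and are not part of any vLLM or SGLang API. It also ignores the patches and failed attempts described in the footnotes.

```python
# Sketch: choose a GEMM library from the compute capability, following the
# CUTLASS vs DeepGEMM table. Names and return values are hypothetical.

DEEPGEMM_ARCHS = {(9, 0), (10, 0)}                   # SM90 (Hopper), SM100 (datacenter Blackwell)
CUTLASS_ARCHS = {(9, 0), (10, 0), (12, 0), (12, 1)}  # SM121 needs the admissible_archs patch


def pick_gemm_backend(capability, quant):
    """capability: (major, minor), e.g. (12, 1) for GB10; quant: 'fp8' or 'nvfp4'."""
    # DeepGEMM only covers FP8 and relies on instructions missing on SM120/SM121.
    if quant == "fp8" and capability in DEEPGEMM_ARCHS:
        return "deepgemm"
    if capability in CUTLASS_ARCHS:
        return "cutlass"
    return "triton"  # generic fallback


if __name__ == "__main__":
    for cap in [(9, 0), (10, 0), (12, 0), (12, 1)]:
        print(cap, pick_gemm_backend(cap, "fp8"), pick_gemm_backend(cap, "nvfp4"))
```

On a real system the capability tuple could come from `torch.cuda.get_device_capability()`, which returns `(major, minor)` for the current device.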

DeepGEMM result

DeepGEMM tested on both platforms; it fails on both:

SM120 (RTX PRO 6000): tcgen05.mma not supported on .target sm_120a
SM121 (DGX Spark GB10): tcgen05.mma not supported on .target sm_121a
tcgen05.mma/fence = Tensor Core Gen 5 instructions, exclusive to SM100 datacenter
Software patches (arch routing, symbol extraction) work, but the GEMM kernels themselves do not compile

SM120/SM121 architecture detail (corrected following BTankut)

SM120/SM121 have no tcgen05, no multicast, no 2-SM MMA (CTA pairs).
The tensor cores are capable (356 TFLOPS NVFP4, 188 TFLOPS FP8) but architecturally
different from SM100 (datacenter Blackwell).

Differences from SM100:

tcgen05.mma PTX instruction: not supported on sm_120a/sm_121a
No multicast, no CTA pairs, TN layout only, no PDL/GDC
SMEM limits: 101,376 bytes/SM (≈100 KB, same as RTX 4090) vs 228 KB on SM100
Cluster fixed at 1x1x1

How does CUTLASS work then?
CUTLASS 3x (CuTe API) has its own SM120 kernel paths (e.g. gemm_grouped_sm120_M128_BS_group1),
which drive the SM120 tensor cores through the CuTe abstraction, without tcgen05.mma PTX.
Needs sm_121a in BlockScaledMmaOp.admissible_archs (CUTLASS Issue #2800).

Consequence:

CUTLASS 3x (CuTe API): works on SM120/SM121 (FP8, NVFP4, BF16 grouped GEMM)
DeepGEMM (inline PTX tcgen05.mma): fails on SM120/SM121
Triton: treats SM121 as SM80 (fallback)
Default MoE configs of SGLang/vLLM: crash on GB10 (need ~147 KB SMEM, only 101 KB available)

CUTLASS SM120/SM121 prerequisites (BTankut analysis)

Based on BTankut's work (GitHub: BTankut/dgx-spark-sglang-moe-configs) for SM121:

Shared memory limits, the central constraint:

Platform	SM	SMEM/SM	SMEM/Block	SMs	Consequence
DGX Spark (GB10)	SM121	101,376 B (~99 KB)	48 KB	48	Default tiles crash
Spiegel 2 (RTX PRO 6000)	SM120	101,376 B (~99 KB)	48 KB	188	Same limits!
Datacenter (B200)	SM100	228 KB	164 KB	168	Default tiles fit

4 problems that need to be solved:

#	Problem	Description	Solution
1	CUTLASS does not recognize SM121	`BlockScaledMmaOp.admissible_archs` does not include SM121	One-line patch: add `sm_121a` (CUTLASS Issue #2800)
2	Default tile sizes exceed SMEM	SGLang/vLLM heuristics assume 128-228 KB, SM120/121 has only about 100 KB	Tile sweep within the 100 KB budget
3	No MoE configs for GB10	SGLang has device-specific config files; none existed for NVIDIA_GB10	4 config files generated (2 per Triton version)
4	EAGLE crashes without tuned configs	EAGLE speculative decoding produces larger batches → triggers SMEM overflow	Tuned configs are a prerequisite for EAGLE, not just an optimization

BTankut results (SM121, GLM-4.7-FP8, 4x DGX Spark TP=4):

Scenario	MoE Configs	EAGLE	Result
A	Optimized	Off	16.77 tok/s
B	Default	Off	15.77 tok/s (-6.3%)
C	Default	On	OutOfResources CRASH
D	Optimized	On	20-27 tok/s

CUTLASS GEMM peak performance (SM121, BTankut):

NVFP4 dense: 356 TFLOPS (71% of the 500 TFLOPS dense FP4 peak; the 1 PFLOPS spec includes sparsity)
FP8 dense: 188 TFLOPS
MoE grouped GEMM (8 & 64 experts): 120-154 TFLOPS (tile-dependent)
- Tile 256×128: 154 TFLOPS (optimal for prefill / large batches)
- Tile 128×128: ~147 TFLOPS (optimal for decode / small batches)

Recipe for SM120/SM121: CUTLASS 4.4 + admissible_archs patch + tile tuning (101 KB SMEM budget)
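Why tile tuning is forced by the shared memory budget can be shown with a rough footprint estimate. This is a simplified sketch, assuming each pipeline stage holds one A tile and one B tile and nothing else; real kernels also need room for scale factors, epilogue buffers and padding, so actual limits are tighter. The function names and the candidate tile sizes are made up for illustration.

```python
# Sketch: rough shared-memory footprint of a pipelined GEMM tile.
# Assumption: per stage, one A tile (BLOCK_M x BLOCK_K) and one B tile
# (BLOCK_K x BLOCK_N); scale factors, epilogue and padding are ignored.

SMEM_BUDGET_BYTES = 101_376  # SM120/SM121 figure quoted in the thread


def tile_smem_bytes(block_m, block_n, block_k, bytes_per_elem, num_stages):
    a_tile = block_m * block_k * bytes_per_elem
    b_tile = block_k * block_n * bytes_per_elem
    return int((a_tile + b_tile) * num_stages)


def fits(block_m, block_n, block_k, bytes_per_elem, num_stages, budget=SMEM_BUDGET_BYTES):
    return tile_smem_bytes(block_m, block_n, block_k, bytes_per_elem, num_stages) <= budget


def sweep(bytes_per_elem, budget=SMEM_BUDGET_BYTES):
    """List (block_m, block_n, block_k, stages) candidates that stay within the budget."""
    ok = []
    for bm in (64, 128, 256):
        for bn in (64, 128, 256):
            for bk in (64, 128):
                for stages in (2, 3, 4):
                    if fits(bm, bn, bk, bytes_per_elem, stages, budget):
                        ok.append((bm, bn, bk, stages))
    return ok


if __name__ == "__main__":
    # FP8 operands (1 byte each), 128x128x128 tile with 4 pipeline stages.
    print(tile_smem_bytes(128, 128, 128, 1, 4), "bytes needed")
    print("fits on SM120/SM121:", fits(128, 128, 128, 1, 4))
    print("fits in 228 KB:", fits(128, 128, 128, 1, 4, budget=228 * 1024))
```

Under these assumptions a 4-stage FP8 128×128×128 tile needs 131,072 bytes, which fits a 228 KB budget but not 101,376 bytes, while dropping to 3 stages brings it under the limit.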

BTankut Docker image: ghcr.io/btankut/sglang-spark-glm47:latest (MoE configs + patches preinstalled)

What is missing / what would help

Action	Expected effect
~~DeepGEMM on Spiegel 2~~	FAILED: `tcgen05.mma` not on SM120a
~~DeepGEMM on DGX Spark~~	FAILED: `tcgen05.mma` not on SM121a
vLLM 26.01 on Spiegel 2	Image pull fails due to NIC corruption (tls: bad record MAC)
~~FP8 CUTLASS per-tensor on SM120~~	FAILED: adding `DataType.e4m3` to SM120 does not compile. SM120 kernel templates use the block-scaled collective builder, FP8×FP8 needs the standard builder → CUTLASS template error. Python gates (5 patches) work, FLASHINFER_CUTLASS is selected, but JIT kernel compilation fails. Needs a dedicated SM120 FP8×FP8 kernel template
~~CUTLASS FP8 block-scale on SM120~~	Discarded: `CutlassFp8BlockScaleGemmRunner` is Hopper-only, not portable without TRT-LLM kernel work
Enable CUTLASS FP8/NVFP4 on SM121	BTankut configs already available, integration into SGLang/vLLM. BTankut: “CUTLASS 4.4.0 with sm_121a target works”
Build sgl_kernel for SM120	SGLang on Spiegel 2 needs SM120 kernels (pip version only has SM100)
FlashInfer FP4 MoE on SM120	NVFP4 significantly faster (>300 tok/s instead of 121)

Theoretical limits (zero context, single request)

Platform	BF16	FP8/W8A8	NVFP4
DGX Spark (273 GB/s)	45.5 tok/s	91 tok/s	182 tok/s
Spiegel 2 (1800 GB/s)	300 tok/s	600 tok/s	1200 tok/s

Draft models

Drafter	Path	Size
EAGLE3	/data/tensordata/SGLang-EAGLE3-Qwen3-Coder-30B-A3B	~500 MB
DFlash	/data/tensordata/Qwen3-Coder-30B-A3B-DFlash	~900 MB

Result tables (goal)

In the end there are two tables, one per platform. All values in tok/s (long, ~400 tokens).

Theory formula (memory-bound, single request, zero context)

tok/s = Bandwidth / (active_parameters * bytes_per_parameter)

Model: Qwen3-Coder-30B-A3B-Instruct (MoE, 3B active parameters per token)
BF16: 2 bytes/param → 3B * 2 B = 6 GB per forward pass
FP8/W8A8: 1 byte/param → 3B * 1 B = 3 GB per forward pass
NVFP4: 0.5 bytes/param → 3B * 0.5 B = 1.5 GB per forward pass
Context/KV-cache overhead not taken into account here (zero context)
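The formula is easy to reproduce. The sketch below recomputes the theoretical limits and shows what fraction of them a measured value reaches; the function names are made up for this example, and the measured numbers are the ones reported in this thread.

```python
# Sketch of the memory-bound upper limit: every generated token has to stream
# all active weights once. KV-cache traffic is ignored (zero context).

def theoretical_tok_s(bandwidth_gb_s, active_params_b, bytes_per_param):
    """bandwidth in GB/s, active parameters in billions, returns tokens per second."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)


def fraction_of_theory(measured_tok_s, theory_tok_s):
    return measured_tok_s / theory_tok_s


if __name__ == "__main__":
    platforms = {"DGX Spark": 273, "Spiegel 2": 1800}
    formats = {"BF16": 2, "FP8": 1, "NVFP4": 0.5}
    for name, bandwidth in platforms.items():
        for fmt, nbytes in formats.items():
            limit = theoretical_tok_s(bandwidth, 3, nbytes)
            print(f"{name:10s} {fmt:6s} {limit:7.1f} tok/s")

    # Measured NVFP4 values from the thread: 121.4 tok/s (Spiegel 2), 65.0 tok/s (DGX Spark).
    print(f"{fraction_of_theory(121.4, 1200):.1%} of theory on Spiegel 2")
    print(f"{fraction_of_theory(65.0, 182):.1%} of theory on DGX Spark")
```

The gap to the limit grows as the format gets smaller, which is the core observation of the thread: fewer bytes per parameter raise the ceiling, but the measured numbers do not follow it.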

Legend:

X = broken/unusable (does not start, or garbage)
! = math plausibility test failed (tok/s value comes with compromised quality)
Blank = not yet tested

DGX Spark (GB10, 273 GB/s LPDDR5x)

BF16:  273 / (3B * 2) = 273 / 6  =  45.5 tok/s
FP8:   273 / (3B * 1) = 273 / 3  =  91.0 tok/s
NVFP4: 273 / (3B * 0.5) = 273 / 1.5 = 182.0 tok/s

Quant	MM-Kernel	Theory	vLLM	vLLM+E3	vLLM-next	+E3	SGLang	+EAGLE3	+DFlash	SG-next [14]	+E3 [14]	+DF [14]
BF16 (2B)	cuBLAS / TRITON MoE	45.5	28.8 [7]	29.0 [7]	30.6	28.5	31.7	16.4	17.6	31.1	13.1	20.5
FP8 online (1B)	CUTLASS+TRITON FP8 MoE	91	45.9 [M], 46.6 [C]		50.5	51.0	55.0	21.2	X [5]	52.7	24.5	41.4
W8A8 ct (1B)	—	91					X [1]
FP8 block (1B)	Triton FP8 MoE	91					60.1! [2]
NVFP4 (0.5B)	FLASHINFER_CUTLASS / CUTLASS FP4	182			65.0 [13]	68.1 [13]	X [3]			66.0! [19]	27.4! [14][19]	54.4! [19]

[M] = Marlin FP8 weight-only (no true FP8 compute on SM121)
[C] = CUTLASS FP8 scaled_mm (true FP8 compute, MoE still on Marlin)

Spiegel 2 (RTX PRO 6000, 1800 GB/s GDDR7)

BF16:  1800 / (3B * 2) = 1800 / 6  =  300 tok/s
FP8:   1800 / (3B * 1) = 1800 / 3  =  600 tok/s
NVFP4: 1800 / (3B * 0.5) = 1800 / 1.5 = 1200 tok/s

Quant	MM-Kernel	Theory	vLLM	vLLM+E3	vLLM-next	+E3	SGLang	+EAGLE3	SG-next [14]	+E3 [14]	+DF [14]
BF16 (2B)	cuBLAS / TRITON MoE [10] / flashinfer	300	143.2	159.0	140.9 [10]	147.4 [10]	169.8	115.3	169.0	80.2	117.3
W8A8 ct (1B)	—	600	X [4]	X [4]
FP8 block (1B)	Marlin→Triton [10]	600	125.9	182.2	135.7 [10]	166.5 [10]	X [8]	X [8]	X [8]	X [8]	X [8]
FP8 online (1B)	—	600							X [15]	X [15]	X [15]
NVFP4 (0.5B)	CUTLASS→FI_CUTLASS [10]	1200	121.4! [6]	176.2! [6]	157.9 [10]	183.4! [10]	136.4! [6]		139.5! [17]	84.1! [14][17]	166.8! [17]

Footnotes

[1] SGLang FusedMoE: “Unsupported scheme” for INT8 compressed-tensors
[2] Block-FP8 Triton kernel: 60 tok/s but 0% math; garbage output on SM121 (! = quality compromised)
[3] NVFP4 on SM121: unpatched → CUDA Graph crash “Unsupported SM version: 121”. With CUTLASS 4.4 + admissible_archs patch → 356 TFLOPS (BTankut)
[3b] CUTLASS FP4 on SM121 proven (BTankut); needs CUTLASS 4.4 + sm_121a patch + tile tuning
[4] W8A8 INT8: “Int8 not supported on SM120” (vLLM) / “Unsupported FusedMoe scheme” (SGLang). INT8 only on SM<100
[5] DFlash PR image has sgl-kernel 0.3.16 < required 0.3.20. FP8 online needs a newer sgl-kernel than the DFlash image provides
[6] NVFP4 on SM120 (RTX PRO 6000): 70-72% math (vs 80% for BF16/FP8). CUTLASS FP4 fallback, FlashInfer MoE kernels not available. Only ~10% of the theoretical 1200 tok/s. Vanilla 121.4, +EAGLE3 176.2 tok/s
[7] vLLM on DGX Spark: unified-memory profiler bug forces VLLM_USE_V1=0 --enforce-eager --gpu-memory-utilization 0.12 --max-model-len 4096. Not a fair comparison with SGLang (which has CUDA Graphs + a larger KV cache)
[8] SGLang FP8 block on SM120: DeepGEMM “Unknown recipe” + FlashInfer FP8 “capability 120 not supported”. No working FP8 block-GEMM kernel on SM120 in SGLang
[9] SGLang NVFP4+EAGLE3: ModelOpt NVFP4 loader incompatible with the accelerate-based EAGLE3 loader. “state dictionary corrupted” because ModelOpt FP4 keys are not recognized
[10] vLLM-next = localhost/vllm-next (based on nvcr.io/nvidia/vllm:26.01-py3, vLLM v0.1.dev1). Backends according to the container logs: BF16: MoE=TRITON (Unquantized), Attn=FLASH_ATTN. FP8: MoE=TRITON FP8, Attn=FLASH_ATTN. NVFP4: MoE=FLASHINFER_CUTLASS, Dense=FLASHINFER_CUTLASS, Attn=FLASHINFER. NVFP4 +30% vs old (157.9 vs 121.4), BF16 about the same (140.9 vs 143.2)
[11] FP8 CUTLASS on SM120: FAILED in 2 ways:
- Root cause: FlashInfer generate_kernels.py:749 does not instantiate FP8×FP8 kernels for SM120. supported_dtypes = [e2m1, (e4m3, e2m1)], i.e. only FP4×FP4 and FP8×FP4. SM100 has DataType.e4m3 (FP8×FP8), SM120 does not.
- Path A (FP8 per-tensor, attempted): 5 Python patches in vLLM/FlashInfer succeeded: backend selection, QuantScheme, import, activation-scales default (1.0), dtype fix. FLASHINFER_CUTLASS is selected correctly, the model loads (29 GB, 21 s). JIT compilation fails: the SM120 block-scaled collective builder is template-incompatible with FP8×FP8. Error: sm90_gemm_tma_warpspecialized_cooperative.hpp: "ProblemShape should be <M,N,K> or <M,N,K,L>". Needs a dedicated SM120 FP8×FP8 kernel template (C++/CUTLASS development).
- Path B (FP8 block-scale, discarded): CutlassFp8BlockScaleGemmRunner is Hopper-only, not portable without TRT-LLM kernel work.
- Conclusion: FP8 CUTLASS on SM120 is not possible without a new C++ kernel template. TRITON FP8 remains the only fallback (135.7 tok/s vanilla, 166.5 tok/s +EAGLE3)
[12] NVFP4 on SM121 (DGX Spark), ORIGINAL ERROR: FlashInfer CUTLASS JIT compiles with sm_121a (__CUDA_ARCH__=1210), but the SM120 CUTLASS guards check __CUDA_ARCH__==1200. CUTLASS_ARCH_MMA_SM120_ENABLED is NOT set → some kernel templates generate invalid instructions. Symptoms: “Failed to initialize cutlass TMA WS grouped gemm” + “CUDA error: illegal instruction”. SOLVED with FLASHINFER_CUDA_ARCH_LIST="12.0a 12.1a": dual-arch compilation generates both SM120 and SM121 binaries, and the CUDA runtime picks the correct binary per kernel. See [13].
[13] NVFP4 CUTLASS on SM121 WORKS with the FLASHINFER_CUDA_ARCH_LIST="12.0a 12.1a" env var. It compiles the FlashInfer CUTLASS JIT for both arch targets at once. SM121-specific kernels (FP4 dense GEMM) use sm_121a, SM120 CUTLASS templates (MoE grouped GEMM) use sm_120a. First start takes ~10 min (JIT compilation), cached afterwards. Result: 65.0 tok/s, 74% math, the fastest configuration on DGX Spark (+29% vs FP8 online 50.5 tok/s). 74% math is expected for NVFP4 (Spiegel 2 has 72-80%).
[14] SGLang-next = localhost/sglang-next (DFlash PR #16818, sgl-kernel 0.3.21, vLLM 26.01 base). SGLang speculative decoding is massively slower than vanilla on ALL platforms and quantizations. Spiegel 2 BF16: EAGLE3 -53% (80.2 vs 169.0), DFlash -31% (117.3 vs 169.0). DGX BF16: EAGLE3 -58% (13.1 vs 31.1), DFlash -34% (20.5 vs 31.1). DGX FP8: EAGLE3 -54% (24.5 vs 52.7), DFlash -21% (41.4 vs 52.7). For comparison: vLLM-next EAGLE3 gives +5% (NVFP4) and +1% (FP8) on DGX. SGLang's scheduling overhead for speculation is prohibitive.
[15] SGLang-next FP8 online on Spiegel 2: OOM. The BF16 model (57 GB) is loaded, and online FP8 quantization consumes a total of 85 GB of the 96 GB GPU memory.
[16] SGLang-next NVFP4 (old): ModelOpt routing bug. Fix: set quant_method in config.json to modelopt_fp4. On SM120 (Spiegel 2) NVFP4 works afterwards (see [17]). On SM121 (DGX) sgl_kernel fails (see [18]).
[17] SGLang-next NVFP4 on SM120 (Spiegel 2): WORKS after the quant_method: "modelopt_fp4" fix in config.json. CUTLASS FP4 SM120 kernels from sgl_kernel. Vanilla 139.5 tok/s (74%), DFlash 166.8 tok/s (70%), EAGLE3 84.1 tok/s (70%). DFlash is +20% faster than vanilla: the only platform/engine combination where SGLang speculation helps with NVFP4!
[18] (obsolete, see [19]) SGLang-next NVFP4 on SM121 failed at sgl_kernel.cutlass_fp4_group_mm: Unsupported SM version: 121. Solved with an LD_PRELOAD shim.
[19] SGLang-next NVFP4 on SM121 (DGX): WORKS with LD_PRELOAD sm120_shim.so (~/next/sm120_shim.cpp). The shim intercepts cudaDeviceGetAttribute and reports SM12.0 instead of SM12.1. The CUTLASS FP4 kernels in sgl_kernel are compiled for sm_120a and run correctly on SM121 (same tensor core architecture). Additionally needs quant_method: "modelopt_fp4" in config.json [17]. Podman flags: -v ~/next/sm120_shim.so:/opt/sm120_shim.so:ro -e LD_PRELOAD=/opt/sm120_shim.so. Result: vanilla 66.0 tok/s (74%), DFlash 54.4 tok/s (74%), EAGLE3 27.4 tok/s (70%). Vanilla on par with vLLM-next (65.0)!
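The percentages quoted in footnote [14] are plain relative changes against the vanilla run. A small sketch to recompute them from the tok/s pairs given there (the helper name is made up for this example):

```python
# Sketch: relative change of a speculative-decoding run against its vanilla baseline.

def relative_change_pct(value, baseline):
    """Positive means faster than the baseline, negative means slower."""
    return (value / baseline - 1.0) * 100.0


if __name__ == "__main__":
    # (label, tok/s with speculation, tok/s vanilla), all taken from footnote [14].
    pairs = [
        ("Spiegel 2 BF16 EAGLE3", 80.2, 169.0),
        ("Spiegel 2 BF16 DFlash", 117.3, 169.0),
        ("DGX BF16 EAGLE3", 13.1, 31.1),
        ("DGX BF16 DFlash", 20.5, 31.1),
        ("DGX FP8 EAGLE3", 24.5, 52.7),
        ("DGX FP8 DFlash", 41.4, 52.7),
    ]
    for label, spec, vanilla in pairs:
        print(f"{label:24s} {relative_change_pct(spec, vanilla):+.1f}%")
```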

eugr · February 10, 2026, 8:30am

This one? 1. Overview — cuda-binary-utilities 13.1 documentation

It is for Blackwell in general, but it lists both sm100 and sm120. Since sm121 is nearly identical to sm120 (other than unified memory quirks), it should be applicable.

flash3 · February 10, 2026, 9:05am

Best proof. Thanks for the link. The tmem[URX] addressing format listed under valid destinations confirms it — Tensor Memory is a first-class address space on SM100. On SM120/SM121, this address space simply does not exist.

FP4/FP8 MMA Instructions

In the regular floating point section, we find:

OMMA — “FP4 Matrix Multiply and Accumulate Across a Warp”
QMMA — “FP8 Matrix Multiply and Accumulate Across a Warp”
HMMA — “Matrix Multiply and Accumulate” (FP16, present since Turing)

OMMA and QMMA are warp-level MMA instructions — these work on SM120/SM121. These are the tensor core instructions that CUTLASS accesses through the CuTe abstraction layer.

The UTC*MMA instructions (UTCHMMA, UTCOMMA, UTCQMMA) are the tensor memory coupled variants — they read/write directly from/to TMEM. These are the instructions that correspond to the tcgen05.mma PTX. They exist only on SM100!!!

What this means for the analysis

The difference:

SM120/SM121 (DGX Spark + RTX PRO 6000):

Uses OMMA (FP4) and QMMA (FP8) — warp-level MMA
Data comes from registers, loaded from shared memory
FP4 must go through: Shared Memory → Registers → unpack → Tensor Core

SM100 (B200 Datacenter):

Uses UTCOMMA (FP4) and UTCQMMA (FP8) — tensor memory coupled
Data flows: Shared Memory → UTCCP → Tensor Memory → UTCOMMA → computed directly
FP4 is expanded inside the tensor core itself, no register/shared memory overhead

This is the ISA-level confirmation of everything discussed. OMMA vs UTCOMMA: that is the difference between “FP4 works but doesn’t scale” and “FP4 scales natively.” NVIDIA documents this publicly, just without the footnote that “SM120 doesn’t have UTCOMMA.”

The best chance of getting this to work properly is NVIDIA's CUTLASS. They should know best how to handle it.

And yes, my tests show minor performance gains when CUTLASS, tile tuning, speculative decoding, and the fixes and patches (see footnotes) all come together.

And keep an eye on the math test results. They degrade, on the DGX much more than on the RTX. This could be driver-related, as I reported regarding tooling with larger context. It is a very good challenge.

From a progressive perspective, it could be cheaper and leave more room for optimization if it’s in software than in separate pipelines in hardware. But then business moves from selling the product to sell and support. Of course, it’s an even harder challenge if the room that’s left for compensating the missing hardware in software is also smaller. Let’s teleport 12 months into the future … will it be widely implemented by then, or will it have stayed in its niche? But these questions have been asked in different threads already.

eugr · February 10, 2026, 10:03pm

There is some work going on with flashinfer and CuTeDSL, e.g. this one: [Draft][Cute,Fwd,Sm120] FA Cute DSL sm12x by johnnynunez · Pull Request #2222 · Dao-AILab/flash-attention · GitHub

christopher_owen · February 10, 2026, 10:35pm

This is awesome. Also, it led me to this: 1. Introduction — PTX ISA 9.1 documentation

flash3 · February 10, 2026, 10:42pm

further findings…

EAGLE3 itself doesn’t degrade quality (BF16/FP8 stays at 80% with and without EAGLE3). But in NVFP4+EAGLE3 interaction: the drafter runs in BF16, the target in NVFP4 — the hidden state forwarding suffers from FP4 quantization, leading to subtly different verification outcomes. So it degrades.

Qwen3-Coder Speculative Decoding Benchmarks

BF16: stable, no effect from speculation

Platform	Engine	Spec	tok/s	Math
DGX	SGLang	—	31.7	78%
DGX	SGLang	EAGLE3	16.4	78%
DGX	SGLang	DFlash	17.6	78%
DGX	vLLM	EAGLE3 NST=1	33.2	80%
DGX	vLLM	EAGLE3 NST=2	24.2	80%
DGX	vLLM	EAGLE3 NST=3	28.7	80%
DGX	vLLM	EAGLE3 NST=4	17.3	80%
DGX	vLLM	EAGLE3 NST=5	18.6	80%
Spiegel 2	vLLM	—	143.2	80%
Spiegel 2	vLLM	EAGLE3	159.0	80%

BF16 math stable at 78–80%; no effect from EAGLE3 or DFlash.

FP8 + speculation: NO degradation

Platform	Engine	Spec	tok/s	Math
DGX	SGLang	—	55.0	78%
DGX	SGLang	EAGLE3	21.2	78%
DGX	SGLang	DFlash	—	80%
DGX	vLLM	EAGLE3 NST=1	57.3	80%
DGX	vLLM	EAGLE3 NST=2	47.1	82%
DGX	vLLM	EAGLE3 NST=3	52.8	78%
DGX	vLLM	EAGLE3 NST=4	40.5	80%
DGX	vLLM	EAGLE3 NST=5	36.8	80%
Spiegel 2	vLLM	—	125.9	80%
Spiegel 2	vLLM	EAGLE3	182.2	80%
Spiegel 2	vLLM	EAGLE3 NST=1	160.2	80%
Spiegel 2	vLLM	EAGLE3 NST=2	171.4	80%
Spiegel 2	vLLM	EAGLE3 NST=3	173.4	80%
Spiegel 2	vLLM	EAGLE3 NST=4	156.3	80%

FP8 math stable at 78–82%; no effect from speculation, regardless of NST.

NVFP4 + speculation: DEGRADATION of 2–12 percentage points

Platform	Engine	Spec	tok/s	Math	Delta
DGX	vLLM	—	65.0	74%	Baseline
DGX	vLLM	EAGLE3 NST=1	72.5	72%	-2pp
DGX	vLLM	EAGLE3 NST=3	67.0	68%	-6pp
DGX	vLLM	EAGLE3 NST=4	48.1	76%	+2pp
DGX	SGLang	—	66.0	74%	Baseline
DGX	SGLang	EAGLE3	27.4	70%	-4pp
DGX	SGLang	DFlash	54.4	74%	0
Spiegel 2	vLLM	—	157.9	80%	Baseline
Spiegel 2	vLLM	EAGLE3 (alt)	183.4	74%	-6pp
Spiegel 2	vLLM	EAGLE3 NST=1	170.7	72%	-8pp
Spiegel 2	vLLM	EAGLE3 NST=3	176.4	72%	-8pp
Spiegel 2	vLLM	EAGLE3 NST=5	157.9	68%	-12pp
Spiegel 2	SGLang	—	139.5	74%	Baseline
Spiegel 2	SGLang	EAGLE3	84.1	70%	-4pp
Spiegel 2	SGLang	DFlash	166.8	70%	-4pp

Degradation is stronger on Spiegel 2 (up to -12pp). Higher NST values make the problem worse.

Summary

Quant	+ Speculation	Math	Conclusion
BF16	EAGLE3 / DFlash	78–80%	✅ Identical to vanilla
FP8	EAGLE3 / DFlash	78–82%	✅ Identical to vanilla
NVFP4	EAGLE3	68–76%	⚠️ 2–12pp loss vs 74–80% vanilla
NVFP4	DFlash	70–74%	⚠️ 0–4pp loss

Cause: the EAGLE3 drafter was trained on BF16 hidden states. NVFP4-quantized hidden states deviate from these → draft quality drops → faulty tokens get accepted. Higher NST values make the problem worse because more tokens are accepted on the basis of faulty drafts.

Also, the RTX PRO 6000 performs better with NST>1, while the DGX is only good with NST=1. I did not mention it again, but it's the minimum again.

flash3 · February 11, 2026, 8:21am

This could help reduce degradation in NVFP4, and the further degradation when speculative decoding is used on NVFP4. Is this widely adopted? Will existing NVFP4 models be requantized?

flash3 · February 13, 2026, 5:25pm

Criterion	AWQ	NV FP4	INT4 AutoRound
Core Principle	Activation-aware scaling	Hardware-native FP4	Optimized rounding via SignSGD
Quality	Very good	Decent	Very good to excellent
Calibration Effort	Low (few samples)	None	Medium (~200 iterations)
Compatibility	Broad (vLLM, TGI, etc.)	Blackwell only	GPTQ/AutoGPTQ compatible
Perplexity Retention	Very good	Good	Tends to be best
Inference Speed	Good	Fastest (native HW)	Good (GPTQ backend)
Ecosystem/Support	Broadest	NVIDIA stack only	Growing
Error Minimization	Very good	Weaker	Best
Recommendation	Best trade-off	Max throughput	Best quality

Since the DGX Spark is already running NVFP4 dequantization through a shared memory path rather than dedicated tensor core units, that same compute path could theoretically be repurposed:
Store weights as INT4 with AWQ/AutoRound quantization parameters.
Use the shared memory dequantization path that’s already there for FP4.
Apply the smarter scaling factors from AWQ/AutoRound instead of the naive FP4 format.

This would essentially be a custom CUDA kernel that “hijacks” the existing memory bandwidth path but applies better quantization math. So if someone gets their hands on this, they could turn the DGX into magic.
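To make "quantization parameters" concrete, here is a minimal sketch of plain group-wise INT4 quantization with one scale per group. This is only the round-to-nearest baseline, not AWQ or AutoRound: those methods choose better scales or rounding decisions, but they store the same kind of data (4-bit integers plus per-group scales). The function names and the group contents are assumptions for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group.

    Returns the integer codes and the single scale shared by the group.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4
    qmin = -qmax - 1                    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Reconstruct approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    group = [14.0, -6.0, 0.0, 2.0]
    codes, scale = quantize_group(group)
    print(codes, scale, dequantize_group(codes, scale))
```

A kernel along the lines described above would do the `dequantize_group` step on the fly after loading the packed codes, so the memory traffic stays at roughly 4 bits per weight plus the scales.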

raphael.amorim · February 13, 2026, 6:04pm

Unfortunately, my German is a bit rusty.

flash3 · February 13, 2026, 6:47pm

Numbers speak for themselves. Since I was labeled a “molt poster” for wanting to format, possibly translate, and polish things up — using AI, of course — I’d rather leave some things in their original form.

flash3 · February 13, 2026, 10:41pm

Source code is the best manual: vLLM can already process AutoRound INT4 out of the box.

The kernel path is:

AutoRoundConfig → dispatches to GPTQMarlinConfig → GPTQMarlinLinearMethod

Marlin loads INT4-packed weights from memory, dequantizes on-the-fly to FP16, computes via FP16 Tensor Cores (mma.sync.aligned.m16n8k16)

W4A16 is the default (INT4 weights → FP16 dequant → FP16 MMA). This works immediately without any patches.
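The "load packed INT4, dequantize on the fly" step can be illustrated in a few lines. This is a simplified sketch, not Marlin's real code: the actual kernel uses a permuted weight layout and works on registers in CUDA. The nibble order, the zero point of 8 and the function names are assumptions chosen for readability.

```python
def pack_int4(values):
    """Pack eight unsigned 4-bit codes (0..15) into one 32-bit word, lowest nibble first."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Extract the eight 4-bit codes from a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize(word, scale, zero_point=8):
    """Turn one packed word into eight floating-point weights.

    In a W4A16 kernel the result would be FP16 values fed to the MMA instruction;
    plain Python floats stand in for them here.
    """
    return [(q - zero_point) * scale for q in unpack_int4(word)]


if __name__ == "__main__":
    word = pack_int4([8, 9, 7, 0, 15, 8, 8, 8])
    print(hex(word), dequantize(word, scale=0.5))
```

The point is that only 4 bits per weight cross the memory bus, while the arithmetic itself still runs in 16-bit floating point.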

W4A8 (FP8 MMA) was also already possible on SM120 using VLLM_MARLIN_INPUT_DTYPE=fp8. The code in generate_kernels.py and marlin.cu already explicitly included SM120. Only SM121 (DGX Spark) was blocked, because the check was == 120 instead of >= 120.

New patches:

Python: is_device_capability(120) → is_device_capability_family(120) (matches SM12.x)

C++: major*10+minor == 120 → major == 12
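Why the exact comparison locks out the Spark is easy to see with a toy version of the two checks. The helpers below are hypothetical stand-ins written for this post, not vLLM's implementation; they only mirror the logic of the two patches (exact capability code vs. same major version).

```python
def capability_code(major, minor):
    """Compute capability as a single integer, e.g. (12, 1) -> 121."""
    return major * 10 + minor


def matches_exact(major, minor, target):
    """Old behaviour: only the exact capability code passes."""
    return capability_code(major, minor) == target


def matches_family(major, minor, target):
    """Patched behaviour: any device with the same major version passes."""
    return major == target // 10


if __name__ == "__main__":
    devices = {"RTX PRO 6000 (SM120)": (12, 0), "DGX Spark (SM121)": (12, 1)}
    for name, (major, minor) in devices.items():
        print(name, matches_exact(major, minor, 120), matches_family(major, minor, 120))
```

With the exact check, SM121 fails against a target of 120; with the family check, both SM120 and SM121 pass, while other major versions are still rejected.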

Quant	Image	tok/s (medium)	tok/s (long)	Math (50)	Memory
INT4 W4A16	vllm-next	54.9	53.2	88%	8 GiB
INT4 W4A8	vllm-next	55.1	54.2	88%	8 GiB
NVFP4	vllm-next	43.1	42.0	86%	8 GiB
FP8	vllm-glm	42.4	41.4	94%	16 GiB
BF16	vllm-glm	27.3	26.7	94%	32 GB

Tests with GLM-4.7-Flash.
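For a quick comparison, the relative speedups can be computed directly from the table. This is a small sketch using only the numbers above; `RESULTS` and `speedup` are names invented for this example.

```python
# (tok/s medium, tok/s long, math score in %) copied from the table above.
RESULTS = {
    "INT4 W4A16": (54.9, 53.2, 88),
    "INT4 W4A8": (55.1, 54.2, 88),
    "NVFP4": (43.1, 42.0, 86),
    "FP8": (42.4, 41.4, 94),
    "BF16": (27.3, 26.7, 94),
}


def speedup(quant, baseline="BF16", column=0):
    """Throughput ratio of `quant` over `baseline` (column 0 = medium, 1 = long)."""
    return RESULTS[quant][column] / RESULTS[baseline][column]


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name:11s} {speedup(name):.2f}x vs BF16, "
              f"{speedup(name, 'NVFP4'):.2f}x vs NVFP4")
```

On the medium prompts, INT4 W4A16 via Marlin comes out at roughly 2.0x BF16 and about 1.27x NVFP4, while NVFP4 and FP8 both land near 1.55–1.58x BF16.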

eugr · February 14, 2026, 5:11am

Yeah, Marlin kernel is very efficient on Spark, even though sm12x doesn’t support INT4 natively.
