Bfloat16 Quality = Speed?

joshua.dale.warner · May 7, 2026, 6:03am

--dtype controls the precision used for computation and activations. bfloat16 is appropriate for Int4-Autoround because that format is actually W4A16: 4-bit weights with 16-bit activations (the A16). Setting it changes nothing for that quant type, but it would matter for FP8 or NVFP4, because those are most often W8A8 or W4A4, where the activations are not 16-bit to begin with.
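To make the W4A16 point concrete, here is a minimal NumPy sketch of what "4-bit weights, 16-bit activations" means. It is an illustration under assumptions, not how any particular runtime implements it: NumPy has no bfloat16, so float16 stands in for the 16-bit dtype; the quantizer is plain round-to-nearest with per-group scales, not the actual AutoRound algorithm; and the function names and group size are made up for the example.

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """Symmetric per-group 4-bit quantization (illustrative, not AutoRound).

    Returns integer codes in [-8, 7] (stored in int8) and one 16-bit scale per group.
    """
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid dividing by zero for all-zero groups
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales.astype(np.float16)

def w4a16_matmul(x16, codes, scales, out_features):
    """Dequantize the 4-bit weights to 16-bit, then multiply by 16-bit activations."""
    w16 = (codes.astype(np.float16) * scales).reshape(out_features, -1)
    return x16 @ w16.T

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)).astype(np.float32)    # "full precision" weights
x16 = rng.standard_normal((2, 64)).astype(np.float16)  # activations are already 16-bit

codes, scales = quantize_int4(w)
y = w4a16_matmul(x16, codes, scales, out_features=8)

print(y.dtype, y.shape)  # the matmul runs, and returns, in the 16-bit dtype
```

The weights are the only thing stored in 4 bits; the activations and the matmul stay in the 16-bit dtype. In a W8A8 or W4A4 scheme the activations would be quantized as well, so a 16-bit compute dtype is no longer the native activation precision.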

The KV cache is separate. --kv-cache-dtype fp8 I think selects one of the underlying datatypes, fp8_e4m3 or fp8_e5m2 (4 exponent bits and 3 mantissa bits versus 5 and 2). The former is earlier, more compatible, and generally thought to be inferior to e5m2; however, some implementations or combinations are incompatible with e5m2. A good example is the Gemma4-31B-it model with MTP that I posted about here: Gemma4 draft models are now available - #8 by joshua.dale.warner. I tried explicit fp8_e5m2 and it crashed, so that combination is incompatible. It works great with e4m3.
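For reference, the two FP8 layouts trade precision against range. The sketch below derives the largest finite value and the relative step size from the bit layout alone. It assumes the common OCP-style definitions: E5M2 follows IEEE conventions (top exponent code reserved for inf/NaN), while E4M3 is the "FN" variant that keeps the top exponent code for normal numbers and reserves only the all-ones mantissa there for NaN. The class and field names are hypothetical, and the sketch does not tell you which variant a given runtime picks for plain fp8.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fp8Format:
    name: str
    exp_bits: int
    man_bits: int
    ieee_like: bool  # True: top exponent code is reserved for inf/NaN

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_finite(self) -> float:
        top = 2 ** self.exp_bits - 1
        if self.ieee_like:
            # Largest usable exponent code is top - 1, with a full mantissa.
            exp_code = top - 1
            frac = (2 ** self.man_bits - 1) / 2 ** self.man_bits
        else:
            # "FN" style: top exponent code is usable, all-ones mantissa is NaN.
            exp_code = top
            frac = (2 ** self.man_bits - 2) / 2 ** self.man_bits
        return 2.0 ** (exp_code - self.bias) * (1.0 + frac)

    @property
    def epsilon(self) -> float:
        # Relative spacing between adjacent normal values.
        return 2.0 ** -self.man_bits

E4M3 = Fp8Format("fp8_e4m3", exp_bits=4, man_bits=3, ieee_like=False)
E5M2 = Fp8Format("fp8_e5m2", exp_bits=5, man_bits=2, ieee_like=True)

for fmt in (E4M3, E5M2):
    print(f"{fmt.name}: max finite = {fmt.max_finite}, relative step = {fmt.epsilon}")
```

Under these assumptions e4m3 tops out at 448 with a relative step of 0.125, while e5m2 reaches 57344 with a coarser step of 0.25.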
