North Mini Code in NVFP4 — ~1.65x over FP8, 40% less memory, zero quality loss

jeremyk · June 17, 2026, 4:26pm

Hey all,

I just put up two Spark Arena runs of North Mini Code 1.0 — an FP8 reference and an NVFP4 quant we made — to see what the GB10’s native FP4 support buys us. It’s Cohere’s first open agentic coding model: a 30B MoE (3B active), Apache 2.0, built for exactly the kind of run-it-yourself, sovereign setup the Spark is great for. Blog here: North Mini Code: Agentic Coding Model for Developers | Cohere

The results, same model / same recipe / same Spark, only the quant changed:

Single user @ 16K context (realistic): ~52 tok/s on NVFP4 vs ~32 on FP8 → ~1.65x faster
Two concurrent users: scales to ~84 tok/s aggregate (the Spark Arena figure)
Memory: 17 GB weights vs 28 GB → ~40% smaller footprint
Quality: identical HumanEval across NVFP4 and FP8 — no measurable loss
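For anyone who wants to check the headline ratios, here is a minimal sketch of the arithmetic using only the rounded figures quoted above. The function names are made up for illustration. The rounded inputs come out a little under the quoted ~1.65x, so that figure presumably comes from the unrounded measurements.

```python
# Figures as quoted in the post (rounded, approximate).
nvfp4_tok_s = 52.0       # single user @ 16K context, NVFP4
fp8_tok_s = 32.0         # single user @ 16K context, FP8
nvfp4_weights_gb = 17.0  # weight footprint, NVFP4
fp8_weights_gb = 28.0    # weight footprint, FP8


def speedup(new_rate: float, old_rate: float) -> float:
    """Throughput ratio of the new configuration over the old one."""
    return new_rate / old_rate


def footprint_reduction(new_size: float, old_size: float) -> float:
    """Fraction of memory saved by the new configuration."""
    return 1.0 - new_size / old_size


decode_speedup = speedup(nvfp4_tok_s, fp8_tok_s)
memory_saving = footprint_reduction(nvfp4_weights_gb, fp8_weights_gb)

print(f"decode speedup: {decode_speedup:.2f}x")
print(f"weight memory saving: {memory_saving:.0%}")
```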

Benchmarks & Recipe:

Both run on a single Spark (tensor parallel 1) under vLLM with FP8 KV cache, tool calling + reasoning via the cohere_command4 parsers. Recipes and full PP/TG-vs-concurrency logs are on both pages if you want to reproduce.
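As a rough sketch of what that setup looks like, the snippet below assembles a `vllm serve` command line with the settings described above. It is not the published recipe, which may use different or additional flags. The model ID is an assumption taken from the Spark Arena page name, and whether a given vLLM build registers the `cohere_command4` parsers is not verified here. The script only prints the command; it does not launch anything.

```python
import shlex


def build_vllm_command(model: str, parser: str = "cohere_command4") -> list[str]:
    """Assemble a vLLM launch command for a single-GPU, FP8-KV-cache setup."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "1",   # single Spark, no tensor parallelism
        "--kv-cache-dtype", "fp8",       # FP8 KV cache, as described in the post
        "--enable-auto-tool-choice",     # let the model decide when to call tools
        "--tool-call-parser", parser,    # parser name taken from the post
        "--reasoning-parser", parser,
    ]


# Assumption: the model ID matches the Spark Arena page name.
MODEL_ID = "XanuNetworks/North-Mini-Code-1.0-NVFP4"

cmd = build_vllm_command(MODEL_ID)
print(shlex.join(cmd))
```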

Fun side note: looks like this is the only Cohere model on the board so far, so a shout out to the Cohere folks for putting out such a solid little agentic coding model. Getting ~1.65x and a 40% smaller footprint for no quality hit makes it a really nice fit for the Spark.

Would love to hear how it runs on other people’s setups, and if anyone wants to stress the quant on heavier coding workloads than HumanEval, I’m all ears. Feedback welcome!

Cheers!

coder543 · June 17, 2026, 5:23pm

I think any 4-bit quant can get those output tok/s benefits, since it is just memory bandwidth bound, and 4-bit models are about the same size.
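A back-of-envelope sketch of that bandwidth argument: if decode is purely bound by reading the active weights once per token, the ceiling is bandwidth divided by bytes read per token. The 273 GB/s figure is an assumption based on the GB10 spec sheet, and the model ignores KV cache reads, quantization scales, and any layers left in higher precision, so real numbers land well below these ceilings.

```python
def decode_ceiling_tok_s(active_params_b: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode tok/s if only active-weight reads cost anything."""
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


ACTIVE_PARAMS_B = 3.0   # from the post: 30B MoE with 3B active
BANDWIDTH_GB_S = 273.0  # assumption: GB10 spec-sheet memory bandwidth

fp8_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS_B, 8, BANDWIDTH_GB_S)
fp4_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS_B, 4, BANDWIDTH_GB_S)

print(f"FP8 ceiling:   {fp8_ceiling:.0f} tok/s")
print(f"4-bit ceiling: {fp4_ceiling:.0f} tok/s")
print(f"ideal ratio:   {fp4_ceiling / fp8_ceiling:.1f}x")
```

Under this idealized model any 4-bit format gets the same 2x ceiling over FP8, regardless of whether it is NVFP4, AWQ, or something else, which is the point being made. The measured ~1.65x sits below that because the costs ignored here do not shrink with the weights.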

I could be wrong, but I think the real potential benefit of NVFP4 is more efficient use of the tensor cores for prefill (prompt processing). It would be interesting to see how many tokens/sec you’re getting for that.

Unfortunately, in my testing, North Mini Code just doesn’t seem to be good enough for me to have any great use for it yet, but I look forward to a future version 2.

jeremyk · June 17, 2026, 5:29pm

NVFP4 PP:

FP8 PP:

VCR · June 17, 2026, 7:32pm

Very interested in the Cohere model and anything NVFP4. A realistic context length for me is upwards of 128k. Until the degradation curve improves so that reliability keeps pace with context length, I’ll stay excited for their future releases.

jeremyk · June 17, 2026, 9:18pm

For sure. It does go up to 256k context, but tok/s slows as you scale up; it looks comparable to the slowdown you see with a Qwen coder model. What I like about this model is that it’s small enough, it codes well, and it seems to know when to call a tool properly, which isn’t always the case with coding-focused models and is becoming more and more important in my day to day. Here’s a more detailed writeup I found:

wcw · June 18, 2026, 5:16pm

I played around with bf16 and fp8 for a little bit when it came out. The performance, even at bf16, seemed to be quite good, and it scored well with tool-eval-bench. However, when I went to use it with Claude Code, it seemed to have all sorts of problems actually generating code, and the tool calls it wanted me to approve seemed somewhat suspect. I’m afraid this was just a quick test, so it’s quite possible I was doing something wrong in the process.

jeremyk · June 18, 2026, 10:53pm

If you try out my spark-arena sparkrun vLLM recipe it might help, because it uses the cohere_command4 tool + reasoning parsers. It was merged into the sparkrun GitHub registry recently, or you can see the recipe here: XanuNetworks/North-Mini-Code-1.0-NVFP4 - Spark Arena Benchmark
