North Mini Code in NVFP4: ~1.65x over FP8, 40% less memory, no measurable quality loss on HumanEval

Hey all,

I just put up two Spark Arena runs of North Mini Code 1.0, an FP8 reference and an NVFP4 quant we made, to see what the GB10’s native FP4 support buys us. North Mini Code is Cohere’s first open agentic coding model: a 30B MoE (3B active), Apache 2.0, built for exactly the kind of run-it-yourself, sovereign setup the Spark is great for. Blog here: North Mini Code: Agentic Coding Model for Developers | Cohere

The results, same model / same recipe / same Spark, only the quant changed:

  • Single user @ 16K context (realistic): ~52 tok/s on NVFP4 vs ~32 on FP8 → ~1.65x faster
  • Two concurrent users: scales to ~84 tok/s aggregate (the Spark Arena figure)
  • Memory: 17 GB weights vs 28 GB → ~40% smaller footprint
  • Quality: identical HumanEval across NVFP4 and FP8 — no measurable loss
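
For anyone who wants to sanity-check the headline numbers, here is a quick sketch that recomputes them from the rounded figures in the bullets above. The variable and function names are mine, not from the recipes, and because the inputs are rounded the speedup lands near, not exactly on, the quoted ~1.65x.

```python
# Rounded figures quoted in the bullets above (single user, 16K context).
NVFP4_TOK_S = 52.0
FP8_TOK_S = 32.0
NVFP4_WEIGHTS_GB = 17.0
FP8_WEIGHTS_GB = 28.0


def speedup(new_rate: float, old_rate: float) -> float:
    """How many times faster the new rate is than the old one."""
    return new_rate / old_rate


def reduction(new_size: float, old_size: float) -> float:
    """Fractional saving of new_size relative to old_size."""
    return 1.0 - new_size / old_size


decode_speedup = speedup(NVFP4_TOK_S, FP8_TOK_S)
memory_saving = reduction(NVFP4_WEIGHTS_GB, FP8_WEIGHTS_GB)

print(f"decode speedup: {decode_speedup:.2f}x")
print(f"weight memory saving: {memory_saving:.1%}")
```

The small gap between this result and ~1.65x is presumably down to rounding in the per-quant tok/s figures; the full logs on the Spark Arena pages are the place to get the unrounded values.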

Benchmarks & Recipe:

Both run on a single Spark (tensor parallel 1) under vLLM with FP8 KV cache, with tool calling and reasoning handled by the cohere_command4 parsers. Recipes and full PP/TG-vs-concurrency logs are on both Spark Arena pages if you want to reproduce.
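
If you just want a feel for the shape of the launch before opening the recipes, here is a rough sketch that assembles a `vllm serve` command from the settings described above. It is not the published recipe: `MODEL` is a hypothetical placeholder, and passing `cohere_command4` to both the tool-call and reasoning parser flags is my reading of the description, so check the Spark Arena pages for the real invocation.

```python
import shlex

# Hypothetical placeholder: substitute the checkpoint named in the recipe.
MODEL = "path-or-repo-id-of-the-checkpoint"


def build_vllm_command(model: str) -> list[str]:
    """Assemble a `vllm serve` argv that mirrors the setup described above."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "1",       # single Spark, TP=1
        "--kv-cache-dtype", "fp8",           # FP8 KV cache
        "--enable-auto-tool-choice",         # let the model decide when to call tools
        # Assumption: the same parser name is used for both flags.
        "--tool-call-parser", "cohere_command4",
        "--reasoning-parser", "cohere_command4",
    ]


cmd = build_vllm_command(MODEL)
print(shlex.join(cmd))  # copy-pasteable shell command
```

Building the argv as a list keeps it easy to diff the FP8 and NVFP4 runs: only the model argument should differ if the rest of the recipe is held constant.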

Fun side note: looks like this is the only Cohere model on the board so far, so a shout out to the Cohere folks for putting out such a solid little agentic coding model. Getting ~1.65x and a 40% smaller footprint for no quality hit makes it a really nice fit for the Spark.

Would love to hear how it runs on other people’s setups, and if anyone wants to stress the quant on heavier coding workloads than HumanEval, I’m all ears. Feedback welcome!

Cheers!

I think any 4-bit quant can get those output tok/s benefits, since decode is just memory-bandwidth bound and 4-bit models are all about the same size.
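
To make the bandwidth argument concrete, here is a back-of-the-envelope sketch of the decode ceiling: every generated token has to stream the active weights from memory once, so tok/s is capped at bandwidth divided by bytes read per token. The 273 GB/s bandwidth figure is an assumption on my part (swap in your own), and the model ignores KV cache reads, quantization scale overhead, and any layers left in higher precision.

```python
def decode_ceiling_tok_s(active_params: float,
                         bits_per_weight: float,
                         bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = active_params * bits_per_weight / 8.0
    return bandwidth_bytes_per_s / bytes_per_token


ACTIVE_PARAMS = 3e9   # "3B active" from the original post
BANDWIDTH = 273e9     # ASSUMPTION: ~273 GB/s memory bandwidth; adjust as needed

fp8_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, 8.0, BANDWIDTH)
four_bit_ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, 4.0, BANDWIDTH)

print(f"FP8 ceiling:   {fp8_ceiling:.0f} tok/s")
print(f"4-bit ceiling: {four_bit_ceiling:.0f} tok/s")
print(f"ideal ratio:   {four_bit_ceiling / fp8_ceiling:.2f}x")
```

The ceiling depends only on bits per weight, not on which 4-bit format is used, which is the point being made here. Measured numbers will sit below these bounds because of everything the model leaves out.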

I could be wrong, but I think the real potential benefit of NVFP4 is more efficient use of the tensor cores for prefill (prompt processing). It would be interesting to see how many tokens/sec you’re getting for that.

Unfortunately, in my testing, North Mini Code just doesn’t seem to be good enough for me to have any great use for it yet, but I look forward to a future version 2.

NVFP4 PP: [image]

FP8 PP: [image]

Very interested in the Cohere model and anything NVFP4. A realistic context length for me is upwards of 128k. Until the degradation curve improves so that reliability keeps pace with context length, I’ll stay excited for their future releases.