Hey all,
I just put up two Spark Arena runs of North Mini Code 1.0 — an FP8 reference and an NVFP4 quant we made — to see what the GB10’s native FP4 support buys us. It’s Cohere’s first open agentic coding model: a 30B MoE (3B active), Apache 2.0, built for exactly the kind of run-it-yourself, sovereign setup the Spark is great for. Blog here: North Mini Code: Agentic Coding Model for Developers | Cohere
The results, same model / same recipe / same Spark, only the quant changed:
- Single user @ 16K context (realistic): ~52 tok/s on NVFP4 vs ~32 on FP8 → ~1.65x faster
- Two concurrent users: scales to ~84 tok/s aggregate (the Spark Arena figure)
- Memory: 17 GB of weights on NVFP4 vs 28 GB on FP8 → ~40% smaller footprint
- Quality: identical HumanEval scores on NVFP4 and FP8, so no measurable loss on that benchmark
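If you want to sanity-check the ratios, here's a quick sketch using only the rounded figures quoted above. The variable names are mine. With these rounded inputs the speedup works out to about 1.6x and the weight reduction to about 39%, so I'd assume the ~1.65x headline comes from the unrounded numbers in the logs.

```python
# Rounded figures quoted in the post; the full logs may carry more digits.
nvfp4_tok_s = 52.0        # single user, 16K context, NVFP4
fp8_tok_s = 32.0          # single user, 16K context, FP8
nvfp4_weights_gb = 17.0   # weight footprint, NVFP4
fp8_weights_gb = 28.0     # weight footprint, FP8

# Throughput ratio of the quantized run over the FP8 reference.
speedup = nvfp4_tok_s / fp8_tok_s

# Fraction of weight memory saved relative to the FP8 reference.
footprint_reduction = 1.0 - nvfp4_weights_gb / fp8_weights_gb

print(f"speedup: {speedup:.2f}x")
print(f"weight footprint reduction: {footprint_reduction:.1%}")
```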
Benchmarks & Recipe:
- FP8: CohereLabs/North-Mini-Code-1.0-fp8 - Spark Arena Benchmark
- NVFP4: XanuNetworks/North-Mini-Code-1.0-NVFP4 - Spark Arena Benchmark
Both run on a single Spark (tensor parallel 1) under vLLM with FP8 KV cache, tool calling + reasoning via the cohere_command4 parsers. Recipes and full PP/TG-vs-concurrency logs are on both pages if you want to reproduce.
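For a rough idea of what that setup looks like as a launch command, here's a minimal sketch that just assembles and prints a `vllm serve` invocation for each model. It's my reconstruction from the description above, not a copy of the published recipe: the flags are standard vLLM options, the `cohere_command4` parser name is taken from this post and assumes your vLLM build registers it, and the helper `build_vllm_command` is made up for illustration. The recipes on the Spark Arena pages are the authoritative version.

```python
import shlex


def build_vllm_command(model_id: str) -> list[str]:
    """Assemble a vLLM launch command matching the setup described in the post.

    Assumption: the published recipe may set additional options
    (context length, memory utilization, etc.) that are not shown here.
    """
    return [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", "1",        # single Spark, no tensor parallelism
        "--kv-cache-dtype", "fp8",            # FP8 KV cache
        "--enable-auto-tool-choice",          # let the model decide when to call tools
        "--tool-call-parser", "cohere_command4",
        "--reasoning-parser", "cohere_command4",
    ]


# Model IDs as listed on the two benchmark pages.
for model_id in (
    "CohereLabs/North-Mini-Code-1.0-fp8",
    "XanuNetworks/North-Mini-Code-1.0-NVFP4",
):
    print(shlex.join(build_vllm_command(model_id)))
```

This only prints the commands, so you can eyeball them and compare against the recipe before running anything.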
Fun side note: this looks like the only Cohere model on the board so far, so a shout-out to the Cohere folks for putting out such a solid little agentic coding model. Getting ~1.65x the single-user throughput and a ~40% smaller weight footprint with no measurable HumanEval loss makes it a really nice fit for the Spark.
Would love to hear how it runs on other people’s setups, and if anyone wants to stress the quant on heavier coding workloads than HumanEval, I’m all ears. Feedback welcome!
Cheers!

