Mistral Small 4 Heretic NVFP4 Build for GB10

robert287 · June 9, 2026, 5:08pm

Spent some time getting an NVFP4 quant of Mistral Small 4 Heretic running. It seems rock solid, without the precision issues that plagued our earlier Heretic quants of Mistral Small 4.

Feel free to toy with it; it runs fine on a single-node GB10.

trithemius · June 12, 2026, 7:05am

Hi. Were you able to make it run with our community vLLM Docker image? I always get this error:

(APIServer pid=61) Value error, Failed to load mistral 'params.json' config for model GulfCoastAI/Mistral-Small-4-119B-Heretic-NVFP4. Please check if the model is a mistral-format model and if the config file exists. [type=value_error, input_value=ArgsKwargs((), {'model': …nderer_num_workers': 1}), input_type=ArgsKwargs]
(APIServer pid=61) For further information visit Validation Errors | Pydantic Docs

Thanks!

robert287 · June 12, 2026, 3:16pm

You’re loading it in Mistral format, but the HF repo is in HF compressed-tensors format. That mismatch is the entire error.

The repo has config.json + model.safetensors.index.json (HF format) and deliberately no params.json / tekken.json / consolidated-*.safetensors (the Mistral-native files). So any of these flags will trigger that exact error because vLLM goes looking for params.json and it isn’t there:

--load-format mistral
--config-format mistral
--tokenizer-mode mistral

Step 1 — drop the mistral-format flags. That clears the params.json error.
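
A minimal sketch of that check, assuming you have a local copy of the repo: it looks for the file names mentioned above to tell the two layouts apart, and strips the three mistral-format flags from an argument list. The helper names (detect_checkpoint_format, strip_mistral_flags) are made up for illustration and are not part of vLLM.

```python
from pathlib import Path

# File names are the ones listed in this post; everything else is hypothetical.
HF_MARKERS = ("config.json", "model.safetensors.index.json")
MISTRAL_MARKERS = ("params.json",)
MISTRAL_FLAGS = ("--load-format", "--config-format", "--tokenizer-mode")


def detect_checkpoint_format(model_dir):
    """Return 'mistral', 'hf', or 'unknown' based on which files exist."""
    root = Path(model_dir)
    if all((root / name).is_file() for name in MISTRAL_MARKERS):
        return "mistral"
    if all((root / name).is_file() for name in HF_MARKERS):
        return "hf"
    return "unknown"


def strip_mistral_flags(args):
    """Drop '--flag mistral' pairs and '--flag=mistral' items from an argument list."""
    cleaned = []
    i = 0
    while i < len(args):
        arg = args[i]
        # Two-token form: --load-format mistral
        if arg in MISTRAL_FLAGS and i + 1 < len(args) and args[i + 1] == "mistral":
            i += 2
            continue
        # Single-token form: --load-format=mistral
        if any(arg == f"{flag}=mistral" for flag in MISTRAL_FLAGS):
            i += 1
            continue
        cleaned.append(arg)
        i += 1
    return cleaned
```

If detect_checkpoint_format returns "hf" for your download, the mistral-format flags are the ones to remove before retrying.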

Step 2 — but it still won’t serve on stock vLLM, and you should know that up front so you don’t chase it: vLLM has no mistral4 HF text-model class, so it can’t resolve the inner language_model of the Mistral 4 multimodal config (it tends to misresolve as deepseek_v2 and break). This is the wall we hit too — it’s why the model card front-loads the caveat.

The supported path (it’s in the repo’s recipes/, this is exactly how I serve it):

recipes/convert-mistral4-to-native.py — converts the HF compressed-tensors repo → native Mistral format. It’s a byte-identical tensor rename plus splicing vanilla Mistral’s BF16 vision tower back in (so it resolves as PixtralForConditionalGeneration, not deepseek_v2). That produces the params.json + consolidated safetensors the mistral loader wants.
recipes/serve-mistral4-heretic-native.sh — serves the converted artifact with --load-format mistral and the SM12x env bundle.

The gotcha you’ll hit next: on Blackwell SM12x (GB10 / DGX Spark / consumer Blackwell) you must set VLLM_MLA_DISABLE=1. Without it, vLLM selects the TRITON_MLA decode kernel, which crashes on Mistral 4’s kv_lora_rank=256 with Cannot make_shape_compatible: incompatible dimensions ... 256 and 512. That’s an upstream attention-backend bug, independent of the quant; I filed it as vllm-project/vllm#45031. Setting VLLM_MLA_DISABLE=1 routes it to FLASH_ATTN and it serves clean. The serve script already sets it (alongside VLLM_NVFP4_GEMM_BACKEND=marlin and the rest).
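
For anyone launching from Python instead of the repo's script, here is a rough sketch of the same idea: set the two environment variables named above, then call vllm serve with --load-format mistral on the converted directory. The function name and the example path are hypothetical, and the script's remaining settings ("the rest") are not reproduced because they are not listed in this thread.

```python
import os

# Only the two variables quoted in this post; the real serve script sets more.
SM12X_ENV = {
    "VLLM_MLA_DISABLE": "1",              # avoid the TRITON_MLA decode path
    "VLLM_NVFP4_GEMM_BACKEND": "marlin",
}


def build_serve_invocation(native_model_dir, base_env=None):
    """Return (cmd, env) for serving the converted, Mistral-native artifact."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(SM12X_ENV)
    cmd = ["vllm", "serve", str(native_model_dir), "--load-format", "mistral"]
    return cmd, env


if __name__ == "__main__":
    # Hypothetical output directory of the conversion step.
    cmd, env = build_serve_invocation("/models/mistral4-heretic-native")
    print(" ".join(cmd))
    # To actually launch: subprocess.run(cmd, env=env, check=True)
```

Building the command and environment separately from launching makes it easy to print and inspect them before starting the server.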

0rand · June 13, 2026, 5:12pm

vLLM added Mistral 4 support in vanilla vLLM 0.21. I've used it on single and dual Sparks many times. No alchemy needed.
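
A quick way to see which side of that line an install falls on, as a sketch only: it reads the installed vllm package version and compares it with 0.21. The function names are made up, and the parser is deliberately crude (it stops at the first non-numeric component, so a pre-release such as 0.21.0rc1 is treated like 0.21).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text):
    """Turn '0.21.1.dev3+abc' into (0, 21, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def vllm_at_least(minimum="0.21", installed=None):
    """True if the installed (or supplied) vLLM version is >= minimum."""
    if installed is None:
        try:
            installed = version("vllm")
        except PackageNotFoundError:
            return False  # vLLM is not installed in this environment
    return parse_release(installed) >= parse_release(minimum)


if __name__ == "__main__":
    print("vLLM >= 0.21:", vllm_at_least())
```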
