Mistral Small 4 Heretic NVFP4 Build for GB10

Spent some time getting an NVFP4 quant of Mistral Small 4 Heretic running. It seems rock solid, without the numerical precision issues that plagued other Heretic quants of Mistral Small 4.

Feel free to toy with it; it runs fine on a single-node GB10.

Hi. Were you able to make it run with our community vLLM Docker image? I always get this error:

(APIServer pid=61) Value error, Failed to load mistral 'params.json' config for model GulfCoastAI/Mistral-Small-4-119B-Heretic-NVFP4. Please check if the model is a mistral-format model and if the config file exists. [type=value_error, input_value=ArgsKwargs((), {'model': …nderer_num_workers': 1}), input_type=ArgsKwargs]
(APIServer pid=61) For further information visit Validation Errors | Pydantic Docs

Thanks!

You’re loading it in Mistral format, but the HF repo is in HF compressed-tensors format. That mismatch is the entire error.

The repo has config.json + model.safetensors.index.json (HF format) and deliberately no params.json / tekken.json / consolidated-*.safetensors (the Mistral-native files). So any of these flags will trigger that exact error because vLLM goes looking for params.json and it isn’t there:

  • --load-format mistral
  • --config-format mistral
  • --tokenizer-mode mistral

Step 1 — drop the mistral-format flags. That clears the params.json error.
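If you want to check which layout a checkpoint uses before picking flags, here is a minimal sketch. It only looks at file names, using the marker files named in this post (params.json for Mistral-native, config.json for HF). The function names are my own, not part of vLLM, and the file lists in the example are illustrative.

```python
from typing import Iterable

# Marker files, as described in the post: params.json means Mistral-native,
# config.json means HF format.
MISTRAL_NATIVE_MARKER = "params.json"
HF_MARKER = "config.json"


def detect_checkpoint_format(filenames: Iterable[str]) -> str:
    """Return "mistral", "hf", or "unknown" from a top-level file listing."""
    names = set(filenames)
    if MISTRAL_NATIVE_MARKER in names:
        return "mistral"
    if HF_MARKER in names:
        return "hf"
    return "unknown"


def mistral_flags_are_safe(filenames: Iterable[str]) -> bool:
    """The mistral-format flags only make sense when params.json exists."""
    return detect_checkpoint_format(filenames) == "mistral"


# Illustrative listing matching the HF repo layout described above.
hf_repo_files = ["config.json", "model.safetensors.index.json"]
print(detect_checkpoint_format(hf_repo_files))
print(mistral_flags_are_safe(hf_repo_files))
```

To run this against a real repo you would feed it the repo's actual file listing (for example from a local directory listing) instead of the hard-coded list.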

Step 2 — but it still won’t serve on stock vLLM, and you should know that up front so you don’t chase it: vLLM has no mistral4 HF text-model class, so it can’t resolve the inner language_model of the Mistral 4 multimodal config (it tends to misresolve as deepseek_v2 and break). This is the wall we hit too — it’s why the model card front-loads the caveat.

The supported path (it’s in the repo’s recipes/, this is exactly how I serve it):

  1. recipes/convert-mistral4-to-native.py — converts the HF compressed-tensors repo → native Mistral format. It’s a byte-identical tensor rename plus splicing vanilla Mistral’s BF16 vision tower back in (so it resolves as PixtralForConditionalGeneration, not deepseek_v2). That produces the params.json + consolidated safetensors the mistral loader wants.
  2. recipes/serve-mistral4-heretic-native.sh — serves the converted artifact with --load-format mistral and the SM12x env bundle.
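To illustrate what "byte-identical tensor rename" means in step 1, here is a toy sketch. It is not the repo's conversion script: the prefix mapping is hypothetical, and plain bytes objects stand in for tensor payloads. The point is only that keys change while every payload stays bit-for-bit the same, which you can verify by comparing payload hashes before and after.

```python
import hashlib
from typing import Dict, List

# HYPOTHETICAL mapping for illustration only. The real key mapping lives in
# recipes/convert-mistral4-to-native.py and is not reproduced here.
HYPOTHETICAL_PREFIX_MAP = {
    "hf_style.layers.": "native_style.layers.",
}


def rename_keys(tensors: Dict[str, bytes], prefix_map: Dict[str, str]) -> Dict[str, bytes]:
    """Return a new dict with renamed keys; payloads are passed through untouched."""
    renamed: Dict[str, bytes] = {}
    for key, payload in tensors.items():
        new_key = key
        for old_prefix, new_prefix in prefix_map.items():
            if key.startswith(old_prefix):
                new_key = new_prefix + key[len(old_prefix):]
                break
        if new_key in renamed:
            raise ValueError(f"rename collision on {new_key!r}")
        renamed[new_key] = payload
    return renamed


def payload_digests(tensors: Dict[str, bytes]) -> List[str]:
    """Sorted SHA-256 digests of all payloads, independent of key names."""
    return sorted(hashlib.sha256(p).hexdigest() for p in tensors.values())


before = {"hf_style.layers.0.weight": b"\x01\x02\x03", "untouched.key": b"\x04"}
after = rename_keys(before, HYPOTHETICAL_PREFIX_MAP)
print(sorted(after))
print(payload_digests(before) == payload_digests(after))
```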

The gotcha you’ll hit next: on Blackwell SM12x (GB10 / DGX Spark / consumer Blackwell) you must set VLLM_MLA_DISABLE=1. Without it, vLLM selects the TRITON_MLA decode kernel, which crashes on Mistral 4’s kv_lora_rank=256 with "Cannot make_shape_compatible: incompatible dimensions ... 256 and 512". That’s an upstream attention-backend bug, independent of the quant; I filed it as vllm-project/vllm#45031. Setting VLLM_MLA_DISABLE=1 routes it to FLASH_ATTN and it serves clean. The serve script already sets it (alongside VLLM_NVFP4_GEMM_BACKEND=marlin and the rest).
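If you launch from Python rather than the shell script, the same setup might look roughly like the sketch below. It only covers the two environment variables and the one flag named in this post; the rest of the env bundle in the serve script is not reproduced, and the model path is a hypothetical placeholder. The actual launch call is left commented out.

```python
import os
from typing import Dict, List, Mapping

# Hypothetical local path to the converted (native Mistral format) checkpoint.
MODEL_DIR = "/models/mistral-small-4-heretic-native"


def build_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy the base environment and add the two variables named in the post."""
    env = dict(base)
    env["VLLM_MLA_DISABLE"] = "1"  # avoid the TRITON_MLA decode kernel on SM12x
    env["VLLM_NVFP4_GEMM_BACKEND"] = "marlin"
    return env


def build_command(model_dir: str) -> List[str]:
    """Serve the converted artifact with the mistral loader."""
    return ["vllm", "serve", model_dir, "--load-format", "mistral"]


env = build_env(os.environ)
cmd = build_command(MODEL_DIR)
print(" ".join(cmd))
# To actually launch (requires vLLM installed and the converted checkpoint):
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```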

vLLM added Mistral 4 support in vanilla vLLM 0.21. I’ve used it on single and dual Sparks many times. No alchemy needed.
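A quick way to check whether your install is new enough is to compare versions numerically rather than as strings. This is a small sketch; the 0.21 threshold is taken from this post rather than from release notes, and the helper names are my own.

```python
import re
from typing import Tuple

# Threshold taken from the post above; treat it as an assumption.
MIN_VERSION_FOR_MISTRAL4 = (0, 21)


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like "0.21.0" or "0.21.0rc1"."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_stock_mistral4_support(version: str) -> bool:
    """True if the version is at or above the assumed 0.21 threshold."""
    return major_minor(version) >= MIN_VERSION_FOR_MISTRAL4


for candidate in ["0.9.2", "0.20.3", "0.21.0", "0.21.0rc1"]:
    print(candidate, has_stock_mistral4_support(candidate))
```

Comparing tuples of integers matters here because a plain string comparison would rank "0.9.2" above "0.21.0". With vLLM installed, `importlib.metadata.version("vllm")` returns the version string to pass in.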