Spent some time getting an NVFP4 quant of a Mistral Small 4 Heretic running. Seems rock solid, without the precision issues that plagued our Heretic quants of Mistral Small 4.
Feel free to toy with it; it runs fine on a single-node GB10.
Hi. Were you able to make it run with our community vLLM Docker image? I always got this error:
(APIServer pid=61) Value error, Failed to load mistral 'params.json' config for model GulfCoastAI/Mistral-Small-4-119B-Heretic-NVFP4. Please check if the model is a mistral-format model and if the config file exists. [type=value_error, input_value=ArgsKwargs((), {'model': …nderer_num_workers': 1}), input_type=ArgsKwargs]
(APIServer pid=61) For further information visit Validation Errors | Pydantic Docs
Thanks!
You’re loading it in Mistral format, but the HF repo is in HF compressed-tensors format. That mismatch is the entire error.
The repo has config.json + model.safetensors.index.json (HF format) and deliberately no params.json / tekken.json / consolidated-*.safetensors (the Mistral-native files). So any of these flags will trigger that exact error because vLLM goes looking for params.json and it isn’t there:
--load-format mistral
--config-format mistral
--tokenizer-mode mistral
Step 1: drop the mistral-format flags. That clears the params.json error.
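If it helps to see the mismatch concretely, here is a minimal Python sketch. It is not part of the repo's recipes, and the helper names `detect_layout` and `drop_mistral_flags` are made up for illustration. It classifies a checkpoint by the marker files described above and strips the three mistral-format flags from an argument list.

```python
# Hypothetical helpers for illustration only; not from vLLM or the model repo.

# Flags that make vLLM look for the Mistral-native files (params.json etc.).
MISTRAL_FLAGS = {"--load-format", "--config-format", "--tokenizer-mode"}


def detect_layout(filenames):
    """Classify a checkpoint from its file listing, using the marker files named above."""
    names = set(filenames)
    if "params.json" in names:
        return "mistral"  # Mistral-native layout
    if "config.json" in names:
        return "hf"  # HF layout (config.json + safetensors index)
    return "unknown"


def drop_mistral_flags(args):
    """Remove `--flag mistral` and `--flag=mistral` for the three flags above."""
    cleaned = []
    skip_next = False
    for i, arg in enumerate(args):
        if skip_next:
            # This element is the "mistral" value of a flag we just dropped.
            skip_next = False
            continue
        if arg in MISTRAL_FLAGS and i + 1 < len(args) and args[i + 1] == "mistral":
            skip_next = True
            continue
        if "=" in arg:
            flag, value = arg.split("=", 1)
            if flag in MISTRAL_FLAGS and value == "mistral":
                continue
        cleaned.append(arg)
    return cleaned
```

Feeding it the file listing of the HF repo (config.json, model.safetensors.index.json, shards) should report the HF layout, which is the case where those flags have to go.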
Step 2 — but it still won’t serve on stock vLLM, and you should know that up front so you don’t chase it: vLLM has no mistral4 HF text-model class, so it can’t resolve the inner language_model of the Mistral 4 multimodal config (it tends to misresolve as deepseek_v2 and break). This is the wall we hit too — it’s why the model card front-loads the caveat.
The supported path (it’s in the repo’s recipes/, this is exactly how I serve it):
recipes/convert-mistral4-to-native.py: converts the HF compressed-tensors repo to native Mistral format. It’s a byte-identical tensor rename plus splicing vanilla Mistral’s BF16 vision tower back in (so it resolves as PixtralForConditionalGeneration, not deepseek_v2). That produces the params.json + consolidated safetensors the mistral loader wants.
recipes/serve-mistral4-heretic-native.sh: serves the converted artifact with --load-format mistral and the SM12x env bundle.
The gotcha you’ll hit next: on Blackwell SM12x (GB10 / DGX Spark / consumer Blackwell) you must set VLLM_MLA_DISABLE=1. Without it, vLLM selects the TRITON_MLA decode kernel, which crashes on Mistral 4’s kv_lora_rank=256 with Cannot make_shape_compatible: incompatible dimensions ... 256 and 512. That’s an upstream attention-backend bug, independent of the quant. I filed it as vllm-project/vllm#45031. VLLM_MLA_DISABLE=1 routes it to FLASH_ATTN and it serves clean. The serve script already sets it (alongside VLLM_NVFP4_GEMM_BACKEND=marlin and the rest).
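For anyone launching from Python instead of the shell script, a rough sketch of the same idea follows. It only covers the two environment variables named above; the serve script sets more, and those are not reproduced here. The function names and the `./mistral4-heretic-native` path are placeholders I made up, not names from the repo.

```python
import os

# Only the two variables named in the post; the real serve script sets others too.
SM12X_ENV = {
    "VLLM_MLA_DISABLE": "1",              # avoid the TRITON_MLA decode kernel
    "VLLM_NVFP4_GEMM_BACKEND": "marlin",  # NVFP4 GEMM backend used by the serve script
}


def build_serve_env(base_env=None):
    """Return a copy of the environment with the SM12x settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(SM12X_ENV)
    return env


def build_serve_command(model_dir):
    """Command line for serving the converted (native Mistral format) artifact."""
    return ["vllm", "serve", model_dir, "--load-format", "mistral"]


# Usage (placeholder path, not executed here):
#   import subprocess
#   subprocess.run(build_serve_command("./mistral4-heretic-native"),
#                  env=build_serve_env(), check=True)
```

Building the environment as a copy keeps the parent shell untouched, so the MLA override applies only to the vLLM process.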
vLLM added Mistral 4 support in vanilla vLLM 0.21. I’ve used it on single and dual Sparks many times. No alchemy needed.
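A quick way to tell which situation you are in is to compare the installed vLLM version against 0.21. The sketch below assumes the 0.21 threshold from the reply above is right (not independently verified), and `has_stock_mistral4` is a made-up helper name.

```python
import re

# Threshold taken from the reply above; treat it as an assumption.
MIN_VERSION = (0, 21)


def parse_major_minor(version):
    """Extract (major, minor) from strings like '0.21.0' or '0.21.0rc1+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_stock_mistral4(version):
    """True if this vLLM version is at or above the assumed threshold."""
    return parse_major_minor(version) >= MIN_VERSION


# Usage with the installed package (requires vLLM to be installed):
#   from importlib.metadata import version
#   print(has_stock_mistral4(version("vllm")))
```

If it returns False, the conversion recipe route is the one to try; if True, stock vLLM may be enough.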