Mistral-Small-4-119B-2603-NVFP4

bugsareyummy · June 5, 2026, 12:35pm

I have high hopes for Mistral-Small-4-119B-2603-NVFP4, but I haven’t figured out how to get it running on my pair of GB10s. Getting Devstral 2 to run was a challenge, and this is no exception for me. It seems to run out of memory in the "Parse safetensors files" stage of loading, using a fairly fresh build from spark-vllm-docker.

My recipe snippet so far:

```
defaults:
  port: 8355
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.8
  max_model_len: 262144

# Environment variables
env: {}

# The vLLM serve command template
command: |
  vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
    --enable-auto-tool-choice \
    --reasoning-parser mistral \
    --tool-call-parser mistral \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    -tp {tensor_parallel} \
    --distributed-executor-backend ray \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 128 \
    --attention-backend TRITON_MLA
```
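For anyone unfamiliar with the recipe format: the `{...}` placeholders in `command` are filled in from `defaults`. Below is a minimal Python sketch of that substitution, assuming plain `str.format`-style rendering. How spark-vllm-docker actually renders recipes is not confirmed here, and `render_command` is a hypothetical helper, not part of that project. Only a few of the flags are included to keep it short.

```python
# Hypothetical helper: fills {placeholders} in a recipe command from its defaults.
# Assumes str.format-style substitution; the real tool may work differently.
DEFAULTS = {
    "port": 8355,
    "host": "0.0.0.0",
    "tensor_parallel": 2,
    "gpu_memory_utilization": 0.8,
    "max_model_len": 262144,
}

COMMAND_TEMPLATE = (
    "vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 "
    "--gpu-memory-utilization {gpu_memory_utilization} "
    "--host {host} --port {port} "
    "--max-model-len {max_model_len} "
    "-tp {tensor_parallel}"
)


def render_command(template: str, defaults: dict, **overrides) -> str:
    """Return the command with placeholders replaced; overrides win over defaults."""
    values = {**defaults, **overrides}
    return template.format(**values)


if __name__ == "__main__":
    # Render once with the recipe defaults, once with a single value overridden.
    print(render_command(COMMAND_TEMPLATE, DEFAULTS))
    print(render_command(COMMAND_TEMPLATE, DEFAULTS, gpu_memory_utilization=0.9))
```

Overriding one value this way makes it easy to try a different `gpu_memory_utilization` without touching the rest of the recipe.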

The docker image has mistral_common and I’ve also tried to run with --load-format mistral and --safetensors-load-strategy lazy with no improvement.

Anyone manage to get this running on vLLM yet?

azampatti · June 5, 2026, 2:51pm

Try bumping gpu_memory_utilization up to 0.9, and make sure swap is enabled on your Spark.

Let me know if that works

robert287 · June 5, 2026, 5:43pm

Hrmmm, I’ve got an NVFP4 variant humming along on a single node, and it seems to be working fine.

0rand · June 5, 2026, 7:16pm

My script uses no-ray and it works with gpu_memory_utilization at 0.7; there is more than enough RAM for cache etc. It fits well on just one Spark, so the second is just a speed bonus.
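A rough back-of-envelope check of the "fits on one" claim, as a Python sketch. The assumptions are loud ones: 119B parameters taken from the model name, every parameter stored as NVFP4 (4-bit values plus one 8-bit scale per 16-element block), and 128 GB of unified memory on a GB10. Real checkpoints usually keep some layers at higher precision, so treat the weight figure as a lower bound, not a measurement.

```python
# Back-of-envelope memory budget. All figures are rough assumptions, not measurements.

def nvfp4_weight_bytes(num_params: float, block_size: int = 16) -> float:
    """Estimate NVFP4 weight size: 4-bit values plus one 8-bit scale per block."""
    bits_per_param = 4 + 8 / block_size  # 4.5 bits with 16-element blocks
    return num_params * bits_per_param / 8


def vllm_budget_bytes(total_memory_bytes: float, gpu_memory_utilization: float) -> float:
    """Memory vLLM may claim for weights, KV cache, and activations combined."""
    return total_memory_bytes * gpu_memory_utilization


if __name__ == "__main__":
    weights = nvfp4_weight_bytes(119e9)       # assumed: all 119B params quantized
    budget = vllm_budget_bytes(128e9, 0.7)    # assumed: 128 GB unified memory
    print(f"weights  ~{weights / 1e9:.1f} GB")
    print(f"budget   ~{budget / 1e9:.1f} GB")
    print(f"headroom ~{(budget - weights) / 1e9:.1f} GB for KV cache and activations")
```

Under these assumptions the weights come in well under the 0.7 budget on a single node, which is consistent with the report above.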

bugsareyummy · June 6, 2026, 9:17pm

Ah I figured it out. Adding this was my fix:

env:
  HF_HUB_OFFLINE: "1"
  TRANSFORMERS_OFFLINE: "1"

Most folks don’t need this though.
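If you launch vLLM from your own script rather than a recipe, the same two variables can be set on the child process. A minimal Python sketch: `offline_env` is a hypothetical helper, and it assumes the model is already in the local Hugging Face cache, since offline mode means nothing gets downloaded.

```python
import os


def offline_env(base=None):
    """Return a copy of the environment with Hugging Face offline mode enabled."""
    # Copy so the caller's mapping (or os.environ) is left untouched.
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return env


if __name__ == "__main__":
    env = offline_env()
    # Pass `env=env` to subprocess.run(...) when starting the server.
    print(env["HF_HUB_OFFLINE"], env["TRANSFORMERS_OFFLINE"])
```

Building a copy instead of mutating `os.environ` keeps the offline setting scoped to the server process.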

I also added --load-format mistral. Also, it appeared to automatically choose FLASHINFER_CUTLASS, if anyone is interested.
