Mistral-Small-4-119B-2603-NVFP4

I have high hopes for Mistral-Small-4-119B-2603-NVFP4, but I haven’t figured out how to get it running on my pair of GB10s. Getting Devstral 2 to run was a challenge, and this one is no exception. It seems to run out of memory in the "Parse safetensors files" stage of loading, using a fairly fresh build from spark-vllm-docker.

My recipe snippet so far:

```
defaults:
  port: 8355
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.8
  max_model_len: 262144

# Environment variables
env: {}

# The vLLM serve command template
command: |
  vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
    --enable-auto-tool-choice \
    --reasoning-parser mistral \
    --tool-call-parser mistral \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    -tp {tensor_parallel} \
    --distributed-executor-backend ray \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 128 \
    --attention-backend TRITON_MLA
```
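For anyone unsure how the `{placeholders}` in a recipe like this end up on the command line, here is a rough Python sketch of the substitution. It assumes the launcher fills the template with plain `str.format`-style replacement from `defaults`, which is a guess and not checked against spark-vllm-docker. `render_command` is a made-up helper name, and the template is shortened.

```python
import shlex

# Values copied from the recipe's `defaults` section.
DEFAULTS = {
    "port": 8355,
    "host": "0.0.0.0",
    "tensor_parallel": 2,
    "gpu_memory_utilization": 0.8,
    "max_model_len": 262144,
}

# Shortened copy of the recipe's command template.
TEMPLATE = """vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \\
  --gpu-memory-utilization {gpu_memory_utilization} \\
  --host {host} \\
  --port {port} \\
  --max-model-len {max_model_len} \\
  -tp {tensor_parallel}
"""


def render_command(template, defaults, **overrides):
    """Hypothetical helper: fill {placeholders}, then split into argv tokens.

    Assumption: the real launcher does something equivalent to str.format.
    """
    values = {**defaults, **overrides}
    rendered = template.format(**values)
    # Join backslash-newline continuations before tokenizing.
    rendered = rendered.replace("\\\n", " ")
    return shlex.split(rendered)


# Example: override one default, as you would when experimenting.
args = render_command(TEMPLATE, DEFAULTS, gpu_memory_utilization=0.9)
print(" ".join(args))
```

Printing the rendered command before launching makes it easy to spot a flag that lost its leading `--` or a value that did not substitute.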

The docker image has mistral_common and I’ve also tried to run with --load-format mistral and --safetensors-load-strategy lazy with no improvement.

Anyone manage to get this running on vLLM yet?

Try bumping gpu_memory_utilization up to 0.9, and make sure swap is enabled on your Spark.

Let me know if that works
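A quick way to confirm whether swap is actually on is to read `/proc/meminfo` on the Spark itself. The sketch below is only an illustration: `parse_meminfo`, `swap_enabled`, and `check_swap` are made-up helper names, and it assumes a standard Linux `/proc` layout.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_enabled(meminfo):
    """Swap counts as enabled when SwapTotal is present and non-zero."""
    return meminfo.get("SwapTotal", 0) > 0


def check_swap(path="/proc/meminfo"):
    """Read the live meminfo file and report swap size in GiB (Linux only)."""
    with open(path) as fh:
        info = parse_meminfo(fh.read())
    swap_gib = info.get("SwapTotal", 0) / 1024**2  # kB -> GiB
    return swap_enabled(info), swap_gib
```

Calling `check_swap()` on each node returns a `(enabled, size_in_GiB)` pair, so a `False` on either box is worth fixing before retrying the load.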

Hmm, I’ve got an NVFP4 variant humming along on a single node, and it seems to be working fine.

My script uses no-ray and works with 0.7; there is more than enough RAM for cache, etc. The model fits well on just one node, so the second is just a speed bonus.

Ah I figured it out. Adding this was my fix:

env:
  HF_HUB_OFFLINE: "1"
  TRANSFORMERS_OFFLINE: "1"

Most folks don’t need this though.
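If you launch vLLM from your own script instead of a recipe, the same two variables can be passed through the process environment. A minimal Python sketch, where `offline_env` is a made-up helper name. Note that offline mode means the model files must already be in the local Hugging Face cache, since nothing will be fetched from the Hub.

```python
import os

# The two switches from the fix above; both are read by the Hugging Face libraries.
OFFLINE_ENV = {"HF_HUB_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}


def offline_env(base=None):
    """Return a copy of the environment with the offline switches set.

    `base` defaults to the current process environment; it is not modified.
    """
    env = dict(os.environ if base is None else base)
    env.update(OFFLINE_ENV)
    return env
```

For example, `subprocess.run(cmd, env=offline_env())` would start the server with both variables set, where `cmd` is your own `vllm serve ...` argument list.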

I also added --load-format mistral. It also appeared to choose FLASHINFER_CUTLASS automatically, if anyone is interested.