I have high hopes for Mistral-Small-4-119B-2603-NVFP4, but I haven’t figured out how to get it running on my pair of GB10s. Getting Devstral 2 to run was a challenge, and this one is no exception. It seems to run out of memory during the "Parse safetensors files" stage of loading, using a fairly fresh build of spark-vllm-docker.
My recipe snippet so far:
```
defaults:
  port: 8355
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.8
  max_model_len: 262144

# Environment variables
env: {}

# The vLLM serve command template
command: |
  vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
    --enable-auto-tool-choice \
    --reasoning-parser mistral \
    --tool-call-parser mistral \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    -tp {tensor_parallel} \
    --distributed-executor-backend ray \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 128 \
    --attention-backend TRITON_MLA
```
The Docker image includes mistral_common, and I’ve also tried running with `--load-format mistral` and `--safetensors-load-strategy lazy`, with no improvement.
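For a rough sense of the headroom this recipe leaves, here is a back-of-the-envelope sketch of the per-node memory budget. It assumes 128 GB of unified memory per GB10 and uses a placeholder checkpoint size of 70 GB, which is only a guess for a 119B-parameter NVFP4 checkpoint and should be replaced with the real on-disk size. It also assumes tensor parallelism splits the weights evenly, which is only approximately true. The function name `per_node_budget` is made up for illustration.

```python
# Back-of-the-envelope per-node memory budget, all sizes in decimal GB.
# ASSUMPTIONS (not measured): 128 GB unified memory per GB10, and a
# placeholder checkpoint size of 70 GB. Replace both with real numbers.

def per_node_budget(total_mem_gb, gpu_memory_utilization, checkpoint_gb, tensor_parallel):
    """Split one node's memory into the vLLM budget and what is left outside it."""
    # Share of the node's memory that vLLM is allowed to claim.
    budget = total_mem_gb * gpu_memory_utilization
    # Idealized even split of the weights across tensor-parallel ranks.
    weights = checkpoint_gb / tensor_parallel
    return {
        "vllm_budget_gb": budget,
        "weights_per_node_gb": weights,
        "left_for_kv_and_activations_gb": budget - weights,
        "outside_vllm_budget_gb": total_mem_gb - budget,
    }

# Values from the recipe: gpu_memory_utilization 0.8, tensor_parallel 2.
estimate = per_node_budget(128.0, 0.8, 70.0, 2)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

Because the GB10 shares one memory pool between CPU and GPU, anything the loader stages on the host side, plus the OS, Ray, and any page cache from reading the checkpoint, has to fit in the part outside the vLLM budget. If that is what is running out, lowering `gpu_memory_utilization` might be worth a try, though this is only a guess from the arithmetic.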
Has anyone managed to get this running on vLLM yet?