Running Mistral Small 4 119B NVFP4 on NVIDIA DGX Spark (GB10)

I'm sorry for the confusion in my earlier reply — I’ve now done a proper comparison and here are the results.

Testing eugr/spark-vllm-docker vs avarok v23 with Mistral Small 4 119B NVFP4

Previous working setup (avarok v23)

The original guide used avarok/dgx-vllm-nvfp4-kernel:v23 with two required workarounds:

  1. Manual pip install --upgrade mistral_common inside the container (for tokenizer v15 support)

  2. VLLM_MLA_DISABLE=1 environment variable (MLA backends rejected head_size=320 on SM121)

Build eugr’s image

bash

git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh

Build completes in ~10 minutes on a single Spark using precompiled wheels. The “No host specified, skipping copy” message at the end is normal for single-node — the image is available locally as vllm-node.

Issue: reasoning_effort bug in vLLM 0.17.2rc1

vLLM 0.17.2rc1 (in eugr’s current build) unconditionally passes reasoning_effort to apply_chat_template, but mistral_common 1.10.0 doesn’t support it. Every inference request returns:

Kwargs ['reasoning_effort'] are not supported by MistralCommonTokenizer.apply_chat_template

This is the bug PR #37081 is fixing, but the PR doesn’t apply cleanly to current main (too many code changes since it was opened).
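To illustrate the failure mode, here is a hypothetical stand-in (not the real vLLM or mistral_common code; the function name mirrors the API, but the supported-kwarg set is illustrative): mistral_common 1.10.0 rejects any kwarg it does not recognize, so a caller that always forwards reasoning_effort fails on every request.

```python
# Hypothetical stand-in for mistral_common 1.10.0's strict kwarg handling.
# The supported set below is illustrative, not the real API surface.
def apply_chat_template(messages, **kwargs):
    supported = {"tools"}  # assumption: reasoning_effort absent pre-1.11
    unsupported = sorted(k for k in kwargs if k not in supported)
    if unsupported:
        raise ValueError(
            f"Kwargs {unsupported} are not supported by "
            "MistralCommonTokenizer.apply_chat_template"
        )
    return messages

# vLLM 0.17.2rc1 forwards reasoning_effort unconditionally, so this raises:
try:
    apply_chat_template([{"role": "user", "content": "hi"}], reasoning_effort="low")
except ValueError as e:
    print(e)
```

This reproduces the exact error string above, which is why every request fails regardless of whether the client asked for a reasoning effort.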

Workaround: one-line patch

Create a patched image with this fix baked in:

patch_mistral.py:

python

# Path to vLLM's Mistral tokenizer module inside the image.
path = '/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/mistral.py'
content = open(path).read()
# Current code: forwards reasoning_effort unconditionally for tokenizer v15+.
old = '        if self.version >= 15:\n            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")'
# Guarded version: forward reasoning_effort only when it is set and the
# installed mistral_common exposes REASONING_EFFORTS (the import acts as a
# capability probe; it is not otherwise used).
new = ('        if self.version >= 15:\n'
       '            _re = kwargs.get("reasoning_effort")\n'
       '            if _re is not None:\n'
       '                try:\n'
       '                    from mistral_common.protocol.instruct.messages import REASONING_EFFORTS\n'
       '                    version_kwargs["reasoning_effort"] = _re\n'
       '                except (ImportError, AttributeError):\n'
       '                    pass')
assert old in content, 'Pattern not found'
open(path, 'w').write(content.replace(old, new))
print('Patched successfully')
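Because the patch is a literal string replace, it is worth dry-running the pattern match before baking it into an image. A minimal sketch (the sample content is a stand-in, not the real mistral.py):

```python
# Dry-run sketch: apply the same literal replace used by patch_mistral.py
# to an in-memory sample, so a pattern mismatch fails fast instead of
# surfacing later inside the container.
old = ('        if self.version >= 15:\n'
       '            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")')
new = ('        if self.version >= 15:\n'
       '            _re = kwargs.get("reasoning_effort")\n'
       '            if _re is not None:\n'
       '                try:\n'
       '                    from mistral_common.protocol.instruct.messages import REASONING_EFFORTS\n'
       '                    version_kwargs["reasoning_effort"] = _re\n'
       '                except (ImportError, AttributeError):\n'
       '                    pass')

# Stand-in for the surrounding file content; only the old snippet matters.
sample = "def foo(self, **kwargs):\n" + old + "\n        return version_kwargs\n"
patched = sample.replace(old, new)
assert old in sample and old not in patched and new in patched
print("pattern OK")
```

The same assert-before-write pattern in patch_mistral.py means the Docker build aborts loudly if a future vLLM release reshuffles this code instead of silently producing an unpatched image.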

Dockerfile.patch:

dockerfile

FROM vllm-node
COPY patch_mistral.py /tmp/patch_mistral.py
RUN python3 /tmp/patch_mistral.py

bash

docker build -t vllm-node-patched -f Dockerfile.patch .

Working serve command (no VLLM_MLA_DISABLE needed)

bash

docker run \
  --name mistral-small-4 \
  --privileged --gpus all --rm \
  --network host --ipc=host \
  -v /path/to/Mistral-Small-4-119B-2603-NVFP4:/model \
  -v /home/$USER/flashinfer-cache:/root/.cache/flashinfer \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm-node-patched \
  vllm serve /model \
    --served-model-name mistral-small-4 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --host 0.0.0.0 --port 8005 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.75 \
    --max-model-len 40000 \
    --tool-call-parser mistral \
    --enable-auto-tool-choice
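Once up, the server speaks the standard OpenAI-compatible chat completions API. A minimal request sketch — the payload shape is the standard API, and the URL and model name follow the serve flags above (localhost is an assumption; substitute your Spark's address):

```python
import json

# Build a chat completions request matching the serve flags above:
# --served-model-name mistral-small-4, --port 8005.
payload = {
    "model": "mistral-small-4",
    "messages": [{"role": "user", "content": "What GPU is in a DGX Spark?"}],
    "max_tokens": 256,
}
url = "http://localhost:8005/v1/chat/completions"

# To actually send it (assumes the container above is running):
# import urllib.request
# req = urllib.request.Request(url, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
print(json.dumps(payload, indent=2))
```

With --enable-auto-tool-choice and the mistral tool-call parser, a `tools` array can be added to the same payload and the model will emit structured tool calls.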

MLA status

The vLLM 0.17.2rc1 build in eugr’s image fixes the MLA head_size=320 rejection on SM121 that blocked avarok v23. Confirmed by the startup log:

Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].
Using FlashAttention prefill for MLA

VLLM_MLA_DISABLE=1 is no longer needed.

Benchmark comparison (7-run warm average, 1000 tokens, same prompt)

| Setup | Image | MLA | mistral_common | Avg tok/s |
|---|---|---|---|---|
| Original guide | avarok v23 + upgraded mistral_common | ❌ Disabled | Manual upgrade required | 27.7 |
| This approach | eugr vllm-node-patched | ✅ Enabled | 1.10.0 (included) | 28.0 |

No performance regression from enabling MLA. eugr’s image is simpler — no manual mistral_common upgrade needed, just the one-time patch.
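For scale, the gap between the two setups works out to about 1%; a quick sanity check of the table's numbers:

```python
# Sanity-check the benchmark delta: ~1% faster with MLA enabled,
# well within run-to-run noise for a 7-run average.
avarok_tps = 27.7  # avarok v23, MLA disabled
eugr_tps = 28.0    # eugr patched image, MLA enabled

delta_pct = (eugr_tps - avarok_tps) / avarok_tps * 100
secs_1000_avarok = 1000 / avarok_tps  # wall time for the 1000-token run
secs_1000_eugr = 1000 / eugr_tps

print(f"{delta_pct:.1f}% faster; 1000 tokens: "
      f"{secs_1000_avarok:.1f}s vs {secs_1000_eugr:.1f}s")
# → 1.1% faster; 1000 tokens: 36.1s vs 35.7s
```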

Summary

eugr’s image is the cleaner path forward. The reasoning_effort patch is a temporary workaround until PR #37081 lands in main, at which point a simple rebuild with ./build-and-copy.sh will include the fix natively.
