I'm sorry for the confusion in my earlier reply. I've now done a proper comparison, and here are the results.
# Testing eugr/spark-vllm-docker vs avarok v23 with Mistral Small 4 119B NVFP4
## Previous working setup (avarok v23)

The original guide used `avarok/dgx-vllm-nvfp4-kernel:v23` with two required workarounds:

- Manual `pip install --upgrade mistral_common` inside the container (for tokenizer v15 support)
- `VLLM_MLA_DISABLE=1` environment variable (MLA backends rejected head_size=320 on SM121)
## Build eugr's image

```bash
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh
```
The build completes in ~10 minutes on a single Spark using precompiled wheels. The "No host specified, skipping copy" message at the end is normal for single-node use; the image is available locally as `vllm-node`.
## Issue: reasoning_effort bug in vLLM 0.17.2rc1

vLLM 0.17.2rc1 (in eugr's current build) unconditionally passes `reasoning_effort` to `apply_chat_template`, but mistral_common 1.10.0 doesn't support it, so every inference request fails with:

```
Kwargs ['reasoning_effort'] are not supported by MistralCommonTokenizer.apply_chat_template
```
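The failure mode is a kwarg-capability mismatch: the caller forwards a keyword argument that the installed library version doesn't accept. As a generic illustration of one way to guard against this (not vLLM's actual code; the function names below are hypothetical stand-ins), you can probe the callee's signature before forwarding the kwarg:

```python
import inspect

def supports_kwarg(func, name: str) -> bool:
    """True if func accepts keyword `name`, explicitly or via **kwargs."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

# Hypothetical stand-ins for the pre- and post-fix apply_chat_template:
def old_apply_chat_template(messages, tools=None):
    return "rendered"

def new_apply_chat_template(messages, tools=None, reasoning_effort=None):
    return "rendered"

print(supports_kwarg(old_apply_chat_template, "reasoning_effort"))  # False
print(supports_kwarg(new_apply_chat_template, "reasoning_effort"))  # True
```

The patch below takes a different but related approach: instead of inspecting the caller's signature, it probes whether mistral_common exposes the relevant symbol.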
This is the bug PR #37081 fixes, but that PR doesn't apply cleanly to current main (too many code changes since it was opened).
## Workaround: one-line patch

Create a patched image with the fix baked in.

`patch_mistral.py`:
```python
# Patch vLLM's Mistral tokenizer wrapper so reasoning_effort is only
# forwarded when the installed mistral_common actually supports it.
path = '/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/mistral.py'

with open(path) as f:
    content = f.read()

old = ('        if self.version >= 15:\n'
       '            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")')
new = ('        if self.version >= 15:\n'
       '            _re = kwargs.get("reasoning_effort")\n'
       '            if _re is not None:\n'
       '                try:\n'
       '                    # Import probes whether mistral_common supports reasoning efforts\n'
       '                    from mistral_common.protocol.instruct.messages import REASONING_EFFORTS  # noqa: F401\n'
       '                    version_kwargs["reasoning_effort"] = _re\n'
       '                except (ImportError, AttributeError):\n'
       '                    pass')

assert old in content, 'Pattern not found'

with open(path, 'w') as f:
    f.write(content.replace(old, new))
print('Patched successfully')
```
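Before baking the patch into an image, the script's replace-and-verify pattern can be sanity-checked on a toy string. This is an illustrative dry run with simplified stand-in snippets, not the real vLLM source:

```python
# Stand-ins for the patch's search and replacement snippets (simplified):
old = 'x = kwargs.get("reasoning_effort")'
new = ('_re = kwargs.get("reasoning_effort")\n'
       'if _re is not None:\n'
       '    x = _re')

content = 'def f(**kwargs):\n    ' + old + '\n'
assert old in content, 'Pattern not found'  # same guard the patch script uses
patched = content.replace(old, new)
print(old not in patched and new in patched)  # True
```

The `assert` guard is the important part: if a future vLLM release changes the surrounding code, the docker build fails loudly instead of silently producing an unpatched image.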
`Dockerfile.patch`:

```dockerfile
FROM vllm-node
COPY patch_mistral.py /tmp/patch_mistral.py
RUN python3 /tmp/patch_mistral.py
```

```bash
docker build -t vllm-node-patched -f Dockerfile.patch .
```
## Working serve command (no VLLM_MLA_DISABLE needed)

```bash
docker run \
  --name mistral-small-4 \
  --privileged --gpus all --rm \
  --network host --ipc=host \
  -v /path/to/Mistral-Small-4-119B-2603-NVFP4:/model \
  -v /home/$USER/flashinfer-cache:/root/.cache/flashinfer \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm-node-patched \
  vllm serve /model \
    --served-model-name mistral-small-4 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --host 0.0.0.0 --port 8005 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.75 \
    --max-model-len 40000 \
    --tool-call-parser mistral \
    --enable-auto-tool-choice
```
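With host networking, the server is reachable on port 8005 via vLLM's OpenAI-compatible API. A minimal stdlib-only client sketch; the model name and port match the serve flags above, everything else is illustrative:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "mistral-small-4",
                       max_tokens: int = 256) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8005/v1") -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Against a running server, `print(chat("Hello"))` should return a completion; tool calls arrive in the standard OpenAI `tool_calls` field since the server runs with `--enable-auto-tool-choice`.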
## MLA status

eugr's vLLM 0.17.2rc1 fixed the MLA head_size=320 rejection on SM121 that blocked avarok v23. Confirmed by the startup log:

```
Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].
Using FlashAttention prefill for MLA
```

`VLLM_MLA_DISABLE=1` is no longer needed.
## Benchmark comparison (7-run warm average, 1000 tokens, same prompt)
| Setup | Image | MLA | mistral_common | Avg tok/s |
|---|---|---|---|---|
| Original guide | avarok v23 + upgraded mistral_common | ❌ Disabled | Manual upgrade required | 27.7 |
| This approach | eugr vllm-node-patched | ✅ Enabled | 1.10.0 (included) | 28.0 |
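For scale, the gap between the two table averages works out to about 1%, well within run-to-run noise. A small sketch of the arithmetic (the `warm_average` helper is hypothetical; only the two averages come from the table above):

```python
def warm_average(tok_per_s: list[float]) -> float:
    """Average throughput over already-warm runs (cold run excluded upstream)."""
    return sum(tok_per_s) / len(tok_per_s)

# Averages from the benchmark table:
avarok_v23 = 27.7
eugr_patched = 28.0
delta_pct = (eugr_patched - avarok_v23) / avarok_v23 * 100
print(f"{delta_pct:.1f}% faster with MLA enabled")  # 1.1% faster with MLA enabled
```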
There is no performance regression from enabling MLA. eugr's image is also simpler to operate: no manual mistral_common upgrade inside the container, just the one-time image patch.
## Summary

eugr's image is the cleaner path forward. The reasoning_effort patch is a temporary workaround until PR #37081 lands in main; after that, a simple rebuild with `./build-and-copy.sh` will include the fix natively.