I'm sorry for the confusion in my earlier reply. I've now done a proper comparison, and here are the results.
# Testing eugr/spark-vllm-docker vs avarok v23 with Mistral Small 4 119B NVFP4
## Previous working setup (avarok v23)

The original guide used `avarok/dgx-vllm-nvfp4-kernel:v23` with two required workarounds:

- Manual `pip install --upgrade mistral_common` inside the container (for tokenizer v15 support)
- `VLLM_MLA_DISABLE=1` environment variable (MLA backends rejected head_size=320 on SM121)
## Build eugr's image

```bash
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker
./build-and-copy.sh
```
The build completes in ~10 minutes on a single Spark using precompiled wheels. The "No host specified, skipping copy" message at the end is normal for single-node use; the image is available locally as `vllm-node`.
## Issue: reasoning_effort bug in vLLM 0.17.2rc1

vLLM 0.17.2rc1 (in eugr's current build) unconditionally passes `reasoning_effort` to `apply_chat_template`, but mistral_common 1.10.0 doesn't support it, so every inference request fails with:

```
Kwargs ['reasoning_effort'] are not supported by MistralCommonTokenizer.apply_chat_template
```
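The failure mode is a kwarg-capability mismatch: the caller forwards a keyword argument that the installed library version doesn't accept. As a generic illustration of one way to guard against this (not vLLM's actual code; the function names below are hypothetical stand-ins), you can probe the callee's signature before forwarding the kwarg:

```python
import inspect

def supports_kwarg(func, name: str) -> bool:
    """True if func accepts keyword `name`, explicitly or via **kwargs."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

# Hypothetical stand-ins for the pre- and post-fix apply_chat_template:
def old_apply_chat_template(messages, tools=None):
    return "rendered"

def new_apply_chat_template(messages, tools=None, reasoning_effort=None):
    return "rendered"

print(supports_kwarg(old_apply_chat_template, "reasoning_effort"))  # False
print(supports_kwarg(new_apply_chat_template, "reasoning_effort"))  # True
```

The patch below takes a different but related approach: instead of inspecting the caller's signature, it probes whether mistral_common exposes the relevant symbol.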
This is the bug PR #37081 fixes, but that PR doesn't apply cleanly to current main (too many code changes since it was opened).
## Workaround: one-line patch

Create a patched image with the fix baked in.

`patch_mistral.py`:
```python
# Patch vLLM's Mistral tokenizer wrapper so reasoning_effort is only
# forwarded when the installed mistral_common actually supports it.
path = '/usr/local/lib/python3.12/dist-packages/vllm/tokenizers/mistral.py'

with open(path) as f:
    content = f.read()

old = ('        if self.version >= 15:\n'
       '            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")')
new = ('        if self.version >= 15:\n'
       '            _re = kwargs.get("reasoning_effort")\n'
       '            if _re is not None:\n'
       '                try:\n'
       '                    # Import probes whether mistral_common supports reasoning efforts\n'
       '                    from mistral_common.protocol.instruct.messages import REASONING_EFFORTS  # noqa: F401\n'
       '                    version_kwargs["reasoning_effort"] = _re\n'
       '                except (ImportError, AttributeError):\n'
       '                    pass')

assert old in content, 'Pattern not found'

with open(path, 'w') as f:
    f.write(content.replace(old, new))
print('Patched successfully')
```
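Before baking the patch into an image, the script's replace-and-verify pattern can be sanity-checked on a toy string. This is an illustrative dry run with simplified stand-in snippets, not the real vLLM source:

```python
# Stand-ins for the patch's search and replacement snippets (simplified):
old = 'x = kwargs.get("reasoning_effort")'
new = ('_re = kwargs.get("reasoning_effort")\n'
       'if _re is not None:\n'
       '    x = _re')

content = 'def f(**kwargs):\n    ' + old + '\n'
assert old in content, 'Pattern not found'  # same guard the patch script uses
patched = content.replace(old, new)
print(old not in patched and new in patched)  # True
```

The `assert` guard is the important part: if a future vLLM release changes the surrounding code, the docker build fails loudly instead of silently producing an unpatched image.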
`Dockerfile.patch`:

```dockerfile
FROM vllm-node
COPY patch_mistral.py /tmp/patch_mistral.py
RUN python3 /tmp/patch_mistral.py
```

```bash
docker build -t vllm-node-patched -f Dockerfile.patch .
```
## Working serve command (no VLLM_MLA_DISABLE needed)

```bash
docker run \
  --name mistral-small-4 \
  --privileged --gpus all --rm \
  --network host --ipc=host \
  -v /path/to/Mistral-Small-4-119B-2603-NVFP4:/model \
  -v /home/$USER/flashinfer-cache:/root/.cache/flashinfer \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_TEST_FORCE_FP8_MARLIN=1 \
  vllm-node-patched \
  vllm serve /model \
    --served-model-name mistral-small-4 \
    --tokenizer-mode mistral \
    --config-format mistral \
    --load-format mistral \
    --host 0.0.0.0 --port 8005 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.75 \
    --max-model-len 40000 \
    --tool-call-parser mistral \
    --enable-auto-tool-choice
```
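With host networking, the server is reachable on port 8005 via vLLM's OpenAI-compatible API. A minimal stdlib-only client sketch; the model name and port match the serve flags above, everything else is illustrative:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "mistral-small-4",
                       max_tokens: int = 256) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8005/v1") -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Against a running server, `print(chat("Hello"))` should return a completion; tool calls arrive in the standard OpenAI `tool_calls` field since the server runs with `--enable-auto-tool-choice`.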
## MLA status

eugr's vLLM 0.17.2rc1 fixed the MLA head_size=320 rejection on SM121 that blocked avarok v23. Confirmed by the startup log:

```
Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA'].
Using FlashAttention prefill for MLA
```

`VLLM_MLA_DISABLE=1` is no longer needed.
## Benchmark comparison (7-run warm average, 1000 tokens, same prompt)
| Setup | Image | MLA | mistral_common | Avg tok/s |
|---|---|---|---|---|
| Original guide | avarok v23 + upgraded mistral_common | ❌ Disabled | Manual upgrade required | 27.7 |
| This approach | eugr vllm-node-patched | ✅ Enabled | 1.10.0 (included) | 28.0 |
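For scale, the gap between the two table averages works out to about 1%, well within run-to-run noise. A small sketch of the arithmetic (the `warm_average` helper is hypothetical; only the two averages come from the table above):

```python
def warm_average(tok_per_s: list[float]) -> float:
    """Average throughput over already-warm runs (cold run excluded upstream)."""
    return sum(tok_per_s) / len(tok_per_s)

# Averages from the benchmark table:
avarok_v23 = 27.7
eugr_patched = 28.0
delta_pct = (eugr_patched - avarok_v23) / avarok_v23 * 100
print(f"{delta_pct:.1f}% faster with MLA enabled")  # 1.1% faster with MLA enabled
```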
There is no performance regression from enabling MLA. eugr's image is also simpler to operate: no manual mistral_common upgrade inside the container, just the one-time image patch.
## Summary

eugr's image is the cleaner path forward. The reasoning_effort patch is a temporary workaround until PR #37081 lands in main; after that, a simple rebuild with `./build-and-copy.sh` will include the fix natively.