Running Mistral Small 4 119B NVFP4 on NVIDIA DGX Spark (GB10)

I am impatient.

In spark-vllm-docker I added a mistral-common upgrade to the Dockerfile. I had already built the image and hit the reasoning template error, so from there I overwrote vllm/tokenizers/mistral.py inside the image with the diff below, applied against a fresh vllm clone.

FROM vllm-node:latest

COPY ./mistral.py /usr/local/lib/python3.12/dist-packages/vllm/tokenizers/mistral.py
 
diff --git a/vllm/tokenizers/mistral.py b/vllm/tokenizers/mistral.py
index e20f1edd4..e8291ccf2 100644
--- a/vllm/tokenizers/mistral.py
+++ b/vllm/tokenizers/mistral.py
@@ -428,16 +428,36 @@ class MistralTokenizer(TokenizerLike):
         truncation = kwargs.get("truncation", False)
         max_length = kwargs.get("max_length")
 
-        version_kwargs = {}
-        # NOTE: This is for backward compatibility.
-        # Transformers should be passed arguments it knows.
-        if self.version >= 15:
-            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")
+        # Extract reasoning_effort before passing to transformers, which
+        # does not accept it as a kwarg. Instead, we pass it directly to
+        # mistral-common's ChatCompletionRequest via from_openai().
+        reasoning_effort = kwargs.get("reasoning_effort")
 
         messages, tools = _prepare_apply_chat_template_tools_and_messages(
             messages, tools, continue_final_message, add_generation_prompt
         )
 
+        # When reasoning_effort is set, bypass transformers' apply_chat_template
+        # (which rejects unknown kwargs) and use mistral-common directly.
+        if reasoning_effort is not None and self.version >= 15:
+            request_kwargs: dict[str, Any] = {
+                "reasoning_effort": reasoning_effort,
+            }
+            chat_request = MistralChatCompletionRequest.from_openai(
+                messages=messages,
+                tools=tools,
+                continue_final_message=continue_final_message,
+                **request_kwargs,
+            )
+            encoded = self.mistral.encode_chat_completion(chat_request)
+            result: str | list[int] = encoded.tokens if tokenize else encoded.text
+
+            if tokenize and truncation and max_length is not None:
+                assert isinstance(result, list)
+                result = result[:max_length]
+
+            return result  # type: ignore[return-value]
+
         return self.transformers_tokenizer.apply_chat_template(
             conversation=messages,
             tools=tools,
@@ -448,7 +468,6 @@ class MistralTokenizer(TokenizerLike):
             max_length=max_length,
             return_tensors=None,
             return_dict=False,
-            **version_kwargs,
         )
 
     def decode(
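The heart of the patch is a common pattern: pop the kwarg that the downstream API rejects before delegating, and only fall through to the generic path when it is unset. Here is a minimal standalone sketch of that pattern — the function names and the fake "tokenization" are illustrative stand-ins, not vLLM's actual code:

```python
from typing import Any


def strict_apply(messages: list[dict[str, str]], **kwargs: Any) -> list[int]:
    """Stand-in for transformers' apply_chat_template: any unknown kwarg
    raises TypeError, which is exactly the failure the patch works around."""
    if kwargs:
        raise TypeError(f"unexpected kwargs: {sorted(kwargs)}")
    # Fake "tokenization": one token id per character, across all messages.
    return [ord(c) for m in messages for c in m["content"]]


def apply_chat_template(messages: list[dict[str, str]], **kwargs: Any) -> list[int]:
    # 1. Extract the unsupported kwargs *before* delegating.
    reasoning_effort = kwargs.pop("reasoning_effort", None)
    truncation = kwargs.pop("truncation", False)
    max_length = kwargs.pop("max_length", None)

    if reasoning_effort is not None:
        # 2. Dedicated path (the real patch calls mistral-common's
        #    encode_chat_completion here); a marker token stands in for
        #    the effort-dependent encoding.
        tokens = [0] + [ord(c) for m in messages for c in m["content"]]
    else:
        # 3. Default path: the strict callee never sees reasoning_effort.
        tokens = strict_apply(messages, **kwargs)

    if truncation and max_length is not None:
        tokens = tokens[:max_length]  # same post-hoc truncation as the patch
    return tokens
```

Calling strict_apply with reasoning_effort directly raises TypeError; routing through the wrapper does not, which mirrors why the diff bypasses transformers' apply_chat_template when the kwarg is set.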

VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_TEST_FORCE_FP8_MARLIN=1
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 150000 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.9
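With the patch in place, reasoning_effort can be sent as a top-level field on the chat completions request. A client-side sketch of the payload (the model name is from the serve command above; the "low" value and the localhost URL are assumptions — check mistral-common for the accepted effort values):

```python
import json

# Payload for POST /v1/chat/completions on the local vLLM server
# (e.g. http://localhost:8000/v1/chat/completions).
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
    "messages": [
        {"role": "user", "content": "Explain NVFP4 in one paragraph."}
    ],
    # Forwarded by the patched tokenizer to mistral-common's
    # ChatCompletionRequest; "low" is an assumed example value.
    "reasoning_effort": "low",
    "max_tokens": 512,
}

body = json.dumps(payload)
```

You can POST this body with curl or the OpenAI Python client pointed at the local server; without the tokenizer patch, the same field trips the reasoning template error.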

Performance is peaking around 30 tps during text generation. I hope to get a real benchmark run in tonight.

vllm-server  | (APIServer pid=1) INFO 03-19 21:36:45 [loggers.py:259] Engine 000: Avg prompt throughput: 69.7 tokens/s, Avg generation throughput: 3.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.9%
vllm-server  | (APIServer pid=1) INFO 03-19 21:36:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.9%
vllm-server  | (APIServer pid=1) INFO:     10.0.1.172:57980 - "GET /v1/models HTTP/1.1" 200 OK
vllm-server  | (APIServer pid=1) INFO:     10.0.1.172:57988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-server  | (APIServer pid=1) INFO 03-19 21:37:25 [loggers.py:259] Engine 000: Avg prompt throughput: 2.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 12.9%
vllm-server  | (APIServer pid=1) INFO 03-19 21:37:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 12.9%
vllm-server  | (APIServer pid=1) INFO:     10.0.1.172:53372 - "GET /v1/models HTTP/1.1" 200 OK
vllm-server  | (APIServer pid=1) INFO:     10.0.1.172:53386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-server  | (APIServer pid=1) INFO 03-19 21:37:45 [loggers.py:259] Engine 000: Avg prompt throughput: 45.1 tokens/s, Avg generation throughput: 19.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 24.5%