Running Mistral Small 4 119B NVFP4 on NVIDIA DGX Spark (GB10)

I am impatient.

In spark-vllm-docker I added a mistral-common upgrade to the Dockerfile. I had already built the image and hit the reasoning template error, so from there I overwrote vllm/tokenizers/mistral.py inside the image with the diff below, applied against a fresh vllm clone.

FROM vllm-node:latest

COPY ./mistral.py /usr/local/lib/python3.12/dist-packages/vllm/tokenizers/mistral.py
 
diff --git a/vllm/tokenizers/mistral.py b/vllm/tokenizers/mistral.py
index e20f1edd4..e8291ccf2 100644
--- a/vllm/tokenizers/mistral.py
+++ b/vllm/tokenizers/mistral.py
@@ -428,16 +428,36 @@ class MistralTokenizer(TokenizerLike):
         truncation = kwargs.get("truncation", False)
         max_length = kwargs.get("max_length")
 
-        version_kwargs = {}
-        # NOTE: This is for backward compatibility.
-        # Transformers should be passed arguments it knows.
-        if self.version >= 15:
-            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")
+        # Extract reasoning_effort before passing to transformers, which
+        # does not accept it as a kwarg. Instead, we pass it directly to
+        # mistral-common's ChatCompletionRequest via from_openai().
+        reasoning_effort = kwargs.get("reasoning_effort")
 
         messages, tools = _prepare_apply_chat_template_tools_and_messages(
             messages, tools, continue_final_message, add_generation_prompt
         )
 
+        # When reasoning_effort is set, bypass transformers' apply_chat_template
+        # (which rejects unknown kwargs) and use mistral-common directly.
+        if reasoning_effort is not None and self.version >= 15:
+            request_kwargs: dict[str, Any] = {
+                "reasoning_effort": reasoning_effort,
+            }
+            chat_request = MistralChatCompletionRequest.from_openai(
+                messages=messages,
+                tools=tools,
+                continue_final_message=continue_final_message,
+                **request_kwargs,
+            )
+            encoded = self.mistral.encode_chat_completion(chat_request)
+            result: str | list[int] = encoded.tokens if tokenize else encoded.text
+
+            if tokenize and truncation and max_length is not None:
+                assert isinstance(result, list)
+                result = result[:max_length]
+
+            return result  # type: ignore[return-value]
+
         return self.transformers_tokenizer.apply_chat_template(
             conversation=messages,
             tools=tools,
@@ -448,7 +468,6 @@ class MistralTokenizer(TokenizerLike):
             max_length=max_length,
             return_tensors=None,
             return_dict=False,
-            **version_kwargs,
         )
 
     def decode(
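The heart of the patch is a common pattern: pop the kwarg that the downstream API rejects before delegating, and only fall through to the generic path when it is unset. Here is a minimal standalone sketch of that pattern — the function names and the fake "tokenization" are illustrative stand-ins, not vLLM's actual code:

```python
from typing import Any


def strict_apply(messages: list[dict[str, str]], **kwargs: Any) -> list[int]:
    """Stand-in for transformers' apply_chat_template: any unknown kwarg
    raises TypeError, which is exactly the failure the patch works around."""
    if kwargs:
        raise TypeError(f"unexpected kwargs: {sorted(kwargs)}")
    # Fake "tokenization": one token id per character, across all messages.
    return [ord(c) for m in messages for c in m["content"]]


def apply_chat_template(messages: list[dict[str, str]], **kwargs: Any) -> list[int]:
    # 1. Extract the unsupported kwargs *before* delegating.
    reasoning_effort = kwargs.pop("reasoning_effort", None)
    truncation = kwargs.pop("truncation", False)
    max_length = kwargs.pop("max_length", None)

    if reasoning_effort is not None:
        # 2. Dedicated path (the real patch calls mistral-common's
        #    encode_chat_completion here); a marker token stands in for
        #    the effort-dependent encoding.
        tokens = [0] + [ord(c) for m in messages for c in m["content"]]
    else:
        # 3. Default path: the strict callee never sees reasoning_effort.
        tokens = strict_apply(messages, **kwargs)

    if truncation and max_length is not None:
        tokens = tokens[:max_length]  # same post-hoc truncation as the patch
    return tokens
```

Calling strict_apply with reasoning_effort directly raises TypeError; routing through the wrapper does not, which mirrors why the diff bypasses transformers' apply_chat_template when the kwarg is set.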

VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_TEST_FORCE_FP8_MARLIN=1
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 150000 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.9
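With the patch in place, reasoning_effort can be sent as a top-level field on the chat completions request. A client-side sketch of the payload (the model name is from the serve command above; the "low" value and the localhost URL are assumptions — check mistral-common for the accepted effort values):

```python
import json

# Payload for POST /v1/chat/completions on the local vLLM server
# (e.g. http://localhost:8000/v1/chat/completions).
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
    "messages": [
        {"role": "user", "content": "Explain NVFP4 in one paragraph."}
    ],
    # Forwarded by the patched tokenizer to mistral-common's
    # ChatCompletionRequest; "low" is an assumed example value.
    "reasoning_effort": "low",
    "max_tokens": 512,
}

body = json.dumps(payload)
```

You can POST this body with curl or the OpenAI Python client pointed at the local server; without the tokenizer patch, the same field trips the reasoning template error.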

Performance is peaking around 30 tps during text generation. I hope to get a real benchmark run in tonight.

vllm-server  | (APIServer pid=1) INFO 03-19 21:36:45 [loggers.py:259] Engine 000: Avg prompt throughput: 69.7 tokens/s, Avg generation throughput: 3.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.9%
vllm-server  | (APIServer pid=1) INFO 03-19 21:36:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.9%
vllm-server  | (APIServer pid=1) INFO:     10.0.1.172:57980 - "GET /v1/models HTTP/1.1" 200 OK
vllm-server  | (APIServer pid=1) INFO:     10.0.1.172:57988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-server  | (APIServer pid=1) INFO 03-19 21:37:25 [loggers.py:259] Engine 000: Avg prompt throughput: 2.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 12.9%
vllm-server  | (APIServer pid=1) INFO 03-19 21:37:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 12.9%
vllm-server  | (APIServer pid=1) INFO:     10.0.1.172:53372 - "GET /v1/models HTTP/1.1" 200 OK
vllm-server  | (APIServer pid=1) INFO:     10.0.1.172:53386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-server  | (APIServer pid=1) INFO 03-19 21:37:45 [loggers.py:259] Engine 000: Avg prompt throughput: 45.1 tokens/s, Avg generation throughput: 19.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 24.5%