I am impatient.
In spark-vllm-docker I added the mistral-common upgrade to the Dockerfile. I had already built the image and hit the reasoning-template error, so from there I overwrote vllm/tokenizers/mistral.py in the image with a copy from a fresh vllm clone that had the diff below applied.
FROM vllm-node:latest
COPY ./mistral.py /usr/local/lib/python3.12/dist-packages/vllm/tokenizers/mistral.py
diff --git a/vllm/tokenizers/mistral.py b/vllm/tokenizers/mistral.py
index e20f1edd4..e8291ccf2 100644
--- a/vllm/tokenizers/mistral.py
+++ b/vllm/tokenizers/mistral.py
@@ -428,16 +428,36 @@ class MistralTokenizer(TokenizerLike):
truncation = kwargs.get("truncation", False)
max_length = kwargs.get("max_length")
- version_kwargs = {}
- # NOTE: This is for backward compatibility.
- # Transformers should be passed arguments it knows.
- if self.version >= 15:
- version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")
+ # Extract reasoning_effort before passing to transformers, which
+ # does not accept it as a kwarg. Instead, we pass it directly to
+ # mistral-common's ChatCompletionRequest via from_openai().
+ reasoning_effort = kwargs.get("reasoning_effort")
messages, tools = _prepare_apply_chat_template_tools_and_messages(
messages, tools, continue_final_message, add_generation_prompt
)
+ # When reasoning_effort is set, bypass transformers' apply_chat_template
+ # (which rejects unknown kwargs) and use mistral-common directly.
+ if reasoning_effort is not None and self.version >= 15:
+ request_kwargs: dict[str, Any] = {
+ "reasoning_effort": reasoning_effort,
+ }
+ chat_request = MistralChatCompletionRequest.from_openai(
+ messages=messages,
+ tools=tools,
+ continue_final_message=continue_final_message,
+ **request_kwargs,
+ )
+ encoded = self.mistral.encode_chat_completion(chat_request)
+ result: str | list[int] = encoded.tokens if tokenize else encoded.text
+
+ if tokenize and truncation and max_length is not None:
+ assert isinstance(result, list)
+ result = result[:max_length]
+
+ return result # type: ignore[return-value]
+
return self.transformers_tokenizer.apply_chat_template(
conversation=messages,
tools=tools,
@@ -448,7 +468,6 @@ class MistralTokenizer(TokenizerLike):
max_length=max_length,
return_tensors=None,
return_dict=False,
- **version_kwargs,
)
def decode(
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_TEST_FORCE_FP8_MARLIN=1
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 150000 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --reasoning-parser mistral \
  --enable-auto-tool-choice \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.9
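To exercise the patched path from a client, the chat-completions body just needs `reasoning_effort` alongside the usual fields. A hedged sketch of the payload (placing the field at the top level of the body is my assumption about how the server forwards extra kwargs; URL and model are from my setup):

```python
import json

# Hypothetical client-side request body for POST /v1/chat/completions.
# reasoning_effort at the top level is an assumption, not confirmed API.
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
    "messages": [{"role": "user", "content": "Plan a 3-step rollout."}],
    "reasoning_effort": "high",
    "max_tokens": 512,
}
body = json.dumps(payload)
print("reasoning_effort" in json.loads(body))  # → True
```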
Performance is peaking around 30 tps during text generation. I'll try to get a real benchmark run in tonight.
vllm-server | (APIServer pid=1) INFO 03-19 21:36:45 [loggers.py:259] Engine 000: Avg prompt throughput: 69.7 tokens/s, Avg generation throughput: 3.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.9%
vllm-server | (APIServer pid=1) INFO 03-19 21:36:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.9%
vllm-server | (APIServer pid=1) INFO: 10.0.1.172:57980 - "GET /v1/models HTTP/1.1" 200 OK
vllm-server | (APIServer pid=1) INFO: 10.0.1.172:57988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-server | (APIServer pid=1) INFO 03-19 21:37:25 [loggers.py:259] Engine 000: Avg prompt throughput: 2.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 12.9%
vllm-server | (APIServer pid=1) INFO 03-19 21:37:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 12.9%
vllm-server | (APIServer pid=1) INFO: 10.0.1.172:53372 - "GET /v1/models HTTP/1.1" 200 OK
vllm-server | (APIServer pid=1) INFO: 10.0.1.172:53386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-server | (APIServer pid=1) INFO 03-19 21:37:45 [loggers.py:259] Engine 000: Avg prompt throughput: 45.1 tokens/s, Avg generation throughput: 19.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 24.5%
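Until I get a proper benchmark, the generation numbers can be scraped straight out of these logger lines (the regex assumes the `loggers.py` format shown above stays stable):

```python
import re

# Pull "Avg generation throughput" values out of vLLM APIServer log lines.
GEN_RE = re.compile(r"Avg generation throughput: ([\d.]+) tokens/s")

log = """
vllm-server | (APIServer pid=1) INFO 03-19 21:37:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs
vllm-server | (APIServer pid=1) INFO 03-19 21:37:45 [loggers.py:259] Engine 000: Avg prompt throughput: 45.1 tokens/s, Avg generation throughput: 19.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs
"""

rates = [float(m) for m in GEN_RE.findall(log)]
peak = max(r for r in rates if r > 0)
print(peak)  # → 30.6
```

Note the averages are over 10-second windows, so idle windows drag them down; the per-window peak is the closer proxy for steady-state decode speed.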