Running Mistral Small 4 119B NVFP4 on NVIDIA DGX Spark (GB10)

This is fantastic for getting it up and running, but do I understand it right that this essentially kills reasoning on this model until support lands in mistral_common? While this model is nice and fast, it is a bit dumb sometimes and shoots from the hip far too often.

If you use the patch from drew22, then reasoning should work too.

Well, I gave the Eagle speculative decoding config a shot. At the moment it doesn't appear to be compatible with vLLM: mistralai/Mistral-Small-4-119B-2603-eagle · vLLM does not load the eagle head.

Still hoping someone wants to take on autotune for this model on the DGX Spark. Any takers?

Would you mind sharing quick steps you took to get this all running?

NVFP4 should be supported in eugr's TurboQuant image now. There's a big Mistral PR that also needs to be merged; it appears earlier in the thread. Mistral is really, really fast but not nearly as accurate as Nemotron 3 Super or Qwen 3.5-122B, unfortunately.

Did I miss eugr's TurboQuant image? Happy to swap ASAP.


No, to my knowledge it does not exist.

Thanks, got it running with the instructions in the thread. Speed was unimpressive on two nodes.


So I can't seem to stop Mistral TOOL_CALLS from leaking with the setup I described above. I have been watching the PRs by juliendenize, and [Mistral Grammar] Fix tool and reasoning parsing by juliendenize · Pull Request #39217 · vllm-project/vllm · GitHub seems to be the last PR needed to resolve this. However, building with @eugr's project via `./build-and-copy.sh --apply-vllm-pr 39217` fails:

```
ERROR: failed to build: failed to solve: process "/bin/sh -c curl -fsL https://patch-diff.githubusercontent.com/raw/vllm-project/vllm/pull/35568.diff -o pr35568.diff     && if git apply --reverse --check pr35568.diff 2>/dev/null; then          echo \"PR 35568 already applied, skipping.\";        else          echo \"Applying PR 35568...\";          git apply -v pr35568.diff;        fi     && rm pr35568.diff" did not complete successfully: exit code: 1
vLLM build failed — restoring previous wheels...
```
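For anyone debugging this class of failure: the Dockerfile step above is just `git apply` on a downloaded PR diff, and it exits non-zero when the diff no longer applies to the checked-out vLLM commit. A minimal self-contained sketch of the same mechanics (sandbox repo, not the real vLLM tree) that you can use to pre-check a diff before rebuilding:

```shell
# Reproduce the failing Dockerfile step in a throwaway repo:
# git apply --check reports whether a diff applies without touching the tree.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
printf 'hello\n' > f.txt
git add f.txt
git -c user.email=t@t -c user.name=t commit -qm init

printf 'world\n' > f.txt
git diff > pr.diff            # stand-in for the downloaded prNNNNN.diff
git checkout -q -- f.txt      # back to a clean tree

if git apply --check pr.diff; then echo "diff applies cleanly"; fi
git apply pr.diff             # this is the step that exits non-zero on conflict
grep -q world f.txt && echo "diff applied"
```

Running `git apply --check` against your actual vLLM checkout with the real PR diff tells you whether the build will fail before you spend time on the image build.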

Has anyone else been able to get Mistral running with Tool calls appropriately parsed?

```
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 160000 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max_num_batched_tokens 16384 \
  --max_num_seqs 8 \
  --gpu_memory_utilization 0.9 \
  --cudagraph-capture-sizes 1 2 4 8 16 32 64 128 256 \
  --max-cudagraph-capture-size 256
```

I tried the hermes tool-call-parser but get Tekkenizer errors.

I've also had the same issues with the tool-call parser for other Mistral models in the past.
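To make the symptom concrete: "leaking" here means the raw `[TOOL_CALLS]` control marker shows up in the assistant message `content` instead of being consumed by the parser and turned into structured `tool_calls`. A small hypothetical client-side check for that (the helper name and message shapes are made up for illustration):

```python
import re

# The literal marker the mistral parser is supposed to consume.
LEAK_RE = re.compile(r"\[TOOL_CALLS\]")

def tool_calls_leaked(message: dict) -> bool:
    """Return True if raw tool-call markup leaked into message['content']."""
    content = message.get("content") or ""
    return bool(LEAK_RE.search(content))

# Leaky: the marker and JSON payload ended up in plain content.
leaky = {"content": '[TOOL_CALLS][{"name": "get_weather", "arguments": {}}]'}
# Clean: content is prose and the call landed in the structured field.
clean = {"content": "The weather is sunny.",
         "tool_calls": [{"function": {"name": "get_weather"}}]}

print(tool_calls_leaked(leaky), tool_calls_leaked(clean))  # True False
```

Dropping a check like this into your client makes it easy to tell whether a parser PR actually fixed the leak or just moved it.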


OK, I pulled the latest spark-vllm-docker and was able to build with PR 39217. There was still an issue with tokenizer/mistral.py, so I added the following:

```python
# NOTE: This is for backward compatibility.
# Transformers should be passed arguments it knows.
if self.version >= 15:
    reasoning_effort = kwargs.get("reasoning_effort")
    if reasoning_effort is not None:
        version_kwargs["reasoning_effort"] = reasoning_effort
```
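The idea behind that patch is a version-gated kwargs pass-through: only forward `reasoning_effort` when the tokenizer version is new enough to accept it, and drop it otherwise so older versions never see an unknown argument. A standalone sketch of the same logic (the function name is invented; the real code mutates `version_kwargs` inside a method):

```python
def build_version_kwargs(version: int, **kwargs) -> dict:
    """Forward only the kwargs this tokenizer version understands.

    Version 15+ (assumed, per the patch above) adds reasoning_effort;
    older versions get an empty dict so nothing unknown is passed through.
    """
    version_kwargs: dict = {}
    if version >= 15:
        reasoning_effort = kwargs.get("reasoning_effort")
        if reasoning_effort is not None:
            version_kwargs["reasoning_effort"] = reasoning_effort
    return version_kwargs

print(build_version_kwargs(15, reasoning_effort="high"))  # {'reasoning_effort': 'high'}
print(build_version_kwargs(13, reasoning_effort="high"))  # {}
```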

The model loaded, but on the first request I got this dump from vLLM, which is a first.

```
(EngineCore pid=159) ERROR 04-13 21:15:15 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1rc1.dev243+g995e9a209.d20260413) with config: model=‘mistralai/Mistral-Small-4-119B-2603-NVFP4’, speculative_config=None, tokenizer=‘mistralai/Mistral-Small-4-119B-2603-NVFP4’, skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=160000, download_dir=None, load_format=mistral, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘mistral’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=mistralai/Mistral-Small-4-119B-2603-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘/root/.cache/vllm/torch_compile_cache/111afa2bb9’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘ir_enable_torch_wrap’: True, ‘splitting_ops’: [‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, 
‘vllm::gdn_attention_core’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, ‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_images_per_batch’: 0, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [16384], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.PIECEWISE: 1>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 32, 64, 128, 256], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: True, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False}, ‘max_cudagraph_capture_size’: 256, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: ‘/root/.cache/vllm/torch_compile_cache/111afa2bb9/rank_0_0/backbone’, ‘fast_moe_cold_start’: False, ‘static_all_moe_layers’: [‘language_model.model.layers.0.mlp.experts’, ‘language_model.model.layers.1.mlp.experts’, ‘language_model.model.layers.2.mlp.experts’, ‘language_model.model.layers.3.mlp.experts’, ‘language_model.model.layers.4.mlp.experts’, ‘language_model.model.layers.5.mlp.experts’, ‘language_model.model.layers.6.mlp.experts’, ‘language_model.model.layers.7.mlp.experts’, ‘language_model.model.layers.8.mlp.experts’, ‘language_model.model.layers.9.mlp.experts’, ‘language_model.model.layers.10.mlp.experts’, ‘language_model.model.layers.11.mlp.experts’, ‘language_model.model.layers.12.mlp.experts’, 
‘language_model.model.layers.13.mlp.experts’, ‘language_model.model.layers.14.mlp.experts’, ‘language_model.model.layers.15.mlp.experts’, ‘language_model.model.layers.16.mlp.experts’, ‘language_model.model.layers.17.mlp.experts’, ‘language_model.model.layers.18.mlp.experts’, ‘language_model.model.layers.19.mlp.experts’, ‘language_model.model.layers.20.mlp.experts’, ‘language_model.model.layers.21.mlp.experts’, ‘language_model.model.layers.22.mlp.experts’, ‘language_model.model.layers.23.mlp.experts’, ‘language_model.model.layers.24.mlp.experts’, ‘language_model.model.layers.25.mlp.experts’, ‘language_model.model.layers.26.mlp.experts’, ‘language_model.model.layers.27.mlp.experts’, ‘language_model.model.layers.28.mlp.experts’, ‘language_model.model.layers.29.mlp.experts’, ‘language_model.model.layers.30.mlp.experts’, ‘language_model.model.layers.31.mlp.experts’, ‘language_model.model.layers.32.mlp.experts’, ‘language_model.model.layers.33.mlp.experts’, ‘language_model.model.layers.34.mlp.experts’, ‘language_model.model.layers.35.mlp.experts’]}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[‘native’]), enable_flashinfer_autotune=True, moe_backend=‘auto’),
(EngineCore pid=159) ERROR 04-13 21:15:15 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=, scheduled_cached_reqs=CachedRequestData(req_ids=[‘chatcmpl-a14dcb6cdadfad8c-90cf504f’],resumed_req_ids=set(),new_token_ids_lens=,all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[19],num_output_tokens=[1]), num_scheduled_tokens={chatcmpl-a14dcb6cdadfad8c-90cf504f: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[2], finished_req_ids=, free_encoder_mm_hashes=, preempted_req_ids=, has_structured_output_requests=true, pending_structured_output_tokens=true, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=159) ERROR 04-13 21:15:15 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=2.6725105564118223e-05, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=19, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=, spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] EngineCore encountered a fatal error.
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] Traceback (most recent call last):
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1103, in run_engine_core
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] engine_core.run_busy_loop()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1144, in run_busy_loop
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] self._process_engine_step()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1183, in _process_engine_step
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] outputs, model_executed = self.step_fn()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 453, in step_with_batch_queue
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] exec_future = self.model_executor.execute_model(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py”, line 114, in execute_model
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] output.result()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/lib/python3.12/concurrent/futures/_base.py”, line 449, in result
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.__get_result()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/lib/python3.12/concurrent/futures/_base.py”, line 401, in __get_result
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] raise self._exception
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py”, line 84, in collective_rpc
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py”, line 510, in run_method
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py”, line 332, in execute_model
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.worker.execute_model(scheduler_output)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 124, in decorate_context
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py”, line 808, in execute_model
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] output = self.model_runner.execute_model(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 124, in decorate_context
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 4038, in execute_model
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] model_output = self._model_forward(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 3519, in _model_forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.model(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py”, line 254, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.runnable(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self._call_impl(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in _call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return forward_call(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py”, line 431, in forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] hidden_states = self.language_model.model(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py”, line 480, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py”, line 224, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.fn(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py”, line 1228, in forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] def forward(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py”, line 211, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.optimized_call(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 949, in call_wrapped
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 461, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] raise e
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 447, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return super(self.cls, obj).call(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.call_impl(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return forward_call(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “<eval_with_key>.149”, line 377, in forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] submod_3 = self.submod_3(getitem_2, s59, getitem, getitem_1, getitem_3, synthetic_local_tmp_0
, submod_1); getitem_2 = getitem = getitem_1 = synthetic_local_tmp_0 = submod_1 = submod_3 = None
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 949, in call_wrapped
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 461, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] raise e
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 447, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return super(self.cls, obj).call(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.call_impl(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return forward_call(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “<eval_with_key>.152”, line 5, in forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output, synthetic_local_tmp_0
, kv_cache_dummy_dep = kv_cache_dummy_dep); q_1 = kv_c_normed = key_rot_1 = output = synthetic_local_tmp_0 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/_ops.py”, line 1269, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self._op(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py”, line 40, in wrapper
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py”, line 983, in unified_mla_attention_with_output
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] layer.forward_impl(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py”, line 698, in forward_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py”, line 196, in forward_mqa
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] decode_attention_fwd(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 762, in decode_attention_fwd
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] decode_attention_fwd_grouped(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 696, in decode_attention_fwd_grouped
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] _decode_grouped_att_m_fwd(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 500, in _decode_grouped_att_m_fwd
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] _fwd_grouped_kernel_stage1[grid](
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 370, in
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 720, in run
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 849, in _do_compile
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] kernel = self.compile(src, target=target, options=options.dict)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py”, line 304, in compile
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] module = src.make_ir(target, options, codegen_fns, module_map, context)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py”, line 80, in make_ir
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] triton.compiler.errors.CompilationError: at 152:12:
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] v = (v.to(tl.float32) * vs).to(q.dtype)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] else:
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] # MLA uses a single c_kv.
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] # loading the same c_kv to interpret it as v is not necessary.
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] # transpose the existing c_kv (aka k) for the dot product.
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] v = tl.trans(k)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112]
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] re_scale = tl.exp(e_max - n_e_max)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] p = tl.exp(qk - n_e_max[:, None])
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] acc *= re_scale[:, None]
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] acc += tl.dot(p.to(v.dtype), v)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ValueError(‘Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512’)
(EngineCore pid=159) Process EngineCore:
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] AsyncLLM output_handler failed.
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] Traceback (most recent call last):
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 657, in output_handler
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] outputs = await engine_core.get_output_async()
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 998, in get_output_async
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] raise self._format_exception(outputs) from None
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=159) Traceback (most recent call last):
(EngineCore pid=159) File “/usr/lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore pid=159) self.run()
(EngineCore pid=159) File “/usr/lib/python3.12/multiprocessing/process.py”, line 108, in run
(EngineCore pid=159) self._target(*self._args, **self._kwargs)
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1114, in run_engine_core
(EngineCore pid=159) raise e
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1103, in run_engine_core
(EngineCore pid=159) engine_core.run_busy_loop()
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1144, in run_busy_loop
(EngineCore pid=159) self._process_engine_step()
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1183, in _process_engine_step
(EngineCore pid=159) outputs, model_executed = self.step_fn()
(EngineCore pid=159) ^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 453, in step_with_batch_queue
(EngineCore pid=159) exec_future = self.model_executor.execute_model(
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py”, line 114, in execute_model
(EngineCore pid=159) output.result()
(EngineCore pid=159) File “/usr/lib/python3.12/concurrent/futures/_base.py”, line 449, in result
(EngineCore pid=159) return self.__get_result()
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/lib/python3.12/concurrent/futures/_base.py”, line 401, in __get_result
(EngineCore pid=159) raise self._exception
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py”, line 84, in collective_rpc
(EngineCore pid=159) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py”, line 510, in run_method
(EngineCore pid=159) return func(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py”, line 332, in execute_model
(EngineCore pid=159) return self.worker.execute_model(scheduler_output)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 124, in decorate_context
(EngineCore pid=159) return func(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py”, line 808, in execute_model
(EngineCore pid=159) output = self.model_runner.execute_model(
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 124, in decorate_context
(EngineCore pid=159) return func(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 4038, in execute_model
(EngineCore pid=159) model_output = self._model_forward(
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 3519, in _model_forward
(EngineCore pid=159) return self.model(
(EngineCore pid=159) ^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py”, line 254, in __call__
(EngineCore pid=159) return self.runnable(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) return self._call_impl(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in _call_impl
(EngineCore pid=159) return forward_call(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py”, line 431, in forward
(EngineCore pid=159) hidden_states = self.language_model.model(
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py”, line 480, in __call__
(EngineCore pid=159) return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py”, line 224, in __call__
(EngineCore pid=159) return self.fn(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py”, line 1228, in forward
(EngineCore pid=159) def forward(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py”, line 211, in __call__
(EngineCore pid=159) return self.optimized_call(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 949, in call_wrapped
(EngineCore pid=159) return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 461, in __call__
(EngineCore pid=159) raise e
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 447, in __call__
(EngineCore pid=159) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) return self._call_impl(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in _call_impl
(EngineCore pid=159) return forward_call(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “<eval_with_key>.149”, line 377, in forward
(EngineCore pid=159) submod_3 = self.submod_3(getitem_2, s59, getitem, getitem_1, getitem_3, synthetic_local_tmp_0, submod_1); getitem_2 = getitem = getitem_1 = synthetic_local_tmp_0 = submod_1 = submod_3 = None
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 949, in call_wrapped
(EngineCore pid=159) return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 461, in __call__
(EngineCore pid=159) raise e
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 447, in __call__
(EngineCore pid=159) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) return self._call_impl(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in _call_impl
(EngineCore pid=159) return forward_call(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “<eval_with_key>.152”, line 5, in forward
(EngineCore pid=159) unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output, synthetic_local_tmp_0, kv_cache_dummy_dep = kv_cache_dummy_dep); q_1 = kv_c_normed = key_rot_1 = output = synthetic_local_tmp_0 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/_ops.py”, line 1269, in __call__
(EngineCore pid=159) return self._op(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py”, line 40, in wrapper
(EngineCore pid=159) return func(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py”, line 983, in unified_mla_attention_with_output
(EngineCore pid=159) layer.forward_impl(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py”, line 698, in forward_impl
(EngineCore pid=159) attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py”, line 196, in forward_mqa
(EngineCore pid=159) decode_attention_fwd(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 762, in decode_attention_fwd
(EngineCore pid=159) decode_attention_fwd_grouped(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 696, in decode_attention_fwd_grouped
(EngineCore pid=159) _decode_grouped_att_m_fwd(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 500, in _decode_grouped_att_m_fwd
(EngineCore pid=159) _fwd_grouped_kernel_stage1[grid](
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 370, in <lambda>
(EngineCore pid=159) return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 720, in run
(EngineCore pid=159) kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 849, in _do_compile
(EngineCore pid=159) kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py”, line 304, in compile
(EngineCore pid=159) module = src.make_ir(target, options, codegen_fns, module_map, context)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py”, line 80, in make_ir
(EngineCore pid=159) return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) triton.compiler.errors.CompilationError: at 152:12:
(EngineCore pid=159) v = (v.to(tl.float32) * vs).to(q.dtype)
(EngineCore pid=159) else:
(EngineCore pid=159) # MLA uses a single c_kv.
(EngineCore pid=159) # loading the same c_kv to interpret it as v is not necessary.
(EngineCore pid=159) # transpose the existing c_kv (aka k) for the dot product.
(EngineCore pid=159) v = tl.trans(k)
(EngineCore pid=159)
(EngineCore pid=159) n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(EngineCore pid=159) re_scale = tl.exp(e_max - n_e_max)
(EngineCore pid=159) p = tl.exp(qk - n_e_max[:, None])
(EngineCore pid=159) acc *= re_scale[:, None]
(EngineCore pid=159) acc += tl.dot(p.to(v.dtype), v)
(EngineCore pid=159) ^
(EngineCore pid=159) ValueError(‘Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512’)
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] Error in chat completion stream generator.
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] Traceback (most recent call last):
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py”, line 602, in chat_completion_stream_generator
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] async for res in result_generator:
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 576, in generate
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] out = q.get_nowait() or await q.get()
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] ^^^^^^^^^^^^^
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py”, line 85, in get
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] raise output
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 657, in output_handler
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] outputs = await engine_core.get_output_async()
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 998, in get_output_async
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] raise self._format_exception(outputs) from None
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=80) INFO: 10.0.1.172:60796 - “POST /v1/chat/completions HTTP/1.1” 500 Internal Server Error
```

Good news everyone!

I have a build of vLLM made with eugr/spark-vllm-docker (https://github.com/eugr/spark-vllm-docker) that works with https://github.com/vllm-project/vllm/pull/39217. This one resolves the tool-calling issues I was seeing in Opencode.

If anyone wants to just use my image you can:

```
docker pull androiddrew/mistral4-vllm-spark:26-04-14
```

Two additional fixes were needed on top of applying https://github.com/vllm-project/vllm/pull/39217 via `./build-and-copy.sh --apply-vllm-pr 39217`:

```diff
diff --git a/vllm/v1/attention/ops/triton_decode_attention.py b/vllm/v1/attention/ops/triton_decode_attention.py
index 8118db0da..347dfcc07 100644
--- a/vllm/v1/attention/ops/triton_decode_attention.py
+++ b/vllm/v1/attention/ops/triton_decode_attention.py
@@ -467,7 +467,14 @@ def _decode_grouped_att_m_fwd(
     if is_hip_ and Lk >= 576:
         BLOCK = 16
 
-    if Lk == 576:
+    if is_mla and Lk > Lv:
+        # MLA: KV cache stores [c_kv || k_pe] concatenated.
+        # Split into nope (BLOCK_DMODEL = kv_lora_rank) and rope (BLOCK_DPE)
+        # so the kernel loads them separately and v = trans(k_nope) matches
+        # the accumulator dimension (BLOCK_DV).
+        BLOCK_DMODEL = triton.next_power_of_2(Lv)
+        BLOCK_DPE = triton.next_power_of_2(Lk - Lv)
+    elif Lk == 576:
         BLOCK_DMODEL = 512
         BLOCK_DPE = 64
     elif Lk == 288:
```
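The Triton patch above generalizes the hard-coded head-dim split. The split it computes can be sketched in plain Python (a sketch only; `next_power_of_2` below is a local stand-in for `triton.next_power_of_2`, and `split_mla_head_dim` is a hypothetical helper name, not vLLM code):

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (mirrors triton.next_power_of_2)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def split_mla_head_dim(Lk: int, Lv: int) -> tuple[int, int]:
    # MLA caches [c_kv || k_pe] concatenated; Lv is the c_kv ("nope") width,
    # Lk - Lv is the rotary ("rope") tail. Splitting them keeps the
    # v = trans(k_nope) dot product at the accumulator width (BLOCK_DV).
    block_dmodel = next_power_of_2(Lv)       # nope part
    block_dpe = next_power_of_2(Lk - Lv)     # rope part
    return block_dmodel, block_dpe

# The crash above came from Lk=576, Lv=512 falling into a path that
# mismatched 256 vs 512; the split restores the 512/64 layout.
print(split_mla_head_dim(576, 512))  # -> (512, 64)
```

With the generic split, any future MLA variant where `Lk > Lv` takes the same path instead of needing another hard-coded `elif`.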

and

```diff
diff --git a/vllm/tokenizers/mistral.py b/vllm/tokenizers/mistral.py
index 3a79fbb1a..7667903e4 100644
--- a/vllm/tokenizers/mistral.py
+++ b/vllm/tokenizers/mistral.py
@@ -447,7 +447,9 @@ class MistralTokenizer(TokenizerLike):
         # NOTE: This is for backward compatibility.
         # Transformers should be passed arguments it knows.
         if self.version >= 15:
-            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")
+            reasoning_effort = kwargs.get("reasoning_effort")
+            if reasoning_effort is not None:
+                version_kwargs["reasoning_effort"] = reasoning_effort
 
         messages, tools = _prepare_apply_chat_template_tools_and_messages(
             messages, tools, continue_final_message, add_generation_prompt
```

Great work! Is your Docker image a derivative of Eugr’s, such that the same parameters work?

I don’t know which parameters you are referring to; I have just been using it as a convenient way to build a vLLM with patches for the GB10. But yes, it’s an @eugr-based container, as I stated above.


Do you feel the Mistral Small 4 is better than Qwen 3.5 122B?

Thank you @drew22,
which parameters do you use for vLLM, please?
These?

```
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 160000 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max_num_batched_tokens 16384 \
  --max_num_seqs 8 \
  --gpu_memory_utilization 0.9 \
  --cudagraph-capture-sizes 1 2 4 8 16 32 64 128 256 \
  --max-cudagraph-capture-size 256
```
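Once a server like the one above is up, one way to check whether `[TOOL_CALLS]` tokens still leak is to send a request that forces a tool call and inspect the response content. A sketch of such a payload follows (assumptions: the model name matches the serve command, the server listens on vLLM's default `http://localhost:8000`, and `get_weather` is a made-up tool for illustration):

```python
import json

# Standard OpenAI-style chat-completions payload exercising tool calling.
# If tool parsing works, the reply carries a structured tool_calls entry
# and no raw [TOOL_CALLS] tokens appear in the content field.
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

# POST this to the server, e.g.:
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$(python this_script.py)"
print(json.dumps(payload))
```

If the content field of the response still contains literal `[TOOL_CALLS]` text, the grammar/parser patches from PR 39217 are not active in your build.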