Running Mistral Small 4 119B NVFP4 on NVIDIA DGX Spark (GB10)

This is fantastic for getting it up and running, but do I understand it right that this essentially kills reasoning on this model until support lands in mistral_common? While this model is nice and fast, it is a bit dumb sometimes and shoots from the hip far too often.

If you use the patch from drew22, then reasoning should work too.

Well, I gave the Eagle speculative decoding config a shot. At the moment it doesn't appear to be compatible with vLLM: mistralai/Mistral-Small-4-119B-2603-eagle · vLLM does not load the eagle head.

Still hoping someone wants to take on autotune for this model on the DGX Spark. Any takers?

Would you mind sharing quick steps you took to get this all running?

NVFP4 should be supported in eugr's TurboQuant image now. There's a big Mistral PR that also needs to be merged; it appears earlier in the thread. Mistral is really, really fast but not nearly as accurate as Nemotron 3 Super or Qwen 3.5-122B, unfortunately.

Did I miss eugr's TurboQuant image? Happy to swap ASAP.


No, to my knowledge it does not exist.

Thanks, got it running with the instructions in the thread. Speed was unimpressive on two nodes.


So I can't seem to stop Mistral TOOL_CALLS from leaking with the setup I described above. I have been watching the PRs by juliendenize, and [Mistral Grammar] Fix tool and reasoning parsing by juliendenize · Pull Request #39217 · vllm-project/vllm · GitHub seems to be the last PR needed to resolve this. However, building with @eugr's project via `./build-and-copy.sh --apply-vllm-pr 39217` fails:

```
ERROR: failed to build: failed to solve: process "/bin/sh -c curl -fsL https://patch-diff.githubusercontent.com/raw/vllm-project/vllm/pull/35568.diff -o pr35568.diff     && if git apply --reverse --check pr35568.diff 2>/dev/null; then          echo \"PR 35568 already applied, skipping.\";        else          echo \"Applying PR 35568...\";          git apply -v pr35568.diff;        fi     && rm pr35568.diff" did not complete successfully: exit code: 1
vLLM build failed — restoring previous wheels...
```
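For anyone debugging this class of failure: the Dockerfile step above is just `git apply` on a downloaded PR diff, and it exits non-zero when the diff no longer applies to the checked-out vLLM commit. A minimal self-contained sketch of the same mechanics (sandbox repo, not the real vLLM tree) that you can use to pre-check a diff before rebuilding:

```shell
# Reproduce the failing Dockerfile step in a throwaway repo:
# git apply --check reports whether a diff applies without touching the tree.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
printf 'hello\n' > f.txt
git add f.txt
git -c user.email=t@t -c user.name=t commit -qm init

printf 'world\n' > f.txt
git diff > pr.diff            # stand-in for the downloaded prNNNNN.diff
git checkout -q -- f.txt      # back to a clean tree

if git apply --check pr.diff; then echo "diff applies cleanly"; fi
git apply pr.diff             # this is the step that exits non-zero on conflict
grep -q world f.txt && echo "diff applied"
```

Running `git apply --check` against your actual vLLM checkout with the real PR diff tells you whether the build will fail before you spend time on the image build.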

Has anyone else been able to get Mistral running with Tool calls appropriately parsed?

```
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 160000 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max_num_batched_tokens 16384 \
  --max_num_seqs 8 \
  --gpu_memory_utilization 0.9 \
  --cudagraph-capture-sizes 1 2 4 8 16 32 64 128 256 \
  --max-cudagraph-capture-size 256
```

I tried the hermes tool-call-parser but get Tekkenizer errors.

I've also had the same issues with the tool-call parser for other Mistral models in the past.
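To make the symptom concrete: "leaking" here means the raw `[TOOL_CALLS]` control marker shows up in the assistant message `content` instead of being consumed by the parser and turned into structured `tool_calls`. A small hypothetical client-side check for that (the helper name and message shapes are made up for illustration):

```python
import re

# The literal marker the mistral parser is supposed to consume.
LEAK_RE = re.compile(r"\[TOOL_CALLS\]")

def tool_calls_leaked(message: dict) -> bool:
    """Return True if raw tool-call markup leaked into message['content']."""
    content = message.get("content") or ""
    return bool(LEAK_RE.search(content))

# Leaky: the marker and JSON payload ended up in plain content.
leaky = {"content": '[TOOL_CALLS][{"name": "get_weather", "arguments": {}}]'}
# Clean: content is prose and the call landed in the structured field.
clean = {"content": "The weather is sunny.",
         "tool_calls": [{"function": {"name": "get_weather"}}]}

print(tool_calls_leaked(leaky), tool_calls_leaked(clean))  # True False
```

Dropping a check like this into your client makes it easy to tell whether a parser PR actually fixed the leak or just moved it.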


OK, I pulled the latest spark-vllm-docker and was able to build with PR 39217. There was still an issue with tokenizer/mistral.py, so I added the following:

```python
# NOTE: This is for backward compatibility.
# Transformers should be passed arguments it knows.
if self.version >= 15:
    reasoning_effort = kwargs.get("reasoning_effort")
    if reasoning_effort is not None:
        version_kwargs["reasoning_effort"] = reasoning_effort
```
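The idea behind that patch is a version-gated kwargs pass-through: only forward `reasoning_effort` when the tokenizer version is new enough to accept it, and drop it otherwise so older versions never see an unknown argument. A standalone sketch of the same logic (the function name is invented; the real code mutates `version_kwargs` inside a method):

```python
def build_version_kwargs(version: int, **kwargs) -> dict:
    """Forward only the kwargs this tokenizer version understands.

    Version 15+ (assumed, per the patch above) adds reasoning_effort;
    older versions get an empty dict so nothing unknown is passed through.
    """
    version_kwargs: dict = {}
    if version >= 15:
        reasoning_effort = kwargs.get("reasoning_effort")
        if reasoning_effort is not None:
            version_kwargs["reasoning_effort"] = reasoning_effort
    return version_kwargs

print(build_version_kwargs(15, reasoning_effort="high"))  # {'reasoning_effort': 'high'}
print(build_version_kwargs(13, reasoning_effort="high"))  # {}
```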

The model loaded, but on the first request I got this dump from vLLM, which is a first.

```
(EngineCore pid=159) ERROR 04-13 21:15:15 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1rc1.dev243+g995e9a209.d20260413) with config: model=‘mistralai/Mistral-Small-4-119B-2603-NVFP4’, speculative_config=None, tokenizer=‘mistralai/Mistral-Small-4-119B-2603-NVFP4’, skip_tokenizer_init=False, tokenizer_mode=mistral, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=160000, download_dir=None, load_format=mistral, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘mistral’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=mistralai/Mistral-Small-4-119B-2603-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘/root/.cache/vllm/torch_compile_cache/111afa2bb9’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘ir_enable_torch_wrap’: True, ‘splitting_ops’: [‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, 
‘vllm::gdn_attention_core’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, ‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_images_per_batch’: 0, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [16384], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.PIECEWISE: 1>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 32, 64, 128, 256], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: True, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False}, ‘max_cudagraph_capture_size’: 256, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: ‘/root/.cache/vllm/torch_compile_cache/111afa2bb9/rank_0_0/backbone’, ‘fast_moe_cold_start’: False, ‘static_all_moe_layers’: [‘language_model.model.layers.0.mlp.experts’, ‘language_model.model.layers.1.mlp.experts’, ‘language_model.model.layers.2.mlp.experts’, ‘language_model.model.layers.3.mlp.experts’, ‘language_model.model.layers.4.mlp.experts’, ‘language_model.model.layers.5.mlp.experts’, ‘language_model.model.layers.6.mlp.experts’, ‘language_model.model.layers.7.mlp.experts’, ‘language_model.model.layers.8.mlp.experts’, ‘language_model.model.layers.9.mlp.experts’, ‘language_model.model.layers.10.mlp.experts’, ‘language_model.model.layers.11.mlp.experts’, ‘language_model.model.layers.12.mlp.experts’, 
‘language_model.model.layers.13.mlp.experts’, ‘language_model.model.layers.14.mlp.experts’, ‘language_model.model.layers.15.mlp.experts’, ‘language_model.model.layers.16.mlp.experts’, ‘language_model.model.layers.17.mlp.experts’, ‘language_model.model.layers.18.mlp.experts’, ‘language_model.model.layers.19.mlp.experts’, ‘language_model.model.layers.20.mlp.experts’, ‘language_model.model.layers.21.mlp.experts’, ‘language_model.model.layers.22.mlp.experts’, ‘language_model.model.layers.23.mlp.experts’, ‘language_model.model.layers.24.mlp.experts’, ‘language_model.model.layers.25.mlp.experts’, ‘language_model.model.layers.26.mlp.experts’, ‘language_model.model.layers.27.mlp.experts’, ‘language_model.model.layers.28.mlp.experts’, ‘language_model.model.layers.29.mlp.experts’, ‘language_model.model.layers.30.mlp.experts’, ‘language_model.model.layers.31.mlp.experts’, ‘language_model.model.layers.32.mlp.experts’, ‘language_model.model.layers.33.mlp.experts’, ‘language_model.model.layers.34.mlp.experts’, ‘language_model.model.layers.35.mlp.experts’]}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=[‘native’]), enable_flashinfer_autotune=True, moe_backend=‘auto’),
(EngineCore pid=159) ERROR 04-13 21:15:15 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=, scheduled_cached_reqs=CachedRequestData(req_ids=[‘chatcmpl-a14dcb6cdadfad8c-90cf504f’],resumed_req_ids=set(),new_token_ids_lens=,all_token_ids_lens={},new_block_ids=[None],num_computed_tokens=[19],num_output_tokens=[1]), num_scheduled_tokens={chatcmpl-a14dcb6cdadfad8c-90cf504f: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[2], finished_req_ids=, free_encoder_mm_hashes=, preempted_req_ids=, has_structured_output_requests=true, pending_structured_output_tokens=true, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=159) ERROR 04-13 21:15:15 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=2.6725105564118223e-05, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=19, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=, spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] EngineCore encountered a fatal error.
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] Traceback (most recent call last):
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1103, in run_engine_core
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] engine_core.run_busy_loop()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1144, in run_busy_loop
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] self._process_engine_step()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1183, in _process_engine_step
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] outputs, model_executed = self.step_fn()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 453, in step_with_batch_queue
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] exec_future = self.model_executor.execute_model(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py”, line 114, in execute_model
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] output.result()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/lib/python3.12/concurrent/futures/_base.py”, line 449, in result
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.__get_result()
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/lib/python3.12/concurrent/futures/_base.py”, line 401, in __get_result
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] raise self._exception
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py”, line 84, in collective_rpc
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py”, line 510, in run_method
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py”, line 332, in execute_model
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.worker.execute_model(scheduler_output)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 124, in decorate_context
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py”, line 808, in execute_model
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] output = self.model_runner.execute_model(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 124, in decorate_context
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 4038, in execute_model
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] model_output = self._model_forward(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 3519, in _model_forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.model(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py”, line 254, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.runnable(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self._call_impl(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in _call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return forward_call(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py”, line 431, in forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] hidden_states = self.language_model.model(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py”, line 480, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py”, line 224, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.fn(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py”, line 1228, in forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] def forward(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py”, line 211, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.optimized_call(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 949, in call_wrapped
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 461, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] raise e
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 447, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return super(self.cls, obj).call(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.call_impl(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return forward_call(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “<eval_with_key>.149”, line 377, in forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] submod_3 = self.submod_3(getitem_2, s59, getitem, getitem_1, getitem_3, synthetic_local_tmp_0
, submod_1); getitem_2 = getitem = getitem_1 = synthetic_local_tmp_0 = submod_1 = submod_3 = None
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 949, in call_wrapped
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 461, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] raise e
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 447, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return super(self.cls, obj).call(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self.call_impl(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in call_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return forward_call(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “<eval_with_key>.152”, line 5, in forward
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output, synthetic_local_tmp_0
, kv_cache_dummy_dep = kv_cache_dummy_dep); q_1 = kv_c_normed = key_rot_1 = output = synthetic_local_tmp_0 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/torch/_ops.py”, line 1269, in call
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return self._op(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py”, line 40, in wrapper
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return func(*args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py”, line 983, in unified_mla_attention_with_output
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] layer.forward_impl(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py”, line 698, in forward_impl
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py”, line 196, in forward_mqa
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] decode_attention_fwd(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 762, in decode_attention_fwd
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] decode_attention_fwd_grouped(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 696, in decode_attention_fwd_grouped
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] _decode_grouped_att_m_fwd(
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 500, in _decode_grouped_att_m_fwd
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] _fwd_grouped_kernel_stage1[grid](
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 370, in
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 720, in run
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 849, in _do_compile
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] kernel = self.compile(src, target=target, options=options.dict)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py”, line 304, in compile
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] module = src.make_ir(target, options, codegen_fns, module_map, context)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] File “/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py”, line 80, in make_ir
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] triton.compiler.errors.CompilationError: at 152:12:
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] v = (v.to(tl.float32) * vs).to(q.dtype)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] else:
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] # MLA uses a single c_kv.
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] # loading the same c_kv to interpret it as v is not necessary.
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] # transpose the existing c_kv (aka k) for the dot product.
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] v = tl.trans(k)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112]
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] re_scale = tl.exp(e_max - n_e_max)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] p = tl.exp(qk - n_e_max[:, None])
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] acc *= re_scale[:, None]
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] acc += tl.dot(p.to(v.dtype), v)
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ^
(EngineCore pid=159) ERROR 04-13 21:15:15 [core.py:1112] ValueError(‘Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512’)
(EngineCore pid=159) Process EngineCore:
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] AsyncLLM output_handler failed.
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] Traceback (most recent call last):
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 657, in output_handler
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] outputs = await engine_core.get_output_async()
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 998, in get_output_async
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] raise self._format_exception(outputs) from None
(APIServer pid=80) ERROR 04-13 21:15:15 [async_llm.py:701] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore pid=159) Traceback (most recent call last):
(EngineCore pid=159) File “/usr/lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore pid=159) self.run()
(EngineCore pid=159) File “/usr/lib/python3.12/multiprocessing/process.py”, line 108, in run
(EngineCore pid=159) self._target(*self._args, **self._kwargs)
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1114, in run_engine_core
(EngineCore pid=159) raise e
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1103, in run_engine_core
(EngineCore pid=159) engine_core.run_busy_loop()
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1144, in run_busy_loop
(EngineCore pid=159) self._process_engine_step()
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 1183, in _process_engine_step
(EngineCore pid=159) outputs, model_executed = self.step_fn()
(EngineCore pid=159) ^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py”, line 453, in step_with_batch_queue
(EngineCore pid=159) exec_future = self.model_executor.execute_model(
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py”, line 114, in execute_model
(EngineCore pid=159) output.result()
(EngineCore pid=159) File “/usr/lib/python3.12/concurrent/futures/_base.py”, line 449, in result
(EngineCore pid=159) return self.__get_result()
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/lib/python3.12/concurrent/futures/_base.py”, line 401, in __get_result
(EngineCore pid=159) raise self._exception
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py”, line 84, in collective_rpc
(EngineCore pid=159) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py”, line 510, in run_method
(EngineCore pid=159) return func(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py”, line 332, in execute_model
(EngineCore pid=159) return self.worker.execute_model(scheduler_output)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 124, in decorate_context
(EngineCore pid=159) return func(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py”, line 808, in execute_model
(EngineCore pid=159) output = self.model_runner.execute_model(
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py”, line 124, in decorate_context
(EngineCore pid=159) return func(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 4038, in execute_model
(EngineCore pid=159) model_output = self._model_forward(
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py”, line 3519, in _model_forward
(EngineCore pid=159) return self.model(
(EngineCore pid=159) ^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py”, line 254, in __call__
(EngineCore pid=159) return self.runnable(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) return self._call_impl(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in _call_impl
(EngineCore pid=159) return forward_call(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py”, line 431, in forward
(EngineCore pid=159) hidden_states = self.language_model.model(
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py”, line 480, in __call__
(EngineCore pid=159) return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py”, line 224, in __call__
(EngineCore pid=159) return self.fn(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py”, line 1228, in forward
(EngineCore pid=159) def forward(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py”, line 211, in __call__
(EngineCore pid=159) return self.optimized_call(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 949, in call_wrapped
(EngineCore pid=159) return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 461, in __call__
(EngineCore pid=159) raise e
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 447, in __call__
(EngineCore pid=159) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) return self._call_impl(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in _call_impl
(EngineCore pid=159) return forward_call(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “<eval_with_key>.149”, line 377, in forward
(EngineCore pid=159) submod_3 = self.submod_3(getitem_2, s59, getitem, getitem_1, getitem_3, synthetic_local_tmp_0, submod_1); getitem_2 = getitem = getitem_1 = synthetic_local_tmp_0 = submod_1 = submod_3 = None
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 949, in call_wrapped
(EngineCore pid=159) return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 461, in __call__
(EngineCore pid=159) raise e
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py”, line 447, in __call__
(EngineCore pid=159) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1779, in _wrapped_call_impl
(EngineCore pid=159) return self._call_impl(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1790, in _call_impl
(EngineCore pid=159) return forward_call(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “<eval_with_key>.152”, line 5, in forward
(EngineCore pid=159) unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output, synthetic_local_tmp_0, kv_cache_dummy_dep = kv_cache_dummy_dep); q_1 = kv_c_normed = key_rot_1 = output = synthetic_local_tmp_0 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/torch/_ops.py”, line 1269, in __call__
(EngineCore pid=159) return self._op(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py”, line 40, in wrapper
(EngineCore pid=159) return func(*args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py”, line 983, in unified_mla_attention_with_output
(EngineCore pid=159) layer.forward_impl(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py”, line 698, in forward_impl
(EngineCore pid=159) attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py”, line 196, in forward_mqa
(EngineCore pid=159) decode_attention_fwd(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 762, in decode_attention_fwd
(EngineCore pid=159) decode_attention_fwd_grouped(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 696, in decode_attention_fwd_grouped
(EngineCore pid=159) _decode_grouped_att_m_fwd(
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py”, line 500, in _decode_grouped_att_m_fwd
(EngineCore pid=159) _fwd_grouped_kernel_stage1[grid](
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 370, in <lambda>
(EngineCore pid=159) return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 720, in run
(EngineCore pid=159) kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py”, line 849, in _do_compile
(EngineCore pid=159) kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py”, line 304, in compile
(EngineCore pid=159) module = src.make_ir(target, options, codegen_fns, module_map, context)
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) File “/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py”, line 80, in make_ir
(EngineCore pid=159) return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(EngineCore pid=159) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=159) triton.compiler.errors.CompilationError: at 152:12:
(EngineCore pid=159) v = (v.to(tl.float32) * vs).to(q.dtype)
(EngineCore pid=159) else:
(EngineCore pid=159) # MLA uses a single c_kv.
(EngineCore pid=159) # loading the same c_kv to interpret it as v is not necessary.
(EngineCore pid=159) # transpose the existing c_kv (aka k) for the dot product.
(EngineCore pid=159) v = tl.trans(k)
(EngineCore pid=159)
(EngineCore pid=159) n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(EngineCore pid=159) re_scale = tl.exp(e_max - n_e_max)
(EngineCore pid=159) p = tl.exp(qk - n_e_max[:, None])
(EngineCore pid=159) acc *= re_scale[:, None]
(EngineCore pid=159) acc += tl.dot(p.to(v.dtype), v)
(EngineCore pid=159) ^
(EngineCore pid=159) ValueError(‘Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512’)
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] Error in chat completion stream generator.
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] Traceback (most recent call last):
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py”, line 602, in chat_completion_stream_generator
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] async for res in result_generator:
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 576, in generate
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] out = q.get_nowait() or await q.get()
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] ^^^^^^^^^^^^^
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py”, line 85, in get
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] raise output
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 657, in output_handler
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] outputs = await engine_core.get_output_async()
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 998, in get_output_async
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] raise self._format_exception(outputs) from None
(APIServer pid=80) ERROR 04-13 21:15:15 [serving.py:1307] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=80) INFO: 10.0.1.172:60796 - “POST /v1/chat/completions HTTP/1.1” 500 Internal Server Error
```

Good news everyone!

I have a build of vLLM made with eugr/spark-vllm-docker (https://github.com/eugr/spark-vllm-docker) that works with https://github.com/vllm-project/vllm/pull/39217. This one resolves the tool-calling issues I was seeing in Opencode.

If anyone wants to just use my image you can:

```
docker pull androiddrew/mistral4-vllm-spark:26-04-14
```

Two additional fixes were needed on top of applying https://github.com/vllm-project/vllm/pull/39217 via `./build-and-copy.sh --apply-vllm-pr 39217`:

```diff
diff --git a/vllm/v1/attention/ops/triton_decode_attention.py b/vllm/v1/attention/ops/triton_decode_attention.py
index 8118db0da..347dfcc07 100644
--- a/vllm/v1/attention/ops/triton_decode_attention.py
+++ b/vllm/v1/attention/ops/triton_decode_attention.py
@@ -467,7 +467,14 @@ def _decode_grouped_att_m_fwd(
     if is_hip_ and Lk >= 576:
         BLOCK = 16
 
-    if Lk == 576:
+    if is_mla and Lk > Lv:
+        # MLA: KV cache stores [c_kv || k_pe] concatenated.
+        # Split into nope (BLOCK_DMODEL = kv_lora_rank) and rope (BLOCK_DPE)
+        # so the kernel loads them separately and v = trans(k_nope) matches
+        # the accumulator dimension (BLOCK_DV).
+        BLOCK_DMODEL = triton.next_power_of_2(Lv)
+        BLOCK_DPE = triton.next_power_of_2(Lk - Lv)
+    elif Lk == 576:
         BLOCK_DMODEL = 512
         BLOCK_DPE = 64
     elif Lk == 288:
```
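The Triton patch above generalizes the hard-coded head-dim split. The split it computes can be sketched in plain Python (a sketch only; `next_power_of_2` below is a local stand-in for `triton.next_power_of_2`, and `split_mla_head_dim` is a hypothetical helper name, not vLLM code):

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (mirrors triton.next_power_of_2)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def split_mla_head_dim(Lk: int, Lv: int) -> tuple[int, int]:
    # MLA caches [c_kv || k_pe] concatenated; Lv is the c_kv ("nope") width,
    # Lk - Lv is the rotary ("rope") tail. Splitting them keeps the
    # v = trans(k_nope) dot product at the accumulator width (BLOCK_DV).
    block_dmodel = next_power_of_2(Lv)       # nope part
    block_dpe = next_power_of_2(Lk - Lv)     # rope part
    return block_dmodel, block_dpe

# The crash above came from Lk=576, Lv=512 falling into a path that
# mismatched 256 vs 512; the split restores the 512/64 layout.
print(split_mla_head_dim(576, 512))  # -> (512, 64)
```

With the generic split, any future MLA variant where `Lk > Lv` takes the same path instead of needing another hard-coded `elif`.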

and

```diff
diff --git a/vllm/tokenizers/mistral.py b/vllm/tokenizers/mistral.py
index 3a79fbb1a..7667903e4 100644
--- a/vllm/tokenizers/mistral.py
+++ b/vllm/tokenizers/mistral.py
@@ -447,7 +447,9 @@ class MistralTokenizer(TokenizerLike):
         # NOTE: This is for backward compatibility.
         # Transformers should be passed arguments it knows.
         if self.version >= 15:
-            version_kwargs["reasoning_effort"] = kwargs.get("reasoning_effort")
+            reasoning_effort = kwargs.get("reasoning_effort")
+            if reasoning_effort is not None:
+                version_kwargs["reasoning_effort"] = reasoning_effort
 
         messages, tools = _prepare_apply_chat_template_tools_and_messages(
             messages, tools, continue_final_message, add_generation_prompt
```

Great work! Is your Docker image a derivative of Eugr’s, such that the same parameters work?

I don’t know which parameters you are referring to; I have just been using it as a convenient way to build a vLLM with patches for the GB10. But yes, it’s an @eugr-based container, as I stated above.


Do you feel the Mistral Small 4 is better than Qwen 3.5 122B?

Thank you @drew22,
which parameters do you use for vLLM, please?
These?

```
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 160000 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max_num_batched_tokens 16384 \
  --max_num_seqs 8 \
  --gpu_memory_utilization 0.9 \
  --cudagraph-capture-sizes 1 2 4 8 16 32 64 128 256 \
  --max-cudagraph-capture-size 256
```
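Once a server like the one above is up, one way to check whether `[TOOL_CALLS]` tokens still leak is to send a request that forces a tool call and inspect the response content. A sketch of such a payload follows (assumptions: the model name matches the serve command, the server listens on vLLM's default `http://localhost:8000`, and `get_weather` is a made-up tool for illustration):

```python
import json

# Standard OpenAI-style chat-completions payload exercising tool calling.
# If tool parsing works, the reply carries a structured tool_calls entry
# and no raw [TOOL_CALLS] tokens appear in the content field.
payload = {
    "model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

# POST this to the server, e.g.:
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$(python this_script.py)"
print(json.dumps(payload))
```

If the content field of the response still contains literal `[TOOL_CALLS]` text, the grammar/parser patches from PR 39217 are not active in your build.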