I tried both of the available 35B models, but I get the same error with each:
WARNING 04-13 02:37:02 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:299] version 0.19.0
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:299] model /local_models/qwen35-35b-fp8-mtp
(APIServer pid=1) INFO 04-13 02:37:02 [utils.py:233] non-default args: {'model_tag': '/local_models/qwen35-35b-fp8-mtp', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/local_models/qwen35-35b-fp8-mtp', 'max_model_len': 131072, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.92, 'language_model_only': True, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}}
(APIServer pid=1) WARNING 04-13 02:37:02 [envs.py:1744] Unknown vLLM environment variable detected: VLLM_UF_EAGER_ALLREDUCE
(APIServer pid=1) INFO 04-13 02:37:07 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1) INFO 04-13 02:37:07 [model.py:1678] Using max model len 131072
(APIServer pid=1) WARNING 04-13 02:37:07 [speculative.py:368] method `qwen3_next_mtp` is deprecated and replaced with mtp.
(APIServer pid=1) INFO 04-13 02:37:11 [model.py:549] Resolved architecture: Qwen3_5MoeMTP
(APIServer pid=1) INFO 04-13 02:37:11 [model.py:1678] Using max model len 262144
(APIServer pid=1) WARNING 04-13 02:37:11 [speculative.py:512] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:435: UserWarning:
(APIServer pid=1) Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(APIServer pid=1) Minimum and Maximum cuda capability supported by this version of PyTorch is
(APIServer pid=1) (8.0) - (12.0)
(APIServer pid=1)
(APIServer pid=1) queued_call()
(APIServer pid=1) INFO 04-13 02:37:12 [config.py:281] Setting attention block size to 1072 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 04-13 02:37:12 [config.py:312] Padding mamba page size by 0.75% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 04-13 02:37:12 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-13 02:37:12 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=1) INFO 04-13 02:37:12 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=188) INFO 04-13 02:37:17 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/local_models/qwen35-35b-fp8-mtp', speculative_config=SpeculativeConfig(method='mtp', model='/local_models/qwen35-35b-fp8-mtp', num_spec_tokens=2), tokenizer='/local_models/qwen35-35b-fp8-mtp', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/local_models/qwen35-35b-fp8-mtp, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 
'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=188) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:435: UserWarning:
(EngineCore pid=188) Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore pid=188) Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore pid=188) (8.0) - (12.0)
(EngineCore pid=188)
(EngineCore pid=188) queued_call()
(EngineCore pid=188) INFO 04-13 02:37:18 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=188) INFO 04-13 02:37:18 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.5.35:48121 backend=nccl
(EngineCore pid=188) INFO 04-13 02:37:18 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=188) WARNING 04-13 02:37:19 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=188) INFO 04-13 02:37:19 [gpu_model_runner.py:4735] Starting to load model /local_models/qwen35-35b-fp8-mtp...
(EngineCore pid=188) INFO 04-13 02:37:19 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=188) INFO 04-13 02:37:19 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=188) INFO 04-13 02:37:19 [__init__.py:261] Selected CutlassFP8ScaledMMLinearKernel for Fp8LinearMethod
(EngineCore pid=188) INFO 04-13 02:37:19 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
(EngineCore pid=188) INFO 04-13 02:37:21 [fp8.py:396] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'TRITON', 'MARLIN', 'BATCHED_DEEPGEMM', 'BATCHED_TRITON', 'XPU'].
(EngineCore pid=188) INFO 04-13 02:37:22 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=188) INFO 04-13 02:37:22 [flash_attn.py:596] Using FlashAttention version 2
(EngineCore pid=188) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=188) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:05<01:05, 5.08s/it]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:10<01:05, 5.42s/it]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:18<01:12, 6.59s/it]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:23<00:58, 5.86s/it]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:29<00:52, 5.79s/it]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:34<00:45, 5.65s/it]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:39<00:38, 5.44s/it]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:44<00:31, 5.32s/it]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [01:18<01:11, 14.26s/it]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [01:24<00:46, 11.65s/it]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [01:30<00:29, 9.97s/it]
Loading safetensors checkpoint shards: 86% Completed | 12/14 [01:37<00:18, 9.14s/it]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [02:00<00:13, 13.13s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [02:02<00:00, 9.93s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [02:02<00:00, 8.75s/it]
(EngineCore pid=188)
(EngineCore pid=188) INFO 04-13 02:39:29 [default_loader.py:384] Loading weights took 122.58 seconds
(EngineCore pid=188) INFO 04-13 02:39:29 [fp8.py:560] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=188) INFO 04-13 02:39:29 [gpu_model_runner.py:4759] Loading drafter model...
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:00<00:01, 11.68it/s]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:00<00:00, 11.55it/s]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:00<00:00, 11.58it/s]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:00<00:00, 11.75it/s]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:16<00:00, 11.75it/s]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:18<00:20, 4.01s/it]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:18<00:07, 2.50s/it]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:19<00:01, 1.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:27<00:00, 2.95s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:27<00:00, 1.95s/it]
(EngineCore pid=188)
(EngineCore pid=188) INFO 04-13 02:39:57 [default_loader.py:384] Loading weights took 27.37 seconds
(EngineCore pid=188) INFO 04-13 02:39:57 [eagle.py:1376] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=188) INFO 04-13 02:39:57 [eagle.py:1432] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore pid=188) INFO 04-13 02:39:57 [gpu_model_runner.py:4820] Model loading took 34.18 GiB memory and 157.231340 seconds
(EngineCore pid=188) INFO 04-13 02:40:06 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/2a9aff733a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=188) INFO 04-13 02:40:06 [backends.py:1111] Dynamo bytecode transform time: 8.71 s
(EngineCore pid=188) [rank0]:W0413 02:40:08.913000 188 torch/_inductor/utils.py:1679] Not enough SMs to use max_autotune_gemm mode
(EngineCore pid=188) INFO 04-13 02:40:12 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=188) INFO 04-13 02:41:51 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 103.88 s
(EngineCore pid=188) INFO 04-13 02:41:53 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/9ff78e8d7a0504a92bb3d98862d6d0c171b249edd524a58aeb6d9ea2a550e9d0/rank_0_0/model
(EngineCore pid=188) INFO 04-13 02:41:53 [monitor.py:48] torch.compile took 116.11 s in total
(EngineCore pid=188) Process EngineCore:
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] EngineCore failed to start.
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] super().__init__(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self.collective_rpc("determine_available_memory")
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] self.model_runner.profile_run()
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] outputs = self.model(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self.runnable(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self._call_impl(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return forward_call(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 691, in forward
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] hidden_states = self.language_model.model(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 603, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] output = self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self.fn(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 500, in forward
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] def forward(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 211, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self.optimized_call(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] raise e
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self._call_impl(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return forward_call(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "<eval_with_key>.231", line 330, in forward
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, s18, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_scale_inv_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_); l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_scale_inv_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self.runnable(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return range_entry.runnable(*args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self._compiled_fn(*args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return fn(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1148, in forward
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return compiled_fn(full_args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357, in runtime_wrapper
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] all_outs = call_func_at_runtime_with_args(
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 134, in call_func_at_runtime_with_args
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] out = normalize_as_list(f(args))
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1962, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self.compiled_fn(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531, in wrapper
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return compiled_fn(runtime_args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729, in inner_fn
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] outs = compiled_fn(args)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 638, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self.current_callable(inputs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 3220, in run
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] out = model(new_inputs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/tmp/torchinductor_root/uv/cuv4tqzh434ucxieeg2q7wolbunkpbawuojnokscd5uk25eut3q4.py", line 659, in call
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] torch.ops._C.cutlass_scaled_mm.default(buf8, buf3, reinterpret_tensor(arg5_1, (2048, 12288), (1, 2048), 0), reinterpret_tensor(buf4, (s18, 16), (1, s18), 0), reinterpret_tensor(arg6_1, (16, 96), (1, 16), 0), None)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 819, in __call__
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] return self._op(*args, **kwargs)
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) ERROR 04-13 02:41:54 [core.py:1108] RuntimeError: Error Internal
(EngineCore pid=188) Traceback (most recent call last):
(EngineCore pid=188) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=188) self.run()
(EngineCore pid=188) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=188) self._target(*self._args, **self._kwargs)
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=188) raise e
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=188) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=188) return func(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=188) super().__init__(
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=188) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=188) return func(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=188) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=188) return self.collective_rpc("determine_available_memory")
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=188) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=188) return func(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=188) return func(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(EngineCore pid=188) self.model_runner.profile_run()
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run
(EngineCore pid=188) hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=188) ^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=188) return func(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run
(EngineCore pid=188) outputs = self.model(
(EngineCore pid=188) ^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=188) return self.runnable(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=188) return self._call_impl(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=188) return forward_call(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 691, in forward
(EngineCore pid=188) hidden_states = self.language_model.model(
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 603, in __call__
(EngineCore pid=188) output = self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore pid=188) return self.fn(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 500, in forward
(EngineCore pid=188) def forward(
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 211, in __call__
(EngineCore pid=188) return self.optimized_call(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore pid=188) return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore pid=188) raise e
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore pid=188) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=188) return self._call_impl(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=188) return forward_call(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "<eval_with_key>.231", line 330, in forward
(EngineCore pid=188) submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, s18, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_scale_inv_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_); l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_scale_inv_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(EngineCore pid=188) return self.runnable(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
(EngineCore pid=188) return range_entry.runnable(*args)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(EngineCore pid=188) return self._compiled_fn(*args)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore pid=188) return fn(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1148, in forward
(EngineCore pid=188) return compiled_fn(full_args)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357, in runtime_wrapper
(EngineCore pid=188) all_outs = call_func_at_runtime_with_args(
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 134, in call_func_at_runtime_with_args
(EngineCore pid=188) out = normalize_as_list(f(args))
(EngineCore pid=188) ^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1962, in __call__
(EngineCore pid=188) return self.compiled_fn(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531, in wrapper
(EngineCore pid=188) return compiled_fn(runtime_args)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 729, in inner_fn
(EngineCore pid=188) outs = compiled_fn(args)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 638, in __call__
(EngineCore pid=188) return self.current_callable(inputs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 3220, in run
(EngineCore pid=188) out = model(new_inputs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^
(EngineCore pid=188) File "/tmp/torchinductor_root/uv/cuv4tqzh434ucxieeg2q7wolbunkpbawuojnokscd5uk25eut3q4.py", line 659, in call
(EngineCore pid=188) torch.ops._C.cutlass_scaled_mm.default(buf8, buf3, reinterpret_tensor(arg5_1, (2048, 12288), (1, 2048), 0), reinterpret_tensor(buf4, (s18, 16), (1, s18), 0), reinterpret_tensor(arg6_1, (16, 96), (1, 16), 0), None)
(EngineCore pid=188) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 819, in __call__
(EngineCore pid=188) return self._op(*args, **kwargs)
(EngineCore pid=188) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=188) RuntimeError: Error Internal
[rank0]:[W413 02:41:54.732014588 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 684, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}