I don’t think V0 even supports these architectures anymore; in any case, V1 has been the default engine since at least last year.
Case in point: here is my latest (unpublished) build with Johnny’s PRs. These PRs will be included in the next nightly run (assuming it doesn’t fail):
./run-recipe.sh -t vllm-node-20260330-nvfp4-cudnn-tf5 recipes/qwen3.5-122b-int4-autoround.yaml --port 8888 --served-model-name coder-250k --no-ray
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299]
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.18.1rc1.dev254+g494636b29.d20260330
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299] █▄█▀ █ █ █ █ model Intel/Qwen3.5-122B-A10B-int4-AutoRound
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:299]
(APIServer pid=69) INFO 03-30 22:35:40 [utils.py:233] non-default args: {'model_tag': 'Intel/Qwen3.5-122B-A10B-int4-AutoRound', 'chat_template': 'unsloth.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'port': 8888, 'model': 'Intel/Qwen3.5-122B-A10B-int4-AutoRound', 'trust_remote_code': True, 'max_model_len': 262144, 'served_model_name': ['coder-250k'], 'load_format': 'fastsafetensors', 'reasoning_parser': 'qwen3', 'master_addr': '192.168.24.104', 'nnodes': 2, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': True, 'max_num_batched_tokens': 8192}
(APIServer pid=69) WARNING 03-30 22:35:40 [envs.py:1749] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=69) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=69) INFO 03-30 22:35:41 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=69) INFO 03-30 22:35:41 [model.py:1679] Using max model len 262144
(APIServer pid=69) INFO 03-30 22:35:41 [arg_utils.py:1719] Inferred data_parallel_rank 0 from node_rank 0
(APIServer pid=69) INFO 03-30 22:35:41 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=69) WARNING 03-30 22:35:41 [config.py:253] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=69) INFO 03-30 22:35:41 [config.py:273] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=69) INFO 03-30 22:35:41 [vllm.py:789] Asynchronous scheduling is enabled.
(APIServer pid=69) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=69) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
/usr/local/lib/python3.12/dist-packages/torch/compiler/__init__.py:148: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
return torch._dynamo.allow_in_graph(fn)
(EngineCore pid=124) INFO 03-30 22:35:58 [core.py:105] Initializing a V1 LLM engine (v0.18.1rc1.dev254+g494636b29.d20260330) with config: model='Intel/Qwen3.5-122B-A10B-int4-AutoRound', speculative_config=None, tokenizer='Intel/Qwen3.5-122B-A10B-int4-AutoRound', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=inc, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=coder-250k, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=124) WARNING 03-30 22:35:58 [multiproc_executor.py:1014] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=124) INFO 03-30 22:35:58 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=192.168.24.104, mq_connect_ip=192.168.24.104 (local), world_size=2, local_world_size=1
/usr/local/lib/python3.12/dist-packages/torch/compiler/__init__.py:148: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
return torch._dynamo.allow_in_graph(fn)
`Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(Worker pid=171) INFO 03-30 22:36:03 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://192.168.24.104:29501 backend=nccl
(Worker pid=171) INFO 03-30 22:36:09 [pynccl.py:111] vLLM is using nccl==2.29.7
(Worker pid=171) WARNING 03-30 22:36:10 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
(Worker pid=171) INFO 03-30 22:36:10 [parallel_state.py:1716] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=171) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(Worker pid=171) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [gpu_model_runner.py:4737] Starting to load model Intel/Qwen3.5-122B-A10B-int4-AutoRound...
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [gptq_marlin.py:382] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=171) INFO 03-30 22:36:20 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=171) INFO 03-30 22:36:21 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker_TP0 pid=171) INFO 03-30 22:36:21 [flash_attn.py:607] Using FlashAttention version 2
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors using Fastsafetensor loader: 12% Completed | 1/8 [00:06<00:48, 6.95s/it]
Loading safetensors using Fastsafetensor loader: 25% Completed | 2/8 [00:13<00:40, 6.76s/it]
Loading safetensors using Fastsafetensor loader: 38% Completed | 3/8 [00:20<00:34, 6.93s/it]
Loading safetensors using Fastsafetensor loader: 50% Completed | 4/8 [00:26<00:26, 6.65s/it]
Loading safetensors using Fastsafetensor loader: 62% Completed | 5/8 [00:34<00:20, 6.86s/it]
Loading safetensors using Fastsafetensor loader: 75% Completed | 6/8 [00:40<00:13, 6.55s/it]
Loading safetensors using Fastsafetensor loader: 88% Completed | 7/8 [00:43<00:05, 5.36s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:43<00:00, 3.90s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:43<00:00, 5.48s/it]
(Worker_TP0 pid=171)
(Worker_TP0 pid=171) INFO 03-30 22:37:06 [default_loader.py:384] Loading weights took 43.80 seconds
(Worker_TP0 pid=171) INFO 03-30 22:37:08 [gpu_model_runner.py:4822] Model loading took 31.47 GiB memory and 47.181670 seconds
(Worker_TP0 pid=171) INFO 03-30 22:37:08 [interface.py:586] Setting attention block size to 2096 tokens to ensure that attention page size is >= mamba page size.
(Worker_TP0 pid=171) INFO 03-30 22:37:08 [interface.py:610] Padding mamba page size by 0.58% to ensure that mamba page size and attention page size are exactly equal.
(Worker_TP0 pid=171) INFO 03-30 22:37:08 [gpu_model_runner.py:5761] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1412: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171) allow_in_graph(einops.rearrange)
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1414: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171) allow_in_graph(einops.reduce)
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1417: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171) allow_in_graph(einops.repeat) # available since einops 0.2.0
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1420: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171) allow_in_graph(einops.einsum) # available since einops 0.5.0
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1423: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171) allow_in_graph(einops.pack) # available since einops 0.6.0
(Worker_TP0 pid=171) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/decorators.py:1426: FutureWarning: torch._dynamo.allow_in_graph is deprecated and will be removed in a future version. Use torch._dynamo.nonstrict_trace instead.
(Worker_TP0 pid=171) allow_in_graph(einops.unpack) # available since einops 0.6.0
(Worker_TP0 pid=171) INFO 03-30 22:37:20 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/3ce4635d9f/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=171) INFO 03-30 22:37:20 [backends.py:1111] Dynamo bytecode transform time: 5.39 s
(EngineCore pid=124) INFO 03-30 22:38:09 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=171) INFO 03-30 22:38:32 [backends.py:390] Compiling a graph for compile range (1, 8192) takes 72.03 s
(Worker_TP0 pid=171) INFO 03-30 22:38:37 [backends.py:895] collected artifacts: 49 entries, 39 artifacts, 192096968 bytes total
(Worker_TP0 pid=171) INFO 03-30 22:38:37 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/841de02c07eaef4369f98f98168acac541deb81e0d029b6009864039cc79cfb5/rank_0_0/model
(Worker_TP0 pid=171) INFO 03-30 22:38:37 [monitor.py:48] torch.compile took 82.85 s in total
(EngineCore pid=124) INFO 03-30 22:39:09 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=171) INFO 03-30 22:40:06 [monitor.py:76] Initial profiling/warmup run took 89.13 s
(EngineCore pid=124) INFO 03-30 22:40:09 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=171) INFO 03-30 22:40:11 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(Worker_TP0 pid=171) INFO 03-30 22:40:12 [gpu_model_runner.py:5884] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(Worker_TP0 pid=171) INFO 03-30 22:40:45 [gpu_model_runner.py:5963] Estimated CUDA graph memory: 1.53 GiB total
(Worker_TP0 pid=171) INFO 03-30 22:40:45 [gpu_worker.py:436] Available KV cache memory: 46.8 GiB
(Worker_TP0 pid=171) INFO 03-30 22:40:45 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.7000 to 0.7126 to maintain the same effective KV cache size.
(EngineCore pid=124) INFO 03-30 22:40:45 [kv_cache_utils.py:1319] GPU KV cache size: 1,018,656 tokens
(EngineCore pid=124) INFO 03-30 22:40:45 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 14.73x
(Worker_TP0 pid=171) 2026-03-30 22:40:48,462 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=171) 2026-03-30 22:40:49,052 - INFO - autotuner.py:464 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:06<00:00, 8.33it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:03<00:00, 8.91it/s]
(Worker_TP0 pid=171) INFO 03-30 22:41:00 [gpu_model_runner.py:6052] Graph capturing finished in 12 secs, took 0.83 GiB
(Worker_TP0 pid=171) INFO 03-30 22:41:00 [gpu_worker.py:597] CUDA graph pool memory: 0.83 GiB (actual), 1.53 GiB (estimated), difference: 0.7 GiB (83.8%).
(EngineCore pid=124) INFO 03-30 22:41:00 [core.py:283] init engine (profile, create kv cache, warmup model) took 232.16 seconds
(EngineCore pid=124) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=124) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=124) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=124) INFO 03-30 22:41:12 [vllm.py:789] Asynchronous scheduling is enabled.
(APIServer pid=69) INFO 03-30 22:41:12 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=69) INFO 03-30 22:41:13 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=69) WARNING 03-30 22:41:13 [model.py:1436] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=69) INFO 03-30 22:41:13 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=69) INFO 03-30 22:41:20 [base.py:231] Multi-modal warmup completed in 6.375s
(APIServer pid=69) INFO 03-30 22:41:20 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8888
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:37] Available routes are:
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=69) INFO 03-30 22:41:20 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=69) INFO: Started server process [69]
(APIServer pid=69) INFO: Waiting for application startup.
(APIServer pid=69) INFO: Application startup complete.
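For reference, the recipe roughly expands to the direct `vllm serve` invocation below. This is only a sketch reconstructed from the "non-default args" line in the startup log; the spellings of the multi-node flags are inferred from the logged keys `nnodes` and `master_addr`, and the wrapper script also sets things the CLI won't (e.g. `VLLM_BASE_DIR`), so verify against `vllm serve --help` on your build before relying on it:

```bash
# Sketch only: reconstructed from the logged non-default args above.
# Multi-node flags (--nnodes / --master-addr) are inferred from the logged
# keys 'nnodes' and 'master_addr'; check the exact spelling in your build.
vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
  --served-model-name coder-250k \
  --host 0.0.0.0 --port 8888 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.7 \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --chat-template unsloth.jinja \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --tensor-parallel-size 2 \
  --nnodes 2 --master-addr 192.168.24.104
```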
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 | 3496.38 ± 116.70 | | 594.66 ± 19.99 | 586.70 ± 19.99 | 594.83 ± 19.95 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 | 45.22 ± 0.20 | 46.69 ± 0.21 | | | |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d4096 | 3497.13 ± 368.29 | | 1786.36 ± 202.64 | 1778.40 ± 202.64 | 1786.47 ± 202.59 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d4096 | 44.87 ± 0.02 | 46.32 ± 0.01 | | | |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d8192 | 3688.09 ± 39.16 | | 2785.05 ± 29.71 | 2777.09 ± 29.71 | 2785.16 ± 29.71 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d8192 | 45.85 ± 2.13 | 47.35 ± 2.20 | | | |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d16384 | 3666.54 ± 15.81 | | 5035.41 ± 21.75 | 5027.45 ± 21.75 | 5035.58 ± 21.82 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d16384 | 44.83 ± 1.54 | 46.30 ± 1.59 | | | |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d32078 | 3384.99 ± 14.77 | | 10089.92 ± 44.19 | 10081.96 ± 44.19 | 10090.00 ± 44.21 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | tg32 @ d32078 | 43.94 ± 1.52 | 45.37 ± 1.57 | | | |
llama-benchy (0.3.5)
date: 2026-03-30 15:52:18 | latency mode: api | pp basis: ttfr
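If you want to poke at a setup like this yourself, here is a minimal smoke test against the OpenAI-compatible endpoint. The route, port, and served model name all come from the startup log above; the prompt is just an example:

```bash
# Quick check that the server answers on the logged /v1/chat/completions route.
curl -s http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coder-250k",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```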