I ran this on an ASUS GX10:
❯ ./launch-cluster.sh --solo \
exec vllm serve Qwen/Qwen3-Coder-Next-FP8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.8 \
--host 0.0.0.0 --port 8888 \
--load-format fastsafetensors \
--attention-backend flashinfer \
--enable-prefix-caching
Solo mode enabled. Skipping node detection.
Head Node: 127.0.0.1
Worker Nodes:
Container Name: vllm_node
Image Name: vllm-node
Action: exec
Starting Head Node on 127.0.0.1...
f9be5655dd7d9c159a39729e4974a696be7ed2899360f76c785cfe35da872b1e
Solo mode active: Skipping Ray cluster readiness check.
Executing command on head node: vllm serve Qwen/Qwen3-Coder-Next-FP8 --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8888 --load-format fastsafetensors --attention-backend flashinfer --enable-prefix-caching
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287]
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287] █ █ █▄ ▄█
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.0rc2.dev126+gb96f7314b.d20260212
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287] █▄█▀ █ █ █ █ model Qwen/Qwen3-Coder-Next-FP8
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:287]
(APIServer pid=116) INFO 02-12 08:20:02 [utils.py:223] non-default args: {'model_tag': 'Qwen/Qwen3-Coder-Next-FP8', 'host': '0.0.0.0', 'port': 8888, 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'Qwen/Qwen3-Coder-Next-FP8', 'load_format': 'fastsafetensors', 'attention_backend': 'flashinfer', 'gpu_memory_utilization': 0.8, 'enable_prefix_caching': True}
(APIServer pid=116) WARNING 02-12 08:20:02 [envs.py:1625] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=116) INFO 02-12 08:20:04 [model.py:531] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=116) INFO 02-12 08:20:04 [model.py:1555] Using max model len 262144
(APIServer pid=116) INFO 02-12 08:20:05 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=116) WARNING 02-12 08:20:05 [config.py:337] Mamba cache mode is set to 'align' for Qwen3NextForCausalLM by default when prefix caching is enabled
(APIServer pid=116) INFO 02-12 08:20:05 [config.py:361] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=116) INFO 02-12 08:20:05 [config.py:504] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=116) INFO 02-12 08:20:05 [config.py:535] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=116) WARNING 02-12 08:20:05 [vllm.py:689] Async scheduling is not compatible with prefix caching for Mamba models and will be disabled.
(APIServer pid=116) INFO 02-12 08:20:05 [vllm.py:698] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=256) INFO 02-12 08:20:12 [core.py:97] Initializing a V1 LLM engine (v0.16.0rc2.dev126+gb96f7314b.d20260212) with config: model='Qwen/Qwen3-Coder-Next-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Coder-Next-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-Coder-Next-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=256) INFO 02-12 08:20:12 [parallel_state.py:1246] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://28.0.0.1:53493 backend=nccl
[W212 08:20:22.751247172 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(EngineCore_DP0 pid=256) INFO 02-12 08:20:22 [parallel_state.py:1474] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=256) INFO 02-12 08:20:23 [gpu_model_runner.py:4124] Starting to load model Qwen/Qwen3-Coder-Next-FP8...
(EngineCore_DP0 pid=256) INFO 02-12 08:20:33 [fp8.py:338] Using TRITON Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM', 'TRITON', 'BATCHED_TRITON', 'MARLIN', 'XPU'].
(EngineCore_DP0 pid=256) INFO 02-12 08:20:34 [cuda.py:331] Using AttentionBackendEnum.FLASHINFER backend.
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/40 [00:00<?, ?it/s]
(EngineCore_DP0 pid=256) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=256) warnings.warn(
Loading safetensors using Fastsafetensor loader: 2% Completed | 1/40 [00:02<01:42, 2.62s/it]
Loading safetensors using Fastsafetensor loader: 5% Completed | 2/40 [00:04<01:19, 2.10s/it]
Loading safetensors using Fastsafetensor loader: 8% Completed | 3/40 [00:06<01:14, 2.02s/it]
Loading safetensors using Fastsafetensor loader: 10% Completed | 4/40 [00:08<01:09, 1.92s/it]
Loading safetensors using Fastsafetensor loader: 12% Completed | 5/40 [00:09<01:04, 1.86s/it]
Loading safetensors using Fastsafetensor loader: 15% Completed | 6/40 [00:11<01:01, 1.82s/it]
Loading safetensors using Fastsafetensor loader: 18% Completed | 7/40 [00:13<01:01, 1.86s/it]
Loading safetensors using Fastsafetensor loader: 20% Completed | 8/40 [00:15<00:58, 1.83s/it]
Loading safetensors using Fastsafetensor loader: 22% Completed | 9/40 [00:16<00:55, 1.80s/it]
Loading safetensors using Fastsafetensor loader: 25% Completed | 10/40 [00:18<00:51, 1.72s/it]
Loading safetensors using Fastsafetensor loader: 28% Completed | 11/40 [00:20<00:48, 1.67s/it]
Loading safetensors using Fastsafetensor loader: 30% Completed | 12/40 [00:22<00:49, 1.76s/it]
Loading safetensors using Fastsafetensor loader: 32% Completed | 13/40 [00:23<00:46, 1.70s/it]
Loading safetensors using Fastsafetensor loader: 35% Completed | 14/40 [00:25<00:44, 1.70s/it]
Loading safetensors using Fastsafetensor loader: 38% Completed | 15/40 [00:26<00:42, 1.70s/it]
Loading safetensors using Fastsafetensor loader: 40% Completed | 16/40 [00:28<00:40, 1.70s/it]
Loading safetensors using Fastsafetensor loader: 42% Completed | 17/40 [00:30<00:40, 1.78s/it]
Loading safetensors using Fastsafetensor loader: 45% Completed | 18/40 [00:32<00:39, 1.79s/it]
Loading safetensors using Fastsafetensor loader: 48% Completed | 19/40 [00:34<00:37, 1.78s/it]
Loading safetensors using Fastsafetensor loader: 50% Completed | 20/40 [00:35<00:35, 1.76s/it]
Loading safetensors using Fastsafetensor loader: 52% Completed | 21/40 [00:37<00:33, 1.75s/it]
Loading safetensors using Fastsafetensor loader: 55% Completed | 22/40 [00:39<00:31, 1.75s/it]
Loading safetensors using Fastsafetensor loader: 57% Completed | 23/40 [00:41<00:30, 1.78s/it]
Loading safetensors using Fastsafetensor loader: 60% Completed | 24/40 [00:43<00:28, 1.77s/it]
Loading safetensors using Fastsafetensor loader: 62% Completed | 25/40 [00:44<00:25, 1.71s/it]
Loading safetensors using Fastsafetensor loader: 65% Completed | 26/40 [00:46<00:24, 1.73s/it]
Loading safetensors using Fastsafetensor loader: 68% Completed | 27/40 [00:47<00:21, 1.68s/it]
Loading safetensors using Fastsafetensor loader: 70% Completed | 28/40 [00:49<00:20, 1.71s/it]
Loading safetensors using Fastsafetensor loader: 72% Completed | 29/40 [00:51<00:19, 1.80s/it]
Loading safetensors using Fastsafetensor loader: 75% Completed | 30/40 [00:53<00:17, 1.78s/it]
Loading safetensors using Fastsafetensor loader: 78% Completed | 31/40 [00:55<00:15, 1.77s/it]
Loading safetensors using Fastsafetensor loader: 80% Completed | 32/40 [00:56<00:14, 1.78s/it]
Loading safetensors using Fastsafetensor loader: 82% Completed | 33/40 [00:58<00:12, 1.72s/it]
Loading safetensors using Fastsafetensor loader: 85% Completed | 34/40 [01:00<00:10, 1.68s/it]
Loading safetensors using Fastsafetensor loader: 88% Completed | 35/40 [01:01<00:08, 1.65s/it]
Loading safetensors using Fastsafetensor loader: 90% Completed | 36/40 [01:03<00:07, 1.78s/it]
Loading safetensors using Fastsafetensor loader: 92% Completed | 37/40 [01:05<00:05, 1.79s/it]
Loading safetensors using Fastsafetensor loader: 95% Completed | 38/40 [01:07<00:03, 1.78s/it]
Loading safetensors using Fastsafetensor loader: 98% Completed | 39/40 [01:09<00:01, 1.74s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 40/40 [01:10<00:00, 1.72s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 40/40 [01:10<00:00, 1.77s/it]
(EngineCore_DP0 pid=256)
(EngineCore_DP0 pid=256) INFO 02-12 08:21:49 [default_loader.py:293] Loading weights took 70.73 seconds
(EngineCore_DP0 pid=256) INFO 02-12 08:21:49 [fp8.py:495] Using MoEPrepareAndFinalizeNoEP
(EngineCore_DP0 pid=256) INFO 02-12 08:21:49 [gpu_model_runner.py:4221] Model loading took 74.89 GiB memory and 85.793719 seconds
(EngineCore_DP0 pid=256) INFO 02-12 08:21:56 [backends.py:918] Using cache directory: /root/.cache/vllm/torch_compile_cache/124adb2cb9/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=256) INFO 02-12 08:21:56 [backends.py:978] Dynamo bytecode transform time: 5.81 s
(EngineCore_DP0 pid=256) WARNING 02-12 08:21:58 [fused_moe.py:1089] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=256) INFO 02-12 08:22:22 [backends.py:267] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 25.071 s
(EngineCore_DP0 pid=256) INFO 02-12 08:22:22 [monitor.py:34] torch.compile takes 30.88 s in total
(EngineCore_DP0 pid=256) INFO 02-12 08:22:24 [gpu_worker.py:375] Available KV cache memory: 16.74 GiB
(EngineCore_DP0 pid=256) INFO 02-12 08:22:24 [kv_cache_utils.py:1308] GPU KV cache size: 182,784 tokens
(EngineCore_DP0 pid=256) INFO 02-12 08:22:24 [kv_cache_utils.py:1313] Maximum concurrency for 262,144 tokens per request: 2.75x
(EngineCore_DP0 pid=256) 2026-02-12 08:22:25,688 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=256) 2026-02-12 08:22:33,177 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████| 51/51 [02:21<00:00, 2.77s/it]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████| 35/35 [02:46<00:00, 4.76s/it]
(EngineCore_DP0 pid=256) INFO 02-12 08:27:45 [gpu_model_runner.py:5247] Graph capturing finished in 313 secs, took 0.69 GiB
(EngineCore_DP0 pid=256) INFO 02-12 08:27:46 [core.py:278] init engine (profile, create kv cache, warmup model) took 356.60 seconds
(EngineCore_DP0 pid=256) INFO 02-12 08:27:49 [vllm.py:698] Asynchronous scheduling is disabled.
(APIServer pid=116) INFO 02-12 08:27:50 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=116) INFO 02-12 08:27:50 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=116) WARNING 02-12 08:27:50 [model.py:1356] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 40, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=116) INFO 02-12 08:27:50 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=116) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=116) [2026-02-12 08:27:50] WARNING _http.py:779: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=116) INFO 02-12 08:27:50 [serving.py:188] Warming up chat template processing...
(APIServer pid=116) INFO 02-12 08:27:54 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=116) INFO 02-12 08:27:54 [serving.py:213] Chat template warmup completed in 4036.3ms
(APIServer pid=116) INFO 02-12 08:27:55 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=116) INFO 02-12 08:27:55 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8888
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:38] Available routes are:
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=116) INFO 02-12 08:27:55 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=116) INFO: Started server process [116]
(APIServer pid=116) INFO: Waiting for application startup.
(APIServer pid=116) INFO: Application startup complete.
But I only got about 2 tok/s of generation throughput:
❯ docker exec -it vllm_node \
vllm bench serve \
--backend openai-chat \
--base-url http://127.0.0.1:8888 \
--endpoint /v1/chat/completions \
--model "Qwen/Qwen3-Coder-Next-FP8" \
--dataset-name random \
--random-input-len 64 \
--random-output-len 256 \
--num-prompts 200 \
--max-concurrency 8
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0xf33337a3c360>, trust_remote_code=False, seed=0, num_prompts=200, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, skip_chat_template=False, enable_multimodal_chat=False, disable_shuffle=False, custom_output_len=256, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, asr_max_audio_len_sec=inf, asr_min_audio_len_sec=0.0, random_input_len=64, random_output_len=256, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, no_reranker=False, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 1}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='openai-chat', base_url='http://127.0.0.1:8888', host='127.0.0.1', port=8000, endpoint='/v1/chat/completions', header=None, max_concurrency=8, model='Qwen/Qwen3-Coder-Next-FP8', input_len=None, output_len=None, tokenizer=None, tokenizer_mode='auto', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, disable_tqdm=False, num_warmups=0, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics=None, metric_percentiles='99', goodput=None, request_id_prefix='bench-dc3001e4-', top_p=None, top_k=None, min_p=None, temperature=None, frequency_penalty=None, presence_penalty=None, repetition_penalty=None, served_model_name=None, lora_modules=None, ramp_up_strategy=None, 
ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=0, extra_body=None, skip_tokenizer_init=False, insecure=False)
INFO 02-12 08:36:15 [datasets.py:607] Sampling input_len from [64, 64] and output_len from [256, 256]
WARNING: vllm bench serve no longer sets temperature==0 (greedy) in requests by default. The default will be determined on the server side and can be model/API specific. For the old behavior, include --temperature=0.
Starting initial single prompt test run...
Skipping endpoint ready check.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 8
0%| | 0/200 [00:00<?, ?it/s]
(APIServer pid=116) INFO: Application startup complete.
(APIServer pid=116) INFO: 127.0.0.1:46196 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=116) INFO: 127.0.0.1:46196 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO: 127.0.0.1:46200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO: 127.0.0.1:46210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO: 127.0.0.1:46224 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO: 127.0.0.1:46240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO: 127.0.0.1:46256 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO: 127.0.0.1:46258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO: 127.0.0.1:46266 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=116) INFO 02-12 08:37:05 [loggers.py:259] Engine 000: Avg prompt throughput: 14.4 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:25 [loggers.py:259] Engine 000: Avg prompt throughput: 43.2 tokens/s, Avg generation throughput: 0.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 5.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:37:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:05 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:25 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:38:55 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:05 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:15 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:25 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
(APIServer pid=116) INFO 02-12 08:39:45 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
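To put a number on the steady-state rate in the stats lines above, here is a minimal sketch that averages the "Avg generation throughput" values over the intervals where all 8 requests were running. The function name `steady_state_throughput` and the `min_running` parameter are hypothetical, and the regexes assume only the line format visible in this log, so they may need adjusting for other vLLM versions. `SAMPLE_LOG` is an abbreviated excerpt, including the mid-word terminal wraps that some of the pasted lines have.

```python
import re
from statistics import mean

# Patterns assume the stats-line format seen in this log (loggers.py:259).
GEN_RE = re.compile(r"Avg generation throughput: ([\d.]+) tokens/s")
RUN_RE = re.compile(r"Running: (\d+) reqs")


def steady_state_throughput(log_text: str, min_running: int = 8) -> float:
    """Mean generation throughput over stats records that report at least
    `min_running` running requests (hypothetical helper, not part of vLLM)."""
    # The terminal wrapped some lines mid-word ("Ru\nnning", "toke\nns/s"),
    # so drop newlines first and split on the logger tag instead.
    flat = log_text.replace("\n", "")
    samples = []
    for record in flat.split("[loggers.py:")[1:]:
        gen = GEN_RE.search(record)
        running = RUN_RE.search(record)
        if gen and running and int(running.group(1)) >= min_running:
            samples.append(float(gen.group(1)))
    if not samples:
        raise ValueError("no matching stats records found")
    return mean(samples)


# Abbreviated excerpt of the log above, with two wrapped lines kept as pasted.
SAMPLE_LOG = (
    "(APIServer pid=116) INFO 02-12 08:37:05 [loggers.py:259] Engine 000: "
    "Avg prompt throughput: 14.4 tokens/s, Avg generation throughput: 0.2 tokens/s, "
    "Running: 2 reqs, Waiting: 0 reqs\n"
    "(APIServer pid=116) INFO 02-12 08:37:45 [loggers.py:259] Engine 000: "
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.2 toke\n"
    "ns/s, Running: 8 reqs, Waiting: 0 reqs\n"
    "(APIServer pid=116) INFO 02-12 08:38:15 [loggers.py:259] Engine 000: "
    "Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Ru\n"
    "nning: 8 reqs, Waiting: 0 reqs\n"
)

if __name__ == "__main__":
    print(f"{steady_state_throughput(SAMPLE_LOG):.2f} tok/s")
```

Running it against the full server log instead of `SAMPLE_LOG` (for example, text captured with `docker logs vllm_node`) should give a single figure to compare between configurations, such as with and without `--enable-prefix-caching`.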