Good afternoon @eugr, 2x nodes, GPT-OSS-120b, mxfp4 build works perfectly! Ty so much for helping me (& everyone else). This was my first successful multi-node run and validates the rest of my setup.
For others…
Build:
./build-and-copy.sh \
--exp-mxfp4 \
--rebuild-deps \
--tag harbor.k8s.wm.k8slab/dgx/eugr-vllm-mxfp4:20260130
Launch:
./launch-cluster.sh \
--name eugr-vllm-cluster \
-t harbor.k8s.wm.k8slab/dgx/eugr-vllm-mxfp4:20260130 \
exec \
vllm serve \
/models/gpt-oss-120b \
--port=8000 \
--host=0.0.0.0 \
--gpu-memory-utilization=0.7 \
-tp 2 \
--distributed-executor-backend ray \
--load-format fastsafetensors
Full log:
Auto-detecting interfaces...
Detected IB_IF: rocep1s0f0,roceP2p1s0f0
Detected ETH_IF: enp1s0f0np0
Detected Local IP: 192.168.100.10 (192.168.100.10/31)
Auto-detecting nodes...
Scanning for SSH peers on 192.168.100.10/31...
Found peer: 192.168.100.11
Cluster Nodes: 192.168.100.10,192.168.100.11
Head Node: 192.168.100.10
Worker Nodes: 192.168.100.11
Container Name: eugr-vllm-cluster
Image Name: harbor.k8s.wm.k8slab/dgx/eugr-vllm-mxfp4:20260130
Action: exec
Checking SSH connectivity to worker nodes...
SSH to 192.168.100.11: OK
Starting Head Node on 192.168.100.10...
144664591dc536b1569ac5d63a6446a53af5a78a9732c0ebb97da8c30a853470
Starting Worker Node on 192.168.100.11...
cac6cb26f11d2a43a486891f8215204b06a1ccf11078933fcf31dc690e14ad8b
Waiting for cluster to be ready...
Cluster head is responsive.
Executing command on head node: vllm serve /models/gpt-oss-120b --port=8000 --host=0.0.0.0 --gpu-memory-utilization=0.7 -tp 2 --distributed-executor-backend ray --load-format fastsafetensors
[2026-01-30 21:57:50] INFO font_manager.py:1639: generated new fontManager
(APIServer pid=1161) INFO 01-30 21:57:50 [api_server.py:1278] vLLM API server version 0.1.dev12774+g459541683.d20260130
(APIServer pid=1161) INFO 01-30 21:57:50 [utils.py:253] non-default args: {'model_tag': '/models/gpt-oss-120b', 'host': '0.0.0.0', 'model': '/models/gpt-oss-120b', 'load_format': 'fastsafetensors', 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.7}
(APIServer pid=1161) INFO 01-30 21:57:54 [model.py:538] Resolved architecture: GptOssForCausalLM
(APIServer pid=1161) INFO 01-30 21:57:54 [model.py:1531] Using max model len 131072
(APIServer pid=1161) WARNING 01-30 21:57:54 [vllm.py:1439] Current vLLM config is not set.
(APIServer pid=1161) INFO 01-30 21:57:54 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1161) INFO 01-30 21:57:54 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=1161) INFO 01-30 21:57:54 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=1161) INFO 01-30 21:57:55 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1161) INFO 01-30 21:57:55 [config.py:307] Overriding max cuda graph capture size to 1024 for performance.
(APIServer pid=1161) WARNING 01-30 21:57:55 [vllm.py:621] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=1161) INFO 01-30 21:57:55 [vllm.py:640] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:00 [core.py:96] Initializing a V1 LLM engine (v0.1.dev12774+g459541683.d20260130) with config: model='/models/gpt-oss-120b', speculative_config=None, tokenizer='/models/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=/models/gpt-oss-120b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=1336) WARNING 01-30 21:58:00 [ray_utils.py:334] Tensor parallel size (2) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 2 GPUs available.
(EngineCore_DP0 pid=1336) 2026-01-30 21:58:00,241 INFO worker.py:1821 -- Connecting to existing Ray cluster at address: 192.168.100.10:6379...
(EngineCore_DP0 pid=1336) 2026-01-30 21:58:00,248 INFO worker.py:1998 -- Connected to Ray cluster. View the dashboard at http://192.168.100.10:8265
(EngineCore_DP0 pid=1336) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=1336) warnings.warn(
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:01 [ray_utils.py:399] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=1336) WARNING 01-30 21:58:01 [ray_utils.py:210] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node afc91b073d4242a859cd217499b587964ac4400eb5dfeb77be84eb1c. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_DP0 pid=1336) WARNING 01-30 21:58:01 [ray_utils.py:210] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node da111a2f2abe0db01e307210a7ddae789ed3b94511057fd6a4bac3ac. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:07 [ray_env.py:66] RAY_NON_CARRY_OVER_ENV_VARS from config: set()
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:07 [ray_env.py:69] Copying the following environment variables to workers: ['CUDA_HOME', 'LD_LIBRARY_PATH', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=1336) INFO 01-30 21:58:07 [ray_env.py:74] If certain env vars should NOT be copied, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) WARNING 01-30 21:58:07 [system_utils.py:36] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64'
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) WARNING 01-30 21:58:08 [worker_base.py:296] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:09 [parallel_state.py:1214] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://192.168.100.10:36153 backend=nccl
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:09 [pynccl.py:111] vLLM is using nccl==2.28.9
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) WARNING 01-30 21:58:10 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) WARNING 01-30 21:58:10 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:10 [parallel_state.py:1425] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:11 [gpu_model_runner.py:3804] Starting to load model /models/gpt-oss-120b...
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:22 [flashinfer.py:334] SM12x detected - using native FlashInfer CUTLASS attention instead of TRT-LLM attention (cubins not available for SM12x)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:22 [cuda.py:381] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:22 [selector.py:112] Using HND KV cache layout for FLASHINFER backend.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:22 [mxfp4.py:223] [MXFP4] Auto-selected: cutlass (FlashInfer CUTLASS FP8×FP4 for SM12x)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) WARNING 01-30 21:58:07 [system_utils.py:36] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64'
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) WARNING 01-30 21:58:09 [worker_base.py:296] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:09 [parallel_state.py:1214] world_size=2 rank=1 local_rank=0 distributed_init_method=tcp://192.168.100.10:36153 backend=nccl
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) WARNING 01-30 21:58:10 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.1 not supported, communicator is not available.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) WARNING 01-30 21:58:10 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:10 [parallel_state.py:1425] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/8 [00:00<?, ?it/s]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) warnings.warn(
Loading safetensors using Fastsafetensor loader: 12% Completed | 1/8 [00:03<00:21, 3.03s/it]
Loading safetensors using Fastsafetensor loader: 25% Completed | 2/8 [00:05<00:14, 2.46s/it]
Loading safetensors using Fastsafetensor loader: 38% Completed | 3/8 [00:07<00:11, 2.27s/it]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) warnings.warn(
Loading safetensors using Fastsafetensor loader: 50% Completed | 4/8 [00:08<00:08, 2.05s/it]
Loading safetensors using Fastsafetensor loader: 62% Completed | 5/8 [00:10<00:06, 2.03s/it]
Loading safetensors using Fastsafetensor loader: 75% Completed | 6/8 [00:12<00:04, 2.03s/it]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:38 [default_loader.py:291] Loading weights took 14.52 seconds
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:23 [flashinfer.py:334] SM12x detected - using native FlashInfer CUTLASS attention instead of TRT-LLM attention (cubins not available for SM12x)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:23 [cuda.py:381] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:23 [selector.py:112] Using HND KV cache layout for FLASHINFER backend.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:23 [mxfp4.py:223] [MXFP4] Auto-selected: cutlass (FlashInfer CUTLASS FP8×FP4 for SM12x)
Loading safetensors using Fastsafetensor loader: 88% Completed | 7/8 [00:14<00:01, 1.93s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:16<00:00, 1.84s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:16<00:00, 2.03s/it]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:41 [gpu_model_runner.py:3901] Model loading took 33.0565 GiB memory and 30.153112 seconds
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:46 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/7899cc7b9c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:46 [backends.py:704] Dynamo bytecode transform time: 4.21 s
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:40 [default_loader.py:291] Loading weights took 16.25 seconds
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) [rank0]:W0130 21:58:47.428000 1459 torch/_inductor/utils.py:1725] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 21:58:50 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:41 [gpu_model_runner.py:3901] Model loading took 33.0565 GiB memory and 30.179833 seconds
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:10 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 83.25 s
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:10 [monitor.py:34] torch.compile takes 87.53 s in total
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:46 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/7899cc7b9c/rank_1_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:46 [backends.py:704] Dynamo bytecode transform time: 4.28 s
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 21:58:50 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:12 [gpu_worker.py:356] Available KV cache memory: 45.750000 GiB
(EngineCore_DP0 pid=1336) INFO 01-30 22:00:12 [kv_cache_utils.py:1305] GPU KV cache size: 1,332,688 tokens
(EngineCore_DP0 pid=1336) INFO 01-30 22:00:12 [kv_cache_utils.py:1310] Maximum concurrency for 131,072 tokens per request: 20.00x
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:14 [utils.py:475] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) 2026-01-30 22:00:20,146 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) 2026-01-30 22:00:20,166 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) [rank1]:W0130 21:58:47.479000 383 torch/_inductor/utils.py:1725] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:20 [kernel_warmup.py:64] Warming up FlashInfer attention.
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:10 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 83.28 s
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:10 [monitor.py:34] torch.compile takes 87.49 s in total
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:12 [gpu_worker.py:356] Available KV cache memory: 45.950000 GiB
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=1459) INFO 01-30 22:00:20 [utils.py:475] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 0%| | 0/83 [00:00<?, ?it/s]
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) 2026-01-30 22:00:20,147 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) 2026-01-30 22:00:20,160 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(REMOVED MODEL LOADING ENTRIES...)
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:01:08 [gpu_model_runner.py:4856] Graph capturing finished in 12 secs, took 4.62 GiB
(EngineCore_DP0 pid=1336) (RayWorkerWrapper pid=383, ip=192.168.100.11) INFO 01-30 22:00:20 [kernel_warmup.py:64] Warming up FlashInfer attention.
(EngineCore_DP0 pid=1336) INFO 01-30 22:01:10 [core.py:273] init engine (profile, create kv cache, warmup model) took 148.67 seconds
(EngineCore_DP0 pid=1336) INFO 01-30 22:01:11 [vllm.py:640] Asynchronous scheduling is disabled.
(APIServer pid=1161) INFO 01-30 22:01:11 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=1161) WARNING 01-30 22:01:11 [serving_responses.py:222] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=1161) INFO 01-30 22:01:12 [serving_chat.py:180] Warming up chat template processing...
(APIServer pid=1161) INFO 01-30 22:01:12 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1161) INFO 01-30 22:01:12 [serving_chat.py:216] Chat template warmup completed in 509.7ms
(APIServer pid=1161) INFO 01-30 22:01:12 [api_server.py:1352] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:38] Available routes are:
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1161) INFO 01-30 22:01:12 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1161) INFO: Started server process [1161]
(APIServer pid=1161) INFO: Waiting for application startup.
(APIServer pid=1161) INFO: Application startup complete.
(EngineCore_DP0 pid=1336) INFO 01-30 22:02:30 [ray_executor.py:535] RAY_CGRAPH_get_timeout is set to 300
(EngineCore_DP0 pid=1336) INFO 01-30 22:02:30 [ray_executor.py:539] VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE = auto
(EngineCore_DP0 pid=1336) INFO 01-30 22:02:30 [ray_executor.py:543] VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM = False
(EngineCore_DP0 pid=1336) INFO 01-30 22:02:30 [ray_executor.py:602] Using RayPPCommunicator (which wraps vLLM _PP GroupCoordinator) for Ray Compiled Graph communication.
(APIServer pid=1161) INFO 01-30 22:02:32 [loggers.py:257] Engine 000: Avg prompt throughput: 15.8 tokens/s, Avg generation throughput: 11.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1161) INFO 01-30 22:02:42 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 48.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1161) INFO: 127.0.0.1:53698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1161) INFO 01-30 22:02:52 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1161) INFO 01-30 22:03:02 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
As I mentioned earlier I thought the issue may have been my RoCE configuration so I am sharing /etc/netplan/40-cx7.yaml here in case anyone finds it useful:
Master
network:
version: 2
renderer: networkd
ethernets:
# RoCE Link A (fabric A)
enp1s0f0np0:
dhcp4: no
dhcp6: no
link-local: []
addresses:
- 192.168.100.10/31
optional: true
# Unused / unplugged port
enp1s0f1np1:
dhcp4: no
dhcp6: no
link-local: []
optional: true
# RoCE Link B (fabric B)
enP2p1s0f0np0:
dhcp4: no
dhcp6: no
link-local: []
addresses:
- 192.168.101.0/31
optional: true
# Unused / unplugged port
enP2p1s0f1np1:
dhcp4: no
dhcp6: no
link-local: []
optional: true
Worker
network:
version: 2
renderer: networkd
ethernets:
# RoCE Link A (fabric A)
enp1s0f0np0:
dhcp4: no
dhcp6: no
link-local: []
addresses:
- 192.168.100.11/31
optional: true
# Unused / unplugged port
enp1s0f1np1:
dhcp4: no
dhcp6: no
link-local: []
optional: true
# RoCE Link B (fabric B)
enP2p1s0f0np0:
dhcp4: no
dhcp6: no
link-local: []
addresses:
- 192.168.101.1/31
optional: true
# Unused / unplugged port
enP2p1s0f1np1:
dhcp4: no
dhcp6: no
link-local: []
optional: true