Running Qwen/Qwen3.5-35B-A3B-FP8 on a cluster

@eugr I have three DGX Sparks connected together: spark-01, spark-02, and spark-03.

I want to run Qwen/Qwen3.5-35B-A3B-FP8 in a cluster on spark-02 and spark-03, leaving spark-01 alone to run Intel/Qwen3.5-122B-A10B-int4-AutoRound.

spark-01 can ping and ssh into both spark-02 and spark-03.

I ran this command from spark-01:

./hf-download.sh qwen/qwen3.5-35b-a3b-fp8 -c --copy-parallel

but it only copied the model to spark-02.

Q01: What should I do to ensure hf-download.sh copies the model to both spark-02 and spark-03 when I run it from spark-01?

Q02: What is the correct command to launch Qwen/Qwen3.5-35B-A3B-FP8 on a 2-node cluster comprising spark-02 and spark-03?

I figured out the command for this:

HOSTS="spark-02 spark-03"
./hf-download.sh qwen/qwen3.5-35b-a3b-fp8 -c $HOSTS --copy-parallel
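
To confirm the copy landed on both nodes, one option is to look for the model directory in each node's Hugging Face cache. The sketch below only builds the directory name the Hub cache uses for a repo id (models--ORG--NAME). It assumes hf-download.sh writes to the default cache under ~/.cache/huggingface/hub, which I have not verified; hf_cache_dir is a hypothetical helper name.

```shell
# Hypothetical helper: map a Hub repo id to its cache directory name.
# The Hub cache replaces "/" with "--" and prefixes "models--".
hf_cache_dir() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# Usage (assumes the default cache location on every node):
#   for h in spark-02 spark-03; do
#     ssh "$h" "ls -d ~/.cache/huggingface/hub/$(hf_cache_dir qwen/qwen3.5-35b-a3b-fp8)"
#   done
```

The directory name keeps the case of the repo id as it was typed, so use the same spelling here as in the download command.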

It’s a bit trickier, because you need to make sure spark-02 and spark-03 are connected to each other through the same interface on both nodes. If so, specify that interface explicitly as arguments to ./run-recipe.sh or ./launch-cluster.sh instead of relying on autodiscovery. For example, if you are using enp1s0f0np1 to connect spark-01 to spark-02, but enp1s0f0np0 to connect spark-02 to spark-03 (on both!), you can run:

./run-recipe.sh qwen3.5-35b-a3b-fp8 --eth-if enp1s0f0np0 --ib-if rocep1s0f0,roceP2p1s0f0 -n $NODE_IPs

Make sure you use node IPs that belong to ConnectX interfaces, not 10G ones.
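
To find which ports are actually linked before choosing --eth-if and --ib-if values, a small filter over ibdev2netdev output can help. This is a sketch that assumes ibdev2netdev is installed on the Sparks and prints lines shaped like "rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)"; up_ports is a hypothetical helper name.

```shell
# Hypothetical helper: read ibdev2netdev-style lines on stdin and print
# "<RoCE device> <netdev>" for every port whose link state is (Up).
up_ports() {
  awk '$NF == "(Up)" { print $1, $5 }'
}

# Usage on each node:
#   ibdev2netdev | up_ports
# Then check the address on a reported netdev:
#   ip -4 -o addr show dev <netdev>
```

The first column gives candidates for --ib-if and the second for --eth-if. The IPs passed as node IPs should be the ones assigned to those netdevs.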

I’ve just got my 3rd Spark, so I’ll be able to test various configurations and see how this can be improved.

@eugr I have simplified my setup to just two nodes in a cluster: spark-02 and spark-03

I followed the exact configuration settings, including IP address values, as described in the Running on two sparks playbook.

In terms of physical connectivity, port 0 is left unconnected, and port 1 is used to connect the two Sparks together.

Because I am using Docker user-namespace remapping, I have applied the following patch to the current version of your repo:

diff --git a/launch-cluster.sh b/launch-cluster.sh
index f11ab11..f6d7c80 100755
--- a/launch-cluster.sh
+++ b/launch-cluster.sh
@@ -84,6 +84,7 @@ while [[ "$#" -gt 0 ]]; do
         --name) CONTAINER_NAME="$2"; shift ;;
         --eth-if) ETH_IF="$2"; shift ;;
         --ib-if) IB_IF="$2"; shift ;;
+        --localhost-port) DOCKER_ARGS="$DOCKER_ARGS -p 127.0.0.1:$2:$2"; shift ;;
         -e|--env) DOCKER_ARGS="$DOCKER_ARGS -e $2"; shift ;;
         -j) BUILD_JOBS="$2"; shift ;;
         --apply-mod) MOD_PATHS+=("$2"); shift ;;
@@ -640,7 +641,7 @@ start_cluster() {
     fi
 
     # Build docker run arguments based on mode
-    local docker_args_common="--gpus all -d --rm --network host --name $CONTAINER_NAME $DOCKER_ARGS $IMAGE_NAME"
+    local docker_args_common="--runtime=nvidia -d --rm --name $CONTAINER_NAME $DOCKER_ARGS $IMAGE_NAME"
     local docker_caps_args=""
     local docker_resource_args=""

Here is the command that I run from spark-02:

./launch-cluster.sh \
  -t vllm-node-tf5 \
  --non-privileged \
  --nodes "192.168.200.12 192.168.200.13" \
  --eth-if enp1s0f1np1 \
  --ib-if rocep1s0f1,roceP2p1s0f1 \
  --localhost-port 8888 \
  --apply-mod mods/fix-qwen3-coder-next \
  --apply-mod mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve qwen/qwen3.5-35b-a3b-fp8 \
  --host 0.0.0.0 \
  --port 8888 \
  --max-model-len 262144 \
  --max_num_batched_tokens 16384 \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --chat-template unsloth.jinja \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray

This fails to run with the following error:

Detected Local IP: 192.168.200.12 (192.168.200.12/24)
Error: Local IP (192.168.200.12) is not in the list of nodes (192.168.200.12 192.168.200.13)

This doesn’t make sense: the script detected 192.168.200.12 as the head node, and that value is in the list of nodes, yet it still reports an error.

It’s easier with sparkrun ;-)

Create an explicit named cluster for the target pair:
sparkrun cluster create sparks23 -H spark-02,spark-03

I haven’t tested hostnames much, so IPs are preferred. Hostnames might actually work, but I don’t usually test that. Also note that it’s preferred to use the management (10G interface) IPs rather than the CX7 IPs when making a “cluster” definition.

Run the recipe targeting the named cluster:
sparkrun run qwen3.5-35b-a3b-fp8 --tp 2 --cluster sparks23

The same SSH setup requirements and everything else apply. sparkrun has commands to help with all of those pieces, but I assume you’ve already got that handled.

Use --nodes 192.168.200.12,192.168.200.13 (comma-separated, no spaces).

I changed the node list to be comma-separated. It now launches, but then times out after some time.
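
The earlier "Local IP ... is not in the list of nodes" error is what a comma-delimited membership check would produce for a space-separated list. The sketch below is a hypothetical illustration of that behavior, not code taken from launch-cluster.sh, whose parsing I have not read.

```shell
# Hypothetical illustration: succeed only if $1 is one of the
# comma-separated entries in $2.
ip_in_list() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Comma-separated list: the local IP is found.
ip_in_list 192.168.200.12 "192.168.200.12,192.168.200.13" && echo "found"

# Space-separated list: the whole string is one entry, so no match.
ip_in_list 192.168.200.12 "192.168.200.12 192.168.200.13" || echo "not found"
```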


  Detected Local IP: 192.168.200.12 (192.168.200.12/24)
Head Node: 192.168.200.12
Worker Nodes: 192.168.200.13
Container Name: vllm_node
Image Name: vllm-node-tf5
Action: exec
Checking SSH connectivity to worker nodes...
  SSH to 192.168.200.13: OK
Running in non-privileged mode...
Starting Head Node on 192.168.200.12...
1d7f76d6c8ae1823ac646581b484a624c523ed737f272398bb4e4bc4f8667ba3
Starting Worker Node on 192.168.200.13...
c6d1952bb2e1a2a4b5722015d63229b3730b6c6270511e17be69915c70b4abcc
Applying modifications to cluster nodes...
Applying mod 'fix-qwen3-coder-next' to 192.168.200.12...
  Copying directory content to container...
Successfully copied 9.73kB to vllm_node:/workspace/mods/fix-qwen3-coder-next/
  Running patch script on 192.168.200.12...
Patching Qwen3-Coder-Next crashing on start
patching file vllm/v1/core/single_type_kv_cache_manager.py
Hunk #1 FAILED at 1000.
1 out of 1 hunk FAILED -- saving rejects to file vllm/v1/core/single_type_kv_cache_manager.py.rej
Patch is not applicable, skipping
Reverting PR #34279 that causes slowness
patching file vllm/model_executor/layers/fused_moe/fused_moe.py
Unreversed patch detected!  Ignore -R? [n] 
Apply anyway? [n] 
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file vllm/model_executor/layers/fused_moe/fused_moe.py.rej
Can't revert PR #34279, skipping as it was reverted in recent commits
Fixing Triton allocator bug
Applying mod 'fix-qwen3.5-chat-template' to 192.168.200.12...
  Copying directory content to container...
Successfully copied 11.3kB to vllm_node:/workspace/mods/fix-qwen3.5-chat-template/
  Running patch script on 192.168.200.12...
=======> to apply chat template, use --chat-template unsloth.jinja
Applying mod 'fix-qwen3-coder-next' to 192.168.200.13...
  Copying mod package to 192.168.200.13:/tmp/vllm_mod_pkg_1774023547_15602...
fix_crash.diff                                                                                            100%  712   486.3KB/s   00:00    
fix_slowness.diff                                                                                         100% 2129     3.6MB/s   00:00    
run.sh                                                                                                    100%  959     1.8MB/s   00:00    
_triton_alloc_setup.pth                                                                                   100%   27    40.6KB/s   00:00    
_triton_alloc_setup.py                                                                                    100%  257   165.3KB/s   00:00    
  Copying directory content to container...
  Running patch script on 192.168.200.13...
Patching Qwen3-Coder-Next crashing on start
patching file vllm/v1/core/single_type_kv_cache_manager.py
Hunk #1 FAILED at 1000.
1 out of 1 hunk FAILED -- saving rejects to file vllm/v1/core/single_type_kv_cache_manager.py.rej
Patch is not applicable, skipping
Reverting PR #34279 that causes slowness
patching file vllm/model_executor/layers/fused_moe/fused_moe.py
Unreversed patch detected!  Ignore -R? [n] 
Apply anyway? [n] 
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file vllm/model_executor/layers/fused_moe/fused_moe.py.rej
Can't revert PR #34279, skipping as it was reverted in recent commits
Fixing Triton allocator bug
Applying mod 'fix-qwen3.5-chat-template' to 192.168.200.13...
  Copying mod package to 192.168.200.13:/tmp/vllm_mod_pkg_1774023550_90...
chat_template.jinja                                                                                       100% 7817     6.5MB/s   00:00    
run.sh                                                                                                    100%  144   335.6KB/s   00:00    
  Copying directory content to container...
  Running patch script on 192.168.200.13...
=======> to apply chat template, use --chat-template unsloth.jinja
Starting Ray HEAD node on 192.168.200.12...
Starting Ray WORKER node on 192.168.200.13...
Waiting for cluster to be ready...
Timeout waiting for cluster to start.

Stopping cluster...
Stopping head node (192.168.200.12)...
Stopping worker node (192.168.200.13)...
Cluster stopped.

Q01: What could be the reason for this?

Interesting. Looks like it fails to start the Ray cluster for some reason. Can you try with --no-ray and see if it works? It will use the torch distributed backend instead of Ray.

I am using Docker user-namespace remapping, so this would result in issues with IPC access.

I don’t want to run a Docker container that internally runs processes as root and gains root privileges on the host computer.

I haven’t done too much lately with rootless, but for rootless Ray, what’s the shm size set to? It needs to be sufficiently large to work right…
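
One way to answer the shm question is to read the value Docker recorded for the container. Docker reports HostConfig.ShmSize in bytes and defaults to 64 MiB when --shm-size is not passed. The sketch below converts that number to MiB; bytes_to_mib is a hypothetical helper name, and vllm_node is the container name shown in the launch logs.

```shell
# Hypothetical helper: convert a byte count to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Usage on a node while the container is running:
#   bytes_to_mib "$(docker inspect --format '{{.HostConfig.ShmSize}}' vllm_node)"
# Or look at the mount from inside the container:
#   docker exec vllm_node df -h /dev/shm
```

Ray keeps its object store in /dev/shm, so a container left at the default is worth ruling out when the cluster start times out.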

When I run this command:

./launch-cluster.sh \
  -t vllm-node-tf5 \
  --non-privileged \
  --nodes "192.168.200.12,192.168.200.13" \
  --eth-if enp1s0f1np1 \
  --ib-if rocep1s0f1,roceP2p1s0f1 \
  --localhost-port 8888 \
  --apply-mod mods/fix-qwen3-coder-next \
  --apply-mod mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve qwen/qwen3.5-35b-a3b-fp8 \
  --host 0.0.0.0 \
  --port 8888 \
  --max-model-len 262144 \
  --max_num_batched_tokens 16384 \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --chat-template unsloth.jinja \
  --tensor-parallel-size 2 \
  --no-ray

It still appears to be using Ray and doesn’t distribute the workload.

Starting Ray HEAD node on 192.168.200.12...
Starting Ray WORKER node on 192.168.200.13...
Waiting for cluster to be ready...
Timeout waiting for cluster to start.

Stopping cluster...
Stopping head node (192.168.200.12)...
Stopping worker node (192.168.200.13)...
Cluster stopped.

There has to be a way to enable IPC for the second node so that Ray can communicate with the head node with user-namespace remapping enabled.

Sorry, forgot to mention that --no-ray is a launch-cluster.sh argument, not a vLLM one.

Yeah, I haven’t tested Ray with user-namespace remapping enabled, so I don’t know.
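
In practice this means --no-ray has to appear before the exec keyword. Anything after exec is part of the vllm serve command line. The sketch below is a hypothetical illustration of that split and not the actual parsing code in launch-cluster.sh.

```shell
# Hypothetical illustration: label each argument by whether it falls
# before `exec` (launcher option) or after it (command to run).
split_at_exec() {
  side="launcher"
  for arg in "$@"; do
    if [ "$side" = "launcher" ] && [ "$arg" = "exec" ]; then
      side="command"
      continue
    fi
    printf '%s: %s\n' "$side" "$arg"
  done
}

# --no-ray placed last ends up on the command side:
split_at_exec --nodes "192.168.200.12,192.168.200.13" exec vllm serve --no-ray

# --no-ray placed before exec stays with the launcher:
split_at_exec --nodes "192.168.200.12,192.168.200.13" --no-ray exec vllm serve
```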

Having said that, the native MP backend (torch distributed) is a bit lower level with less overhead, so it might work with your settings.

EDIT: ah, I see now. This won’t work without --network host in the Docker parameters. I see you removed it in your patch.
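
Without --network host the container gets its own network namespace, so the host address 192.168.200.12 is not assigned to any interface inside it, and binding a socket to that address fails with "Cannot assign requested address". A quick check is sketched below. It assumes the ip tool is available inside the image, which I have not confirmed; ip_is_local is a hypothetical helper name.

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and
# succeed only if IPv4 address $1 is assigned to some interface.
ip_is_local() {
  awk -v want="$1" '
    $3 == "inet" { split($4, a, "/"); if (a[1] == want) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage on the head node (container name taken from the launch logs):
#   docker exec vllm_node ip -o -4 addr show | ip_is_local 192.168.200.12 \
#     && echo "address visible in container" \
#     || echo "address not visible in container"
```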

When I launch the cluster with the --no-ray option, it progresses a bit but stops.

./launch-cluster.sh \
  -t vllm-node-tf5 \
  --name vllm_node \
  --non-privileged \
  --nodes "192.168.200.12,192.168.200.13" \
  --eth-if enp1s0f1np1 \
  --ib-if rocep1s0f1,roceP2p1s0f1 \
  --localhost-port 8888 \
  --apply-mod mods/fix-qwen3-coder-next \
  --apply-mod mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  --no-ray \
  exec vllm serve qwen/qwen3.5-35b-a3b-fp8 \
  --host 0.0.0.0 \
  --port 8888 \
  --max-model-len 262144 \
  --max_num_batched_tokens 16384 \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --chat-template unsloth.jinja \
  --tensor-parallel-size 2

Output:

Detected Local IP: 192.168.200.12 (192.168.200.12/24)
Head Node: 192.168.200.12
Worker Nodes: 192.168.200.13
Container Name: vllm_node
Image Name: vllm-node-tf5
Action: exec
Checking SSH connectivity to worker nodes...
  SSH to 192.168.200.13: OK
Running in non-privileged mode...
Starting Head Node on 192.168.200.12...
4efdf734e9d2d695499330f1d0725b19c4de92cb25abd1e3d111cae6e7c812ce
Starting Worker Node on 192.168.200.13...
d08fe7c566189b1c170d11ca9daa9755feb3090ff8560007372783b8d41c03be
Applying modifications to cluster nodes...
Applying mod 'fix-qwen3-coder-next' to 192.168.200.12...
  Copying directory content to container...
Successfully copied 9.73kB to vllm_node:/workspace/mods/fix-qwen3-coder-next/
  Running patch script on 192.168.200.12...
Patching Qwen3-Coder-Next crashing on start
patching file vllm/v1/core/single_type_kv_cache_manager.py
Hunk #1 FAILED at 1000.
1 out of 1 hunk FAILED -- saving rejects to file vllm/v1/core/single_type_kv_cache_manager.py.rej
Patch is not applicable, skipping
Reverting PR #34279 that causes slowness
patching file vllm/model_executor/layers/fused_moe/fused_moe.py
Unreversed patch detected!  Ignore -R? [n] 
Apply anyway? [n] 
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file vllm/model_executor/layers/fused_moe/fused_moe.py.rej
Can't revert PR #34279, skipping as it was reverted in recent commits
Fixing Triton allocator bug
Applying mod 'fix-qwen3.5-chat-template' to 192.168.200.12...
  Copying directory content to container...
Successfully copied 11.3kB to vllm_node:/workspace/mods/fix-qwen3.5-chat-template/
  Running patch script on 192.168.200.12...
=======> to apply chat template, use --chat-template unsloth.jinja
Applying mod 'fix-qwen3-coder-next' to 192.168.200.13...
  Copying mod package to 192.168.200.13:/tmp/vllm_mod_pkg_1774026108_31933...
fix_crash.diff                                                                                            100%  712     1.0MB/s   00:00    
fix_slowness.diff                                                                                         100% 2129     2.4MB/s   00:00    
run.sh                                                                                                    100%  959     1.3MB/s   00:00    
_triton_alloc_setup.pth                                                                                   100%   27    47.5KB/s   00:00    
_triton_alloc_setup.py                                                                                    100%  257   350.5KB/s   00:00    
  Copying directory content to container...
  Running patch script on 192.168.200.13...
Patching Qwen3-Coder-Next crashing on start
patching file vllm/v1/core/single_type_kv_cache_manager.py
Hunk #1 FAILED at 1000.
1 out of 1 hunk FAILED -- saving rejects to file vllm/v1/core/single_type_kv_cache_manager.py.rej
Patch is not applicable, skipping
Reverting PR #34279 that causes slowness
patching file vllm/model_executor/layers/fused_moe/fused_moe.py
Unreversed patch detected!  Ignore -R? [n] 
Apply anyway? [n] 
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file vllm/model_executor/layers/fused_moe/fused_moe.py.rej
Can't revert PR #34279, skipping as it was reverted in recent commits
Fixing Triton allocator bug
Applying mod 'fix-qwen3.5-chat-template' to 192.168.200.13...
  Copying mod package to 192.168.200.13:/tmp/vllm_mod_pkg_1774026110_2731...
chat_template.jinja                                                                                       100% 7817     8.3MB/s   00:00    
run.sh                                                                                                    100%  144   245.4KB/s   00:00    
  Copying directory content to container...
  Running patch script on 192.168.200.13...
=======> to apply chat template, use --chat-template unsloth.jinja
Executing command: vllm serve qwen/qwen3.5-35b-a3b-fp8 --host 0.0.0.0 --port 8888 --max-model-len 262144 --max_num_batched_tokens 16384 --gpu-memory-utilization 0.7 --enable-auto-tool-choice --tool-call-parser qwen3_coder --kv-cache-dtype fp8 --load-format fastsafetensors --attention-backend flashinfer --enable-prefix-caching --chat-template unsloth.jinja --tensor-parallel-size 2 
Launching worker (rank 1) on 192.168.200.13...
Executing command on head node (rank 0): vllm serve qwen/qwen3.5-35b-a3b-fp8 --host 0.0.0.0 --port 8888 --max-model-len 262144 --max_num_batched_tokens 16384 --gpu-memory-utilization 0.7 --enable-auto-tool-choice --tool-call-parser qwen3_coder --kv-cache-dtype fp8 --load-format fastsafetensors --attention-backend flashinfer --enable-prefix-caching --chat-template unsloth.jinja --tensor-parallel-size 2  --nnodes 2 --node-rank 0 --master-addr 192.168.200.12 --master-port 29501
(APIServer pid=246) INFO 03-20 17:02:02 [utils.py:297] 
(APIServer pid=246) INFO 03-20 17:02:02 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=246) INFO 03-20 17:02:02 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.2rc1.dev7+g9c7cab5eb.d20260317
(APIServer pid=246) INFO 03-20 17:02:02 [utils.py:297]   █▄█▀ █     █     █     █  model   qwen/qwen3.5-35b-a3b-fp8
(APIServer pid=246) INFO 03-20 17:02:02 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=246) INFO 03-20 17:02:02 [utils.py:297] 
(APIServer pid=246) INFO 03-20 17:02:02 [utils.py:233] non-default args: {'model_tag': 'qwen/qwen3.5-35b-a3b-fp8', 'chat_template': 'unsloth.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'port': 8888, 'model': 'qwen/qwen3.5-35b-a3b-fp8', 'max_model_len': 262144, 'load_format': 'fastsafetensors', 'attention_backend': 'flashinfer', 'master_addr': '192.168.200.12', 'nnodes': 2, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_batched_tokens': 16384}
(APIServer pid=246) WARNING 03-20 17:02:02 [envs.py:1724] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=246) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=246) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=246) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=246) INFO 03-20 17:02:12 [model.py:533] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=246) INFO 03-20 17:02:12 [model.py:1582] Using max model len 262144
(APIServer pid=246) INFO 03-20 17:02:12 [cache.py:212] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=246) INFO 03-20 17:02:13 [arg_utils.py:1659] Inferred data_parallel_rank 0 from node_rank 0
(APIServer pid=246) INFO 03-20 17:02:13 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=16384.
(APIServer pid=246) WARNING 03-20 17:02:13 [config.py:372] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=246) INFO 03-20 17:02:13 [config.py:392] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=246) INFO 03-20 17:02:13 [config.py:212] Setting attention block size to 2096 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=246) INFO 03-20 17:02:13 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=246) INFO 03-20 17:02:13 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=453) INFO 03-20 17:02:54 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev7+g9c7cab5eb.d20260317) with config: model='qwen/qwen3.5-35b-a3b-fp8', speculative_config=None, tokenizer='qwen/qwen3.5-35b-a3b-fp8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen/qwen3.5-35b-a3b-fp8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [16384], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=453) WARNING 03-20 17:02:54 [multiproc_executor.py:997] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=453) INFO 03-20 17:02:54 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=192.168.200.12, mq_connect_ip=192.168.200.12 (local), world_size=2, local_world_size=1
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099] EngineCore failed to start.
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     super().__init__(
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 101, in __init__
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     super().__init__(vllm_config)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     self._init_executor()
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 145, in _init_executor
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     self.rpc_broadcast_mq = MessageQueue(
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]                             ^^^^^^^^^^^^^
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 422, in __init__
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     self.remote_socket.bind(socket_addr)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/zmq/sugar/socket.py", line 320, in bind
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     super().bind(addr)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "zmq/backend/cython/_zmq.py", line 1009, in zmq.backend.cython._zmq.Socket.bind
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     _check_rc(rc)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     ^^^^^^^^^^^
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]   File "zmq/backend/cython/_zmq.py", line 190, in zmq.backend.cython._zmq._check_rc
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     raise ZMQError(errno)
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099]     ^^^^^^^^^^^
(EngineCore pid=453) ERROR 03-20 17:02:54 [core.py:1099] zmq.error.ZMQError: Cannot assign requested address (addr='tcp://192.168.200.12:57087')
(EngineCore pid=453) Process EngineCore:
(EngineCore pid=453) Traceback (most recent call last):
(EngineCore pid=453)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=453)     self.run()
(EngineCore pid=453)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=453)     self._target(*self._args, **self._kwargs)
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=453)     raise e
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=453)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=453)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=453)     return func(*args, **kwargs)
(EngineCore pid=453)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=453)     super().__init__(
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
(EngineCore pid=453)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=453)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 101, in __init__
(EngineCore pid=453)     super().__init__(vllm_config)
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=453)     return func(*args, **kwargs)
(EngineCore pid=453)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore pid=453)     self._init_executor()
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 145, in _init_executor
(EngineCore pid=453)     self.rpc_broadcast_mq = MessageQueue(
(EngineCore pid=453)                             ^^^^^^^^^^^^^
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 422, in __init__
(EngineCore pid=453)     self.remote_socket.bind(socket_addr)
(EngineCore pid=453)   File "/usr/local/lib/python3.12/dist-packages/zmq/sugar/socket.py", line 320, in bind
(EngineCore pid=453)     super().bind(addr)
(EngineCore pid=453)   File "zmq/backend/cython/_zmq.py", line 1009, in zmq.backend.cython._zmq.Socket.bind
(EngineCore pid=453)     _check_rc(rc)
(EngineCore pid=453)     ^^^^^^^^^^^
(EngineCore pid=453)   File "zmq/backend/cython/_zmq.py", line 190, in zmq.backend.cython._zmq._check_rc
(EngineCore pid=453)     raise ZMQError(errno)
(EngineCore pid=453)     ^^^^^^^^^^^
(EngineCore pid=453) zmq.error.ZMQError: Cannot assign requested address (addr='tcp://192.168.200.12:57087')
(APIServer pid=246) Traceback (most recent call last):
(APIServer pid=246)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=246)     sys.exit(main())
(APIServer pid=246)              ^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=246)     args.dispatch_function(args)
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=246)     uvloop.run(run_server(args))
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=246)     return __asyncio.run(
(APIServer pid=246)            ^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=246)     return runner.run(main)
(APIServer pid=246)            ^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=246)     return self._loop.run_until_complete(task)
(APIServer pid=246)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=246)     return await main
(APIServer pid=246)            ^^^^^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=246)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=246)     async with build_async_engine_client(
(APIServer pid=246)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=246)     return await anext(self.gen)
(APIServer pid=246)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=246)     async with build_async_engine_client_from_engine_args(
(APIServer pid=246)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=246)     return await anext(self.gen)
(APIServer pid=246)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=246)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=246)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=246)     return cls(
(APIServer pid=246)            ^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=246)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=246)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=246)     return func(*args, **kwargs)
(APIServer pid=246)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=246)     return AsyncMPClient(*client_args)
(APIServer pid=246)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=246)     return func(*args, **kwargs)
(APIServer pid=246)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=246)     super().__init__(
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=246)     with launch_core_engines(
(APIServer pid=246)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=246)     next(self.gen)
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=246)     wait_for_engine_startup(
(APIServer pid=246)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=246)     raise RuntimeError(
(APIServer pid=246) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Stopping cluster...
Stopping head node (192.168.200.12)...
Stopping worker node (192.168.200.13)...
Cluster stopped.

I’ll try to troubleshoot this by manually launching ray and attempting to get ray to communicate between two docker containers with usernamespace remapping enabled first, before trying to run vLLM.

You need to bring --network host back into the launch-cluster.sh code. Without it, the container can’t bind to a port on a host interface, which is what the "Cannot assign requested address" error above is telling you.

I mean this change that you made earlier. Or just use the stock launch-cluster.sh without any changes, or use sparkrun.

This one.
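
To check whether a container can bind to a host address such as the 192.168.200.12 from the ZMQError above, you can look for that address in the container's own network namespace. This is only a minimal sketch: ip_is_local is a hypothetical helper name, and it assumes the image ships the iproute2 ip tool, which is not confirmed in this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `ip -o -4 addr show` style output on stdin and
# succeeds only if the given IPv4 address is assigned to some interface.
# In that output the 4th field is ADDRESS/PREFIX, e.g. 192.168.200.12/24.
ip_is_local() {
  local want="$1"
  awk -v want="$want" '
    { split($4, parts, "/"); if (parts[1] == want) found = 1 }
    END { exit !found }
  '
}

# Example, run inside the container (assumes `ip` is installed there):
#   ip -o -4 addr show | ip_is_local 192.168.200.12 \
#     && echo "address is present, bind should work" \
#     || echo "address is missing from this network namespace"
```

If the address is missing, the container is not sharing the host network stack, which is the situation --network host fixes.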

@eugr I have managed to manually get ray head running on spark-02 and ray worker running on spark-03, with user namespace remapping disabled.

For a system with user namespace remapping enabled, this is the patch I applied to temporarily disable it by passing the --userns=host argument.

diff --git a/launch-cluster.sh b/launch-cluster.sh
index f11ab11..5341d95 100755
--- a/launch-cluster.sh
+++ b/launch-cluster.sh
@@ -640,7 +641,7 @@ start_cluster() {
     fi
 
     # Build docker run arguments based on mode
-    local docker_args_common="--gpus all -d --rm --network host --name $CONTAINER_NAME $DOCKER_ARGS $IMAGE_NAME"
+    local docker_args_common="--userns=host --network host --runtime=nvidia -d --rm --name $CONTAINER_NAME $DOCKER_ARGS $IMAGE_NAME"
     local docker_caps_args=""
     local docker_resource_args=""

I am running this command from spark-02

./launch-cluster.sh \
  -t vllm-node-tf5 \
  --non-privileged \
  --nodes "192.168.200.12,192.168.200.13" \
  --eth-if enp1s0f1np1 \
  --ib-if rocep1s0f1,roceP2p1s0f1 \
  --localhost-port 8888 \
  --apply-mod mods/fix-qwen3-coder-next \
  --apply-mod mods/fix-qwen3.5-chat-template \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  exec vllm serve qwen/qwen3.5-35b-a3b-fp8 \
  --host 0.0.0.0 \
  --port 8888 \
  --max-model-len 262144 \
  --max_num_batched_tokens 16384 \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --chat-template unsloth.jinja \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray

This appears to launch the services:

Starting Ray HEAD node on 192.168.200.12...
Starting Ray WORKER node on 192.168.200.13...
Waiting for cluster to be ready...
Cluster head is responsive.
Executing command: vllm serve qwen/qwen3.5-35b-a3b-fp8 --host 0.0.0.0 --port 8888 --max-model-len 262144 --max_num_batched_tokens 16384 --gpu-memory-utilization 0.7 --enable-auto-tool-choice --tool-call-parser qwen3_coder --kv-cache-dtype fp8 --load-format fastsafetensors --attention-backend flashinfer --enable-prefix-caching --chat-template unsloth.jinja --tensor-parallel-size 2 --distributed-executor-backend ray

Command for manually launching ray head and ray worker

Q02: What would the vllm serve command look like for the ray head node running on spark-02?

Q03: What would the vllm serve command look like for the ray worker node running on spark-03?

Do I just simply run the same command on both nodes? When manually launching the ray head node, you have to run ray start with the --head argument.

Will the following command take care of launching the ray head on spark-02 and the ray worker on spark-03 if I run the same command inside both docker containers?

vllm serve qwen/qwen3.5-35b-a3b-fp8 --host 0.0.0.0 --port 8888 --max-model-len 262144 --max_num_batched_tokens 16384 --gpu-memory-utilization 0.7 --enable-auto-tool-choice --tool-call-parser qwen3_coder --kv-cache-dtype fp8 --load-format fastsafetensors --attention-backend flashinfer --enable-prefix-caching --chat-template unsloth.jinja --tensor-parallel-size 2 --distributed-executor-backend ray

Adding this: --network host should have solved your connectivity issue for Ray without any additional arguments.

If you can launch Ray manually, launch-cluster should be able to do it as well. If not, can you post the exact commands you run to launch those services manually and their output? I assume you launch them inside a docker container? I’m a bit confused regarding your setup.

@eugr

I am able to use your launch-cluster.sh script on a 2-node cluster with user namespace remapping disabled, by passing the --userns=host argument to docker run.

Everything works as intended with usernamespace-remapping disabled.


I just want to know what to do if I manually launch the docker containers with the following commands (with user namespace remapping still disabled), separately on spark-02 and spark-03:

# for spark-02: temporarily disable usernamespace-remapping
export IP_NODE_02="192.168.200.12"
docker run -it \
  --userns=host \
  --network host \
  --ipc host \
  --runtime=nvidia \
  --shm-size=2gb \
  -e DISPLAY \
  -e QT_GRAPHICSSYSTEM=native \
  -e QT_X11_NO_MITSHM=1 \
  -e VLLM_HOST_IP=$IP_NODE_02 \
  -p $IP_NODE_02:8888:8888 \
  -p $IP_NODE_02:6379:6379 \
  -p $IP_NODE_02:8265:8265 \
  -p $IP_NODE_02:8076:8076 \
  -v /dev/shm:/dev/shm \
  -v /etc/localtime:/etc/localtime:ro \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket:ro \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -v $HOME/mount/project/software/infrastructure/docker/eugr/spark-vllm-docker:/tmp/spark-vllm-docker \
  --rm \
  --name vllm-node-t5-ray-head \
  vllm-node-tf5:latest bash
# for spark-03
export IP_NODE_03="192.168.200.13"
docker run -it \
  --userns=host \
  --network host \
  --ipc host \
  --runtime=nvidia \
  -e DISPLAY \
  -e QT_GRAPHICSSYSTEM=native \
  -e QT_X11_NO_MITSHM=1 \
  -e VLLM_HOST_IP=$IP_NODE_03 \
  -p $IP_NODE_03:8888:8888 \
  -p $IP_NODE_03:6379:6379 \
  -p $IP_NODE_03:10001:10001 \
  -p $IP_NODE_03:8265:8265 \
  -p $IP_NODE_03:8076:8076 \
  -p $IP_NODE_03:10002-10012:10002-10012 \
  -v /dev/shm:/dev/shm \
  -v /etc/localtime:/etc/localtime:ro \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket:ro \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -v $HOME/mount/project/software/infrastructure/docker/eugr/spark-vllm-docker:/tmp/spark-vllm-docker \
  --rm \
  --name vllm-node-t5-ray-worker \
  vllm-node-tf5:latest bash

Do I just simply run the following command, identically on both spark-02 and spark-03?

vllm serve qwen/qwen3.5-35b-a3b-fp8 \
  --host 0.0.0.0 \
  --port 8888 \
  --max-model-len 262144 \
  --max_num_batched_tokens 16384 \
  --gpu-memory-utilization 0.7 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype fp8 \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --chat-template unsloth.jinja \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray

i.e. if I run the above command to manually launch vllm serve, how will it know which node to launch as the ray head node?

Ray doesn’t work like that. You need to launch Ray on the head and worker nodes separately via the ray command, and then launch vLLM ONLY on the head node. It will connect to Ray and distribute itself across the nodes.

But just launching Ray is not enough. launch-cluster.sh does a few more things in the background; the most important is setting up various environment variables properly. So if you want to launch Ray manually, you will need to do it like this.

Start Ray on head node:

export VLLM_HOST_IP=192.168.177.11

export MN_IF_NAME=enp1s0f1np1
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export TP_SOCKET_IFNAME=$MN_IF_NAME
export RAY_memory_monitor_refresh_ms=0
ray start --head --port 6379 --node-ip-address $VLLM_HOST_IP

On worker node:

export VLLM_HOST_IP=192.168.177.12
export RAY_NODE_IP_ADDRESS=$VLLM_HOST_IP
export RAY_OVERRIDE_NODE_IP_ADDRESS=$VLLM_HOST_IP
export MN_IF_NAME=enp1s0f1np1
export UCX_NET_DEVICES=$MN_IF_NAME
export NCCL_SOCKET_IFNAME=$MN_IF_NAME
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
export OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME
export GLOO_SOCKET_IFNAME=$MN_IF_NAME
export TP_SOCKET_IFNAME=$MN_IF_NAME
export RAY_memory_monitor_refresh_ms=0
ray start --address=192.168.177.11:6379 --node-ip-address $VLLM_HOST_IP
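
Once both ray start commands have returned, vLLM gets launched on the head node only. Before doing that, it can help to wait until both nodes have actually joined. Below is a minimal sketch of such a wait. wait_for_ray_nodes, ray_alive_nodes and POLL_INTERVAL are hypothetical names, not part of launch-cluster.sh. The node count comes from Ray's Python API (ray.init(address="auto") and ray.nodes()), assuming python3 with Ray is available inside the head container.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print how many Ray nodes are alive, as seen from here.
# Assumes a Ray cluster is already running and reachable via address="auto".
ray_alive_nodes() {
  python3 -c '
import ray
ray.init(address="auto")
print(sum(1 for n in ray.nodes() if n["Alive"]))
' 2>/dev/null | tail -n 1
}

# Hypothetical helper: poll until at least EXPECTED nodes are alive.
# Usage: wait_for_ray_nodes EXPECTED ATTEMPTS COUNT_COMMAND [ARGS...]
# COUNT_COMMAND must print a single integer; anything else counts as 0.
wait_for_ray_nodes() {
  local expected="$1" attempts="$2"
  shift 2
  local i=0 count
  while [ "$i" -lt "$attempts" ]; do
    count="$("$@")" || count=0
    case "$count" in
      '' | *[!0-9]*) count=0 ;;
    esac
    if [ "$count" -ge "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  return 1
}

# Example, on the head node only:
#   wait_for_ray_nodes 2 30 ray_alive_nodes && echo "both nodes joined"
```

After this succeeds, the same vllm serve command shown earlier (with --tensor-parallel-size 2 and --distributed-executor-backend ray) is run once, inside the head container, with the same environment variables still exported.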