Is training on 3 DGX Spark nodes without a switch supported?

I tried to train a Hugging Face model using three DGX Spark nodes.

I followed this tutorial:
Connect Three Sparks in a Ring

and this forum post:
Three-node Spark clusters without a switch are now supported in spark-vllm-docker and sparkrun

I verified that nccl-tests ran correctly, with a reported average bus bandwidth of about 24 GB/s.

After that, I tried to start distributed training with the following command:

# example for node 1
export NCCL_SOCKET_IFNAME=enP7s7
export NCCL_DEBUG=INFO
export MASTER_ADDR="192.168.23.12"
export MASTER_PORT="29500"
export WORLD_SIZE="3"
export NODE_RANK="0"

accelerate launch \
  --main_process_ip $MASTER_ADDR \
  --main_process_port $MASTER_PORT \
  --num_machines 3 \
  --machine_rank $NODE_RANK \
  --num_processes $WORLD_SIZE \
  --multi_gpu

However, the training fails because NCCL tries to communicate through NICs that are not physically connected, resulting in a connection timeout.

[rank0]: Call to ibv_modify_qp failed with 110 Connection timed out, on dev rocep1s0f0:1, curr state INIT, next state RTR, local GID index 3, local GID ::ffff:192.168.177.11, remote GID ::ffff:192.168.187.13

Is it currently impossible to run distributed training across all 3 nodes without using a switch?

Did you discover this playbook before? dgx-spark-playbooks/nvidia/connect-three-sparks at main · NVIDIA/dgx-spark-playbooks · GitHub. It has a useful section on netplan configuration; that's what helped me get two nodes running.

Edit: apologies, I see you did. The netplan is what needs to be correct on all three nodes. The provided node-discovery scripts went a bit crazy for me, and I had to set the netplan configuration manually.

You need to specify additional NCCL variables to make it work with 3 nodes, e.g.:

export UCX_NET_DEVICES=enP7s7
export NCCL_SOCKET_IFNAME=enP7s7
export OMPI_MCA_btl_tcp_if_include=enP7s7
export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1 
export NCCL_IB_DISABLE=0
export NCCL_IB_MERGE_NICS=0
export NCCL_NET_PLUGIN=none
export NCCL_IB_SUBNET_AWARE_ROUTING=1

The NCCL_IB_* ones are very important here: they tell the system that you want to use InfiniBand/RoCE rather than TCP/IP, that all subinterfaces should be included (necessary for a mesh), and that NCCL should take multiple subnets into account.
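
Before launching, it may be worth confirming that every device named in NCCL_IB_HCA actually exists on each node. The following is only a sketch: it assumes the device list from the export above and that RDMA devices show up under /sys/class/infiniband, and split_hca_list is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a comma-separated NCCL_IB_HCA value
# into one device name per line.
split_hca_list() {
  echo "$1" | tr ',' '\n'
}

# Device list copied from the NCCL_IB_HCA export above; an already
# exported NCCL_IB_HCA takes precedence.
hca_list="${NCCL_IB_HCA:-rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1}"

# Report, for each listed device, whether the kernel exposes it.
for dev in $(split_hca_list "$hca_list"); do
  if [ -e "/sys/class/infiniband/$dev" ]; then
    echo "$dev: present"
  else
    echo "$dev: not found under /sys/class/infiniband"
  fi
done
```

Run it on each of the three nodes; a device reported as missing on one node would point to a naming difference rather than an NCCL problem.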

Thanks for the detailed explanation!

I already started training with 2 nodes for now because I was in a hurry, but once that finishes I’ll try the 3-node setup again using the environment variables you suggested.

Despite configuring the suggested NCCL environment variables, the training still fails during the initialization phase. The nodes can ping each other over the enP7s7 interface, but NCCL fails when it tries to establish connections via RoCE.

Key error from the logs:

# Node 0 
NCCL WARN Call to ibv_modify_qp failed with 110 Connection timed out, on dev rocep1s0f0:1, curr state INIT, next state RTR, local GID index 3, local GID ::ffff:192.168.177.11, remote GID ::ffff:192.168.187.13

# Node 1
NCCL WARN Call to ibv_modify_qp failed with 110 Connection timed out, on dev rocep1s0f0:1, curr state INIT, next state RTR, local GID index 3, local GID ::ffff:192.168.197.12, remote GID ::ffff:192.168.177.11

# Node 2
NCCL WARN Call to ibv_modify_qp failed with 110 Connection timed out, on dev rocep1s0f1:1, curr state INIT, next state RTR, local GID index 3, local GID ::ffff:192.168.197.13, remote GID ::ffff:192.168.177.12
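
The local/remote GID pairs in these errors can be compared directly. A rough sketch, assuming every link uses a /24 netmask (an assumption, check your netplan); subnet24 and same_subnet24 are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the /24 network prefix of an IPv4 address
# or of an IPv4-mapped GID such as ::ffff:192.168.177.11.
subnet24() {
  local ip="${1##*:}"   # drop any "::ffff:" prefix
  echo "${ip%.*}"       # drop the last octet (assumes a /24 netmask)
}

# Succeeds only if both addresses share the same /24 prefix.
same_subnet24() {
  [ "$(subnet24 "$1")" = "$(subnet24 "$2")" ]
}

# Local/remote GID pairs copied from the three errors above.
pairs="
::ffff:192.168.177.11 ::ffff:192.168.187.13
::ffff:192.168.197.12 ::ffff:192.168.177.11
::ffff:192.168.197.13 ::ffff:192.168.177.12
"

echo "$pairs" | while read -r local_gid remote_gid; do
  [ -z "$local_gid" ] && continue   # skip the blank lines
  if same_subnet24 "$local_gid" "$remote_gid"; then
    echo "$local_gid -> $remote_gid: same /24"
  else
    echo "$local_gid -> $remote_gid: different /24"
  fi
done
```

With subnets matched only between directly cabled ports, a pair that lands in different /24 networks is one NCCL is trying to connect without a direct link.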

It appears NCCL is trying to establish a mesh through interfaces that are not physically or logically reachable. Is there a specific configuration I missed to force NCCL to use only a specific subnet, or should I reconfigure all RoCE IPs to be within the same /24 subnet?

The diagram below illustrates my current IP configuration and network topology. I followed the tutorial and forum posts, which suggested matching subnets only between physically connected NIC pairs. This is why I configured the subnets this way (e.g., matching only specific node-to-node links).

It looks like you are using standard NCCL. You need to follow the steps here to set up an NCCL build that supports a 3-node mesh: spark-vllm-docker/docs/NETWORKING.md at main · eugr/spark-vllm-docker · GitHub, specifically this part:

# Install dependencies and build NCCL
sudo apt-get update && sudo apt-get install -y libopenmpi-dev
git clone -b dgxspark-3node-ring https://github.com/zyang-dev/nccl.git ~/nccl/
cd ~/nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"

# Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
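
As a quick sanity check after the exports above, something like the following can confirm that the custom NCCL directory comes first on LD_LIBRARY_PATH and that the build produced libnccl.so there. It is only a sketch under those assumptions, and first_path_entry is a hypothetical helper name. Whether the training framework actually loads this library depends on how it was built; with NCCL_DEBUG=INFO the NCCL version is logged at initialization, which is one way to tell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first entry of a colon-separated path.
first_path_entry() {
  echo "${1%%:*}"
}

# Assumed to match the exports above.
NCCL_HOME="${NCCL_HOME:-$HOME/nccl/build/}"
first="$(first_path_entry "${LD_LIBRARY_PATH:-}")"

# Compare the first library directory with the custom NCCL build.
if [ "$first" = "$NCCL_HOME/lib" ]; then
  echo "custom NCCL directory is first on LD_LIBRARY_PATH"
else
  echo "first entry is '$first', expected '$NCCL_HOME/lib'"
fi

# List the built libraries, if the build produced any.
ls "$NCCL_HOME"/lib/libnccl.so* 2>/dev/null \
  || echo "no libnccl.so found under $NCCL_HOME/lib"
```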

I really appreciate your guidance on this. It works!!!