When configuring Netplan to support the PCIe “Socket Direct” splits for ConnectX-7 (CX7) adapters, NVIDIA’s autodiscovery tools (discover-sparks / build-and-copy.sh) incorrectly interpret each logical network interface as a unique physical worker node.
On GB10 systems, a single physical port is split into two logical lanes (interface name prefixes enp1 and enP2), one per PCIe domain, so that traffic can use the bandwidth of both domains. Because both lanes are assigned static IPs on the same subnet, the discovery script identifies the secondary lane’s IP as a separate peer, resulting in a “phantom” 4-node cluster configuration for a 2-node physical setup.
Steps to Reproduce
- Configure Netplan with distinct IPs for both Socket Direct lanes (enp1... and enP2...) on each node.
- Execute the spark build script:
bash ./discover-sparks
Observed Behavior
The autodiscovery log shows that the script scans the subnet and treats the secondary interface IPs (.14 and .15) as independent cluster nodes:
Found: 192.168.100.15 (spark-2.local)
Found: 192.168.100.10 (spark-1.local)
Found: 192.168.100.15 (spark-2.local)
Found: 192.168.100.14 (spark-1.local)
Setting up shared SSH access across all nodes...
You may be prompted for your password on each node.
Configuring 192.168.100.10...
✓ Successfully configured 192.168.100.10 with shared key
Configuring 192.168.100.14...
✓ Successfully configured 192.168.100.14 with shared key
Configuring 192.168.100.15...
✓ Successfully configured 192.168.100.15 with shared key
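One way to see the mismatch is to group the discovered addresses by the hostname reported on each "Found:" line. The sketch below is illustrative only: it parses the log format shown above and is not part of discover-sparks. The function name `group_by_host` is hypothetical, and it assumes the reported hostname uniquely identifies a physical node.

```python
import re
from collections import defaultdict

# Matches lines such as "Found: 192.168.100.15 (spark-2.local)".
FOUND_RE = re.compile(r"^Found:\s+(\S+)\s+\(([^)]+)\)\s*$")

def group_by_host(log_text):
    """Map each reported hostname to the sorted list of unique IPs found for it."""
    hosts = defaultdict(set)
    for line in log_text.splitlines():
        match = FOUND_RE.match(line.strip())
        if match:
            ip, host = match.groups()
            hosts[host].add(ip)
    return {host: sorted(ips) for host, ips in hosts.items()}

# The "Found:" lines copied from the log above.
LOG = """\
Found: 192.168.100.15 (spark-2.local)
Found: 192.168.100.10 (spark-1.local)
Found: 192.168.100.15 (spark-2.local)
Found: 192.168.100.14 (spark-1.local)
"""

nodes = group_by_host(LOG)
for host, ips in sorted(nodes.items()):
    print(host, ips)
print("physical nodes:", len(nodes))
```

Grouped this way, the four "Found:" lines collapse to two hostnames, which matches the physical setup.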
Current Netplan Configuration
network:
  version: 2
  ethernets:
    # Port 1 - Lane A
    enp1s0f0np0:
      addresses: [192.168.100.10/24]
    # Port 1 - Lane B (Socket Direct Split)
    enP2p1s0f0np0:
      addresses: [192.168.100.14/24]
    # Port 2 - Lane A
    enp1s0f1np1:
      addresses: [192.168.200.12/24]
    # Port 2 - Lane B (Socket Direct Split)
    enP2p1s0f1np1:
      addresses: [192.168.200.16/24]
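The subnet overlap between the two lanes of each port can be checked mechanically. The sketch below uses Python's standard `ipaddress` module to group interface addresses by network. The `INTERFACES` dictionary is transcribed by hand from the configuration above, and `shared_subnets` is a hypothetical helper, not part of Netplan or NVIDIA's tooling.

```python
import ipaddress
from collections import defaultdict

# Interface addresses transcribed by hand from the Netplan configuration above.
INTERFACES = {
    "enp1s0f0np0": "192.168.100.10/24",
    "enP2p1s0f0np0": "192.168.100.14/24",
    "enp1s0f1np1": "192.168.200.12/24",
    "enP2p1s0f1np1": "192.168.200.16/24",
}

def shared_subnets(interfaces):
    """Return {network: [interface names]} for networks used by more than one interface."""
    by_net = defaultdict(list)
    for name, cidr in interfaces.items():
        # ip_interface() accepts host/prefix notation and exposes the enclosing network.
        net = ipaddress.ip_interface(cidr).network
        by_net[str(net)].append(name)
    return {net: names for net, names in by_net.items() if len(names) > 1}

for net, names in shared_subnets(INTERFACES).items():
    print(net, "->", ", ".join(names))
```

Any network listed here has more than one local address on it, so a tool that scans the subnet and counts one node per responding IP would likely over-count this host.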
Impact
This misidentification causes MPI and Ray workloads to attempt a 4-way split of the model across only 2 physical GPUs, leading to immediate initialization failures or severe memory over-subscription.