Ghost Node Autodiscovery on Grace Blackwell (GB10) Systems

When configuring Netplan to support the PCIe “Socket Direct” splits for ConnectX-7 (CX7) adapters, NVIDIA’s autodiscovery tools (discover-sparks / build-and-copy.sh) incorrectly interpret each logical network interface as a unique physical worker node.

On GB10 systems, a single physical port is split into two logical lanes (the enp1... and enP2... interfaces) to carry the bandwidth across dual PCIe domains. Because both lanes are assigned static IPs on the same subnet, the discovery script identifies the secondary lane’s IP as a separate peer, resulting in a “phantom” 4-node cluster configuration for a 2-node physical setup.

Steps to Reproduce

  1. Configure Netplan with distinct IPs for both Socket Direct lanes (enp1... and enP2...) on each node.

  2. Execute the spark build script:

    Bash

    bash ./discover-sparks
    
    

Observed Behavior

The autodiscovery log shows that the script scans the subnet and treats the secondary interface IPs (.14 and .15) as independent cluster nodes:

Plaintext

Found: 192.168.100.15 (spark-2.local)
Found: 192.168.100.10 (spark-1.local)
Found: 192.168.100.15 (spark-2.local)
Found: 192.168.100.14 (spark-1.local)

Setting up shared SSH access across all nodes...
You may be prompted for your password on each node.
Configuring 192.168.100.10...
  ✓ Successfully configured 192.168.100.10 with shared key
Configuring 192.168.100.14...
  ✓ Successfully configured 192.168.100.14 with shared key
Configuring 192.168.100.15...
  ✓ Successfully configured 192.168.100.15 with shared key


Current Netplan Configuration

YAML

network:
  version: 2
  ethernets:
    # Port 1 - Lane A
    enp1s0f0np0:
      addresses: [192.168.100.10/24]
    # Port 1 - Lane B (Socket Direct Split)
    enP2p1s0f0np0:
      addresses: [192.168.100.14/24]
    # Port 2 - Lane A
    enp1s0f1np1:
      addresses: [192.168.200.12/24]
    # Port 2 - Lane B (Socket Direct Split)
    enP2p1s0f1np1:
      addresses: [192.168.200.16/24]

Impact

This misidentification causes MPI and Ray workloads to attempt a 4-way split of the model across only 2 physical GPUs, leading to immediate initialization failures or severe memory over-subscription.
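
For illustration, a launcher fed the phantom host list behaves as if there were four workers. A hypothetical mpirun call over the addresses from the log above (the NVIDIA tooling generates its own launch command; run_workload.sh is a placeholder) would look roughly like this:

Bash

# Hypothetical launch built from the four "Found" entries above
# (not the exact command generated by the NVIDIA tooling).
# .10 and .14 are two lanes of the same node, and .15 is listed twice,
# so each physical GPU ends up hosting two ranks of a 4-way model split.
mpirun -np 4 \
  -H 192.168.100.10,192.168.100.14,192.168.100.15,192.168.100.15 \
  ./run_workload.sh   # placeholder workload script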

One walkthrough says to disregard anything with enP2p, while another says to actually use those interfaces in the setup.

Saying to use enP2p: Spark Stacking — DGX Spark User Guide

Saying to ignore it: Connect Two Sparks | DGX Spark

Clarification on the specific configurations to use would be helpful.

@Keyper-AI use nvidia-smi topo -m to see your splits.
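
If it helps, the interface-to-PCIe mapping can also be checked directly in sysfs (standard Linux paths, nothing GB10-specific), which makes the dual-domain split visible:

Bash

# Print the PCIe device path backing each ConnectX interface.
# The enp1... and enP2... names resolve to devices in different PCIe domains.
for i in /sys/class/net/en*; do
    printf '%s -> %s\n' "$(basename "$i")" "$(readlink -f "$i/device")"
done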

If you assign IPs to both “halves”, do not assign them from the same subnet! This is not good networking practice in general; it will just clutter the routing table and won’t play well with many tools.

You can either safely ignore the interface with the capital “P” or assign it an IP from a different subnet. For any NCCL tasks you will use RDMA anyway, so you only need one interface for control traffic; the RoCE/IB interfaces will carry the actual workload.
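
To make that concrete, here is a minimal Netplan sketch along those lines for node 1, reusing the interface names from the configuration above (the 192.168.101.x / 192.168.201.x subnets are placeholders; alternatively, delete the enP2... stanzas entirely):

YAML

network:
  version: 2
  ethernets:
    # Port 1 - Lane A: control interface seen by discover-sparks
    enp1s0f0np0:
      addresses: [192.168.100.10/24]
    # Port 1 - Lane B: its own subnet (or omit this stanza altogether)
    enP2p1s0f0np0:
      addresses: [192.168.101.10/24]
    # Port 2 - Lane A
    enp1s0f1np1:
      addresses: [192.168.200.12/24]
    # Port 2 - Lane B: its own subnet (or omit)
    enP2p1s0f1np1:
      addresses: [192.168.201.12/24]

With one address per subnet per node, the discovery scan should find exactly one peer per physical machine, and NCCL can still be pinned to the RDMA-capable interfaces with NCCL_SOCKET_IFNAME / NCCL_IB_HCA if needed.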