Ghost Node Autodiscovery on Grace Blackwell (GB10) Systems

When configuring Netplan to support the PCIe “Socket Direct” splits for ConnectX-7 (CX7) adapters, NVIDIA’s autodiscovery tools (discover-sparks / build-and-copy.sh) incorrectly interpret each logical network interface as a unique physical worker node.

On GB10 systems, a single physical port is split into two logical lanes (the enp1... and enP2... interfaces) to carry the bandwidth across dual PCIe domains. Because both lanes are assigned static IPs on the same subnet, the discovery script identifies the secondary lane’s IP as a separate peer, resulting in a “phantom” 4-node cluster configuration for a 2-node physical setup.

Steps to Reproduce

  1. Configure Netplan with distinct IPs for both Socket Direct lanes (enp1... and enP2...) on each node.

  2. Execute the spark build script:

    Bash

    bash ./discover-sparks
    
    

Observed Behavior

The autodiscovery log shows that the script scans the subnet and treats the secondary interface IPs (.14 and .15) as independent cluster nodes:

Plaintext

Found: 192.168.100.15 (spark-2.local)
Found: 192.168.100.10 (spark-1.local)
Found: 192.168.100.15 (spark-2.local)
Found: 192.168.100.14 (spark-1.local)

Setting up shared SSH access across all nodes...
You may be prompted for your password on each node.
Configuring 192.168.100.10...
  ✓ Successfully configured 192.168.100.10 with shared key
Configuring 192.168.100.14...
  ✓ Successfully configured 192.168.100.14 with shared key
Configuring 192.168.100.15...
  ✓ Successfully configured 192.168.100.15 with shared key


Current Netplan Configuration

YAML

network:
  version: 2
  ethernets:
    # Port 1 - Lane A
    enp1s0f0np0:
      addresses: [192.168.100.10/24]
    # Port 1 - Lane B (Socket Direct Split)
    enP2p1s0f0np0:
      addresses: [192.168.100.14/24]
    # Port 2 - Lane A
    enp1s0f1np1:
      addresses: [192.168.200.12/24]
    # Port 2 - Lane B (Socket Direct Split)
    enP2p1s0f1np1:
      addresses: [192.168.200.16/24]

Impact

This misidentification causes MPI and Ray workloads to attempt a 4-way split of the model across only 2 physical GPUs, leading to immediate initialization failures or severe memory over-subscription.
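
For illustration, a launcher fed the phantom host list behaves as if there were four workers. A hypothetical mpirun call over the addresses from the log above (the NVIDIA tooling generates its own launch command; run_workload.sh is a placeholder) would look roughly like this:

Bash

# Hypothetical launch built from the four "Found" entries above
# (not the exact command generated by the NVIDIA tooling).
# .10 and .14 are two lanes of the same node, and .15 is listed twice,
# so each physical GPU ends up hosting two ranks of a 4-way model split.
mpirun -np 4 \
  -H 192.168.100.10,192.168.100.14,192.168.100.15,192.168.100.15 \
  ./run_workload.sh   # placeholder workload script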

One walkthrough says to disregard anything with enP2p, while another says to actually use those interfaces in the setup.

Saying to use enP2p: Spark Stacking — DGX Spark User Guide

Saying to ignore it: Connect Two Sparks | DGX Spark

Clarification on the specific configurations to use would be helpful.

@Keyper-AI use nvidia-smi topo -m to see your splits.
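
If it helps, the interface-to-PCIe mapping can also be checked directly in sysfs (standard Linux paths, nothing GB10-specific), which makes the dual-domain split visible:

Bash

# Print the PCIe device path backing each ConnectX interface.
# The enp1... and enP2... names resolve to devices in different PCIe domains.
for i in /sys/class/net/en*; do
    printf '%s -> %s\n' "$(basename "$i")" "$(readlink -f "$i/device")"
done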

If you assign IPs to both “halves”, do not assign them from the same subnet! This is not good networking practice in general; it will just clutter the routing table and won’t play well with many tools.

You can either safely ignore the interface with the capital “P” or assign it an IP from a different subnet. For any NCCL tasks you will use RDMA anyway, so you only need one interface for control traffic; the RoCE/IB interfaces will carry the actual workload.
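
To make that concrete, here is a minimal Netplan sketch along those lines for node 1, reusing the interface names from the configuration above (the 192.168.101.x / 192.168.201.x subnets are placeholders; alternatively, delete the enP2... stanzas entirely):

YAML

network:
  version: 2
  ethernets:
    # Port 1 - Lane A: control interface seen by discover-sparks
    enp1s0f0np0:
      addresses: [192.168.100.10/24]
    # Port 1 - Lane B: its own subnet (or omit this stanza altogether)
    enP2p1s0f0np0:
      addresses: [192.168.101.10/24]
    # Port 2 - Lane A
    enp1s0f1np1:
      addresses: [192.168.200.12/24]
    # Port 2 - Lane B: its own subnet (or omit)
    enP2p1s0f1np1:
      addresses: [192.168.201.12/24]

With one address per subnet per node, the discovery scan should find exactly one peer per physical machine, and NCCL can still be pinned to the RDMA-capable interfaces with NCCL_SOCKET_IFNAME / NCCL_IB_HCA if needed.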