NCCL bandwidth capped at 3 GB/s, GPU PCIe topology reports Gen1 x1 on DGX Spark FE

NCCL caps at ~3 GB/s bus bandwidth between two separate DGX Spark FE units, roughly 8x below the 22-24 GB/s others report on similar hardware. Tracked it to sysfs reporting the GB10 GPU’s PCIe link as Gen1 x1, which NCCL’s topology cost model uses as its ring bandwidth ceiling. Raw RDMA (ib_write_bw) shows 109 Gbps on the same link, so the hardware is fine.

Hardware / software:

  • 2x DGX Spark FE, both on SoC FW 0x0200941a, EC 0x02004e18
  • Kernel 6.17.0-1014-nvidia, CUDA 13.0
  • Driver: Spark 1 = 580.142, Spark 2 = 580.126.09 (results are identical on both, so the driver version isn’t the variable)
  • ConnectX-7 FW 28.45.4028, MTU 9000 on all 4 logical interfaces
  • Tested with NCCL 2.28.9 and 2.29.7, same result

Symptom:

ib_write_bw -d roceP2p1s0f1 -a --report_gbits → 109 Gbps OK
all_gather_perf (1-4 GiB) → 3.02 GB/s busbw

NCCL’s own topology detection on the 2.29.7 baseline run:

=== System : maxBw 12.0 totalBw 3.0 ===
Pattern 4, crossNic 0, nChannels 8, bw 3.000000/3.000000, type LOC/P2C

The GPU’s PCIe link shows as Gen1 x1 in sysfs:

$ cat /sys/bus/pci/devices/000f:01:00.0/{current,max}_link_{speed,width}
2.5 GT/s PCIe
2.5 GT/s PCIe
1
16
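
For anyone comparing units, the same four attributes can be printed with labels so there is no doubt which value is which. This is a minimal sketch, assuming the standard sysfs attribute names (current_link_speed, max_link_speed, current_link_width, max_link_width); the function name pcie_link_report is made up for illustration, and the device directory is passed in as an argument.

```bash
# Print labeled PCIe link attributes for one PCI device directory.
# Usage: pcie_link_report /sys/bus/pci/devices/000f:01:00.0
pcie_link_report() {
    dev_dir=$1
    for attr in current_link_speed max_link_speed current_link_width max_link_width; do
        if [ -r "$dev_dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$dev_dir/$attr")"
        else
            # Attribute missing or unreadable on this device.
            printf '%s: (not available)\n' "$attr"
        fi
    done
}
```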

$ nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
1, 1, 1, 16

max_link_gen is reported as 1, not just current. NCCL reads this from sysfs and its topology ceiling of totalBw 3.0 matches the 3.02 GB/s busbw result exactly.
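
Both totalBw figures in this thread (3.0 and 48.0) are reproduced by a simple width-times-rate rule in which the per-lane value doubles with each PCIe generation. The sketch below is a back-of-the-envelope check, not NCCL code: the constants 15 and 80 are an assumption, chosen because they reproduce the two logged values, and pcie_topo_bw is a hypothetical helper name.

```bash
# Estimate a PCIe link bandwidth figure (GB/s) from generation and lane count.
# ASSUMPTION: the per-lane value is 15 for Gen1 and doubles each generation,
# and the result is width * value / 80. The constants were picked to match
# the totalBw values seen in the logs, not taken from NCCL documentation.
# Usage: pcie_topo_bw <gen> <width>
pcie_topo_bw() {
    awk -v gen="$1" -v width="$2" \
        'BEGIN { printf "%.1f\n", width * 15 * 2 ^ (gen - 1) / 80 }'
}

# pcie_topo_bw 1 16   -> 3.0   (Gen1, width 16)
# pcie_topo_bw 5 16   -> 48.0  (Gen5 / 32.0 GT/s, width 16)
```

Under the same rule, Gen1 at width 1 would come out far lower, which suggests the 3.0 figure reflects the reported maximum width of 16 rather than the current width.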

Tried overriding with NCCL_TOPO_FILE and setting link_speed="32.0 GT/s PCIe" on the GPU entry. NCCL’s planning responds as expected:

=== System : maxBw 12.0 totalBw 48.0 ===
Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/P2C
Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/P2C

The override works, but actual throughput is still stuck around 1.5 GB/s per NIC regardless of channel count (NCCL_MIN_NCHANNELS=16 doesn’t help). Something below NCCL is also capped.
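
For reference, the override can be produced by editing a dumped topology file (NCCL_TOPO_DUMP_FILE writes one) rather than by hand. The following is a sketch only, assuming the GPU appears as a pci element with its busid and link_speed attributes on the same line; set_topo_link_speed is a hypothetical helper name and the file names in the usage comment are placeholders.

```bash
# Rewrite link_speed on the <pci> line whose busid matches.
# Reads topology XML on stdin and writes the edited XML to stdout.
# ASSUMPTION: busid and link_speed sit on the same line of the <pci> element.
# Note: dots in the bus ID are treated as regex wildcards, which is harmless here.
# Usage: set_topo_link_speed 000f:01:00.0 "32.0 GT/s PCIe" < dumped.xml > fixed.xml
set_topo_link_speed() {
    busid=$1
    speed=$2
    # '|' is the substitution delimiter because the speed string contains '/'.
    sed -E "/<pci busid=\"$busid\"/ s|link_speed=\"[^\"]*\"|link_speed=\"$speed\"|"
}
```

As noted above, this only changes what NCCL plans; it did not change the measured throughput.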

Already ruled out:

  • SoC firmware regression (on current 0x0200941a, verified on both units)
  • MTU mismatch (9000 on all 4 logical interfaces)
  • GID index (tests pass with NCCL_IB_GID_INDEX set explicitly)
  • NCCL version (2.28.9 and 2.29.7 identical)
  • Driver version (different on each unit, identical results)

Separately: all four mlx5_core devices report “Write combining is not supported” in dmesg. I think that’s the Grace CPU issue the NEON-based mlx5 WC test patch addresses (submitted to LKML in Sep 2025). Not sure if that fix is in 6.17.0-1014-nvidia or if I need 6.18+.

Questions:

  1. Is the GPU’s PCIe max_link_speed being reported as Gen1 x1 on GB10 expected, or is it a kernel/driver bug? Is there anything userspace can do to correct it?
  2. For users getting 22-24 GB/s on similar hardware, what kernel / NCCL combo are you on?
  3. Is the mlx5 NEON WC patch in any shipping kernel yet, or is 6.18+ required?

Can share full NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH output and the topology XML override if it helps.

The PCIe link is reported as Gen1 x1 because the GPU communicates with the CPU over a C2C link for high-speed data, not over the PCIe lanes.
To help triage your issue, can you send an nvidia-bug-report and the full output with NCCL_DEBUG=INFO?

Thanks for the confirmation on the C2C path — that matches what we see in nvidia-smi topo -m (C2C at 43125 MB/s) and clears up why sysfs reports Gen1 x1 for the GPU endpoint. Also saw the same explanation from elsaco on the thread titled “PCIe Link Running at Gen1 x1 Instead of Expected Speed – DGX-Spark” (362025), so that part we’ll treat as expected.

The remaining issue is the ~1.5 GB/s per-NIC cap on inter-node NCCL throughput — even after a topology XML override lifts NCCL’s planned totalBw from 3.0 to 48.0, measured busbw doesn’t move.

Artifacts

Both captured on two DGX Spark FE units, kernel 6.17.0-1014-nvidia, driver 580.142, NCCL 2.29.7, CX7 FW 28.45.4028, MTU 9000, with one CX7 port on each card cabled between the nodes (two cables total). EC firmware 0x02004e18, UEFI device firmware 0x0200941a (SoC) and 0x00000507.

nvidia-bug-report.log.gz — attached for each node (visage / maximus).

NCCL all_gather_perf output (2 nodes × 1 GPU, -b 1G -e 4G -f 2), full NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH,TUNING,ENV:

| Run | NCCL plan | Measured busbw |
|---|---|---|
| Baseline (no topology override) | maxBw 12.0 totalBw 3.0 on both nodes | 1.19 GB/s |
| With NCCL_TOPO_FILE=nccl_topo_fixed.xml | visage totalBw 48.0, maximus totalBw 3.0 | 1.29 GB/s |

(Maximus still plans totalBw 3.0 with the override because the XML’s host_hash is visage-specific — we’ll regenerate that separately.)

Transport is TCP sockets — we’ve been running with NCCL_IB_DISABLE=1 + NCCL_NET_PLUGIN=none because the NGC-container-bundled AWS OFI NCCL plugin attempts DMABUF GDR on GB10 unified memory and fails with ibv_reg_mr_iova2 failed. Happy to re-run with the plugin re-enabled if useful; raw ib_write_bw on the same link shows 109 Gbps, so the RDMA path is healthy at the hardware level.

Env vars forwarded to both ranks via mpirun -x:

NCCL_SOCKET_IFNAME=enP2p1s0f1np1
NCCL_IB_HCA=roceP2p1s0f1
NCCL_IB_DISABLE=1
NCCL_IB_GID_INDEX=3
NCCL_NET_PLUGIN=none
NCCL_IB_PCI_RELAXED_ORDERING=1
NCCL_IB_MERGE_NICS=1
NCCL_CROSS_NIC=1
NCCL_ALGO=Ring
NCCL_PROTO=Simple
NCCL_NET_GDR_LEVEL=0
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH,TUNING,ENV
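
To keep the forwarded set identical on every run, the list above can live in a file and be turned into mpirun arguments. This is a sketch under assumptions: it targets Open MPI's mpirun, which accepts -x NAME=value; mpirun_env_args and the file name nccl.env are hypothetical; and values must not contain spaces.

```bash
# Turn NAME=value lines on stdin into a single line of "-x NAME=value" arguments.
# Blank lines and lines starting with '#' are skipped.
# ASSUMPTION: Open MPI's mpirun; values contain no whitespace.
mpirun_env_args() {
    args=""
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;
        esac
        args="$args -x $line"
    done
    # Drop the leading space added by the first append.
    printf '%s\n' "${args# }"
}

# Example (the path to all_gather_perf is a placeholder):
#   mpirun -np 2 -H visage,maximus $(mpirun_env_args < nccl.env) \
#       ./all_gather_perf -b 1G -e 4G -f 2
```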

Two related threads I’d like to ask about

  1. In “DGX Spark NCCL Test: 10GB/s not 200 Gbps=25 GB/s” (350077) you mentioned the playbook is designed for a single cable and sakshamconsul hit 22.1 GB/s after disconnecting one. We have two cables (one port on each CX7 card). Is dual-cable between two Sparks expected to cap throughput, and is there a supported multi-NIC config we should be using instead?
  2. In “One of Four DGX Sparks Shows ~35% Lower NCCL Bandwidth” (360591), a SoC firmware regression dropped NCCL to ~16 GB/s with a fix shipping March 12 2026 for Founder’s Edition. Our UEFI device firmware is 0x0200941a + 0x00000507. Is our current firmware on the fixed branch, and is there an fwupdmgr update we should be pulling?

Happy to provide any other captures you’d like.
all_gather_baseline.log (48.2 KB)
all_gather_topofix.log (39.4 KB)
Spark 1:
nvidia-bug-report.log.gz (507.0 KB)
system-state.txt (2.0 KB)
Spark 2:
nvidia-bug-report.log.gz (438.3 KB)
system-state.txt (1.7 KB)

  1. The published playbook is designed for one connected port. You can run with both ports using a different configuration, and you should get similar speed.
  2. Running ethtool -i enp1s0f1np1 should show:
driver: mlx5_core
version: 6.17.0-1014-nvidia
firmware-version: 28.45.4028 (NVD0000000087)

Please confirm you have this firmware version to eliminate the FW regression bug.

I will review your logs too.

@aniculescu — firmware check came back clean. All four mlx5 interfaces across both Sparks report:

driver: mlx5_core
version: 6.17.0-1014-nvidia
firmware-version: 28.45.4028 (NVD0000000087)

Exact match to what you specified, so we can rule out the regression.
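
In case it helps others repeat the check, the comparison can be scripted per interface. This is a small sketch: check_cx7_fw is a hypothetical helper name, it assumes the firmware-version line format shown above, and only the two interface names that appear in this thread are used in the example.

```bash
# Compare the firmware-version line from `ethtool -i` output (stdin)
# against an expected string. Returns 0 on match, 1 otherwise.
check_cx7_fw() {
    expected=$1
    # Extract everything after the "firmware-version: " prefix.
    actual=$(sed -n 's/^firmware-version: //p')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual"
    else
        echo "MISMATCH: got '$actual', expected '$expected'"
        return 1
    fi
}

# Example:
#   for ifc in enp1s0f1np1 enP2p1s0f1np1; do
#       ethtool -i "$ifc" | check_cx7_fw "28.45.4028 (NVD0000000087)"
#   done
```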

While we were running diagnostics we also made progress on the bandwidth itself. The block turned out to be the bundled AWS OFI NCCL plugin — it was failing on ibv_reg_mr_iova2 (DMABUF GDR on GB10’s unified memory) regardless of our NCCL_IB_DISABLE setting, which forced us down to TCP. Setting NCCL_NET_PLUGIN=none disables the plugin entirely and lets NCCL fall back to its built-in IB transport, and from there everything works.
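
A quick way to confirm which transport a given run actually selected is to look for the network name in the NCCL_DEBUG=INFO output. The sketch below assumes the log contains lines of the form "Using network <name>" (for example IB or Socket); nccl_networks_used is a hypothetical helper name.

```bash
# List the distinct transports named in "Using network <name>" lines
# of an NCCL_DEBUG=INFO log read from stdin.
# ASSUMPTION: the log format includes "Using network <name>" lines.
nccl_networks_used() {
    sed -n 's/.*Using network \(.*\)$/\1/p' | sort -u
}

# Example:
#   nccl_networks_used < all_gather_baseline.log
```

A TCP run would be expected to name Socket here, and a NET/IB run to name IB.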

Numbers from all_gather_perf -b 1G -e 4G -f 2 (2 nodes × 1 GPU, NCCL 2.29.7):

  • Before (TCP, 1 NIC): 1.19 GB/s
  • After (NET/IB, 1 NIC): 13.19 GB/s
  • After (NET/IB, 2 NICs merged via NCCL_IB_MERGE_NICS=1): 20.58 GB/s avg, 21.11 peak at 4 GB
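
To put those numbers in context, they can be expressed as a share of link rate. This is rough arithmetic only: it treats the 109 Gbps ib_write_bw result as the per-link ceiling, which is an assumption, and busbw_efficiency is a hypothetical helper name.

```bash
# Percentage of aggregate line rate achieved by a measured busbw.
# Gbps is converted to GB/s by dividing by 8.
# Usage: busbw_efficiency <busbw_GBps> <link_Gbps> <num_links>
busbw_efficiency() {
    awk -v bw="$1" -v gbps="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", 100 * bw / (gbps / 8 * n) }'
}

# busbw_efficiency 1.19  109 1   -> 8.7   (TCP, 1 NIC)
# busbw_efficiency 13.19 109 1   -> 96.8  (NET/IB, 1 NIC)
# busbw_efficiency 20.58 109 2   -> 75.5  (NET/IB, 2 NICs merged)
```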

Two follow-ups on that:

First, you mentioned there’s a different configuration to use when both ports are connected. We’re currently running with one port of each CX7 card cabled to its peer on the other Spark (rocep1s0f1 + roceP2p1s0f1). Is that the layout the dual-port config is designed for, and is there a specific playbook we should be following? Want to lock in the right setup before we move on to vLLM TP=2.

Second: journalctl -k on both nodes shows mlx5_core_test_wc: Write combining is not supported on all four mlx5 devices, and an ib_write_bw sweep confirms the small-message WC fingerprint (0.12 Gb/s at 2 bytes ramping up to 109 Gb/s at 64 KB+). Bulk NCCL traffic isn’t affected so it’s not blocking us, but flagging it in case the NEON WC patch is relevant to the ongoing investigation on your end.
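
For anyone wanting to check their own units for the same message, counting the matching kernel log lines is enough. A minimal sketch; count_wc_unsupported is a hypothetical helper name and it matches only the message text quoted above.

```bash
# Count kernel log lines (stdin) reporting that write combining is unsupported.
# Usage: journalctl -k | count_wc_unsupported
count_wc_unsupported() {
    # grep -c exits non-zero when the count is 0; keep the function's status clean.
    grep -c 'Write combining is not supported' || true
}
```

On the two Sparks in this thread the expected count is one line per mlx5 device.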

I’m glad your performance got better.

  1. Your setup is fine as-is and you should get max bandwidth. Since you have two ports you can connect a second cable, but it won’t really help performance; it is mostly for redundancy, or for connecting to a third system.
  2. Thanks for pointing that out, but I think it is just a warning message.