NCCL bandwidth capped at 3 GB/s, GPU PCIe topology reports Gen1 x1 on DGX Spark FE

NCCL caps at ~3 GB/s bus bandwidth between two separate DGX Spark FE units, roughly 8x below the 22-24 GB/s others report on similar hardware. Tracked it to sysfs reporting the GB10 GPU’s PCIe link as Gen1 x1, which NCCL’s topology cost model uses as its ring bandwidth ceiling. Raw RDMA (ib_write_bw) shows 109 Gbps on the same link, so the hardware is fine.

Hardware / software:

  • 2x DGX Spark FE, both on SoC FW 0x0200941a, EC 0x02004e18
  • Kernel 6.17.0-1014-nvidia, CUDA 13.0
  • Driver: Spark 1 = 580.142, Spark 2 = 580.126.09 (results are identical on both, so the driver version isn’t the variable)
  • ConnectX-7 FW 28.45.4028, MTU 9000 on all 4 logical interfaces
  • Tested with NCCL 2.28.9 and 2.29.7, same result

Symptom:

ib_write_bw -d roceP2p1s0f1 -a --report_gbits → 109 Gbps OK
all_gather_perf (1-4 GiB) → 3.02 GB/s busbw

NCCL’s own topology detection on the 2.29.7 baseline run:

=== System : maxBw 12.0 totalBw 3.0 ===
Pattern 4, crossNic 0, nChannels 8, bw 3.000000/3.000000, type LOC/P2C

The GPU’s PCIe link shows as Gen1 x1 in sysfs:

$ cat /sys/bus/pci/devices/000f:01:00.0/*_link_{speed,width}
2.5 GT/s PCIe
2.5 GT/s PCIe
1
16

$ nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
1, 1, 1, 16

Note that the max link speed is reported as Gen1 too, not just the current one (sysfs max_link_speed, nvidia-smi pcie.link.gen.max). NCCL reads these values from sysfs, and its topology ceiling of totalBw 3.0 matches the measured 3.02 GB/s busbw exactly.
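
For scale, those link settings map to theoretical one-direction bandwidth like this (plain encoding-overhead arithmetic — 8b/10b for Gen1/2, 128b/130b for Gen3+ — not NCCL's internal cost model):

```shell
# Theoretical per-direction PCIe bandwidth for a few gen/width combos.
# Encoding overhead only; real throughput is lower (TLP/protocol overhead).
awk 'BEGIN {
  printf "Gen1 x1:  %.2f GB/s\n",  2.5 *   8/10  / 8 *  1;   # what sysfs claims
  printf "Gen1 x16: %.2f GB/s\n",  2.5 *   8/10  / 8 * 16;
  printf "Gen5 x16: %.2f GB/s\n", 32.0 * 128/130 / 8 * 16;   # a healthy Gen5 link
}'
```

Whatever floor NCCL applies internally, a Gen1 reading leaves it planning around single-digit GB/s at best, which lines up with the observed ceiling.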

Tried overriding with NCCL_TOPO_FILE and setting link_speed="32.0 GT/s PCIe" on the GPU entry. NCCL's planning responds as expected:

=== System : maxBw 12.0 totalBw 48.0 ===
Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/P2C
Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/P2C

The override works, but actual throughput is still stuck around 1.5 GB/s per NIC regardless of channel count (NCCL_MIN_NCHANNELS=16 doesn’t help). Something below NCCL is also capped.
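
For anyone trying the same thing, the override file was roughly of this shape — a sketch based on what NCCL_TOPO_DUMP_FILE emits; the busid is the GPU's from above, but the other attribute values here are placeholders and the exact attribute set varies by NCCL version:

```xml
<!-- Sketch of a NCCL_TOPO_FILE override: bump link_speed/link_width on the
     GPU's <pci> entry. Values other than busid/link_speed/link_width are
     placeholders and will differ on a real dump. -->
<system version="1">
  <cpu numaid="0" arch="arm64">
    <pci busid="000f:01:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16">
      <gpu dev="0" rank="0" gdr="1"/>
    </pci>
  </cpu>
</system>
```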

Already ruled out:

  • SoC firmware regression (on current 0x0200941a, verified on both units)
  • MTU mismatch (9000 on all 4 logical interfaces)
  • GID index (tests pass with NCCL_IB_GID_INDEX set explicitly)
  • NCCL version (2.28.9 and 2.29.7 identical)
  • Driver version (different on each unit, identical results)
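
In case anyone wants to compare link state on their own units, the sysfs checks above can be wrapped in a small helper (pci_links is a name I made up; the root parameter exists only so it's easy to test — on a real system call it with no argument):

```shell
# Dump current/max PCIe link speed and width for every device under a sysfs
# root (default /sys/bus/pci/devices), so the GB10 GPU and the CX-7 functions
# can be compared at a glance.
pci_links() {
  root="${1:-/sys/bus/pci/devices}"
  for d in "$root"/*; do
    [ -f "$d/current_link_speed" ] || continue   # skip devices without a PCIe link
    printf '%s: cur %s x%s, max %s x%s\n' "$(basename "$d")" \
      "$(cat "$d/current_link_speed")" "$(cat "$d/current_link_width")" \
      "$(cat "$d/max_link_speed")"     "$(cat "$d/max_link_width")"
  done
}
```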

Separately: all four mlx5_core devices report “Write combining is not supported” in dmesg. I think that’s the Grace CPU issue the NEON-based mlx5 WC test patch addresses (submitted to LKML in Sep 2025). Not sure if that fix is in 6.17.0-1014-nvidia or if I need 6.18+.

Questions:

  1. Is the GPU’s PCIe max_link_speed being reported as Gen1 x1 on GB10 expected, or is it a kernel/driver bug? Is there anything userspace can do to correct it?
  2. For users getting 22-24 GB/s on similar hardware, what kernel / NCCL combo are you on?
  3. Is the mlx5 NEON WC patch in any shipping kernel yet, or is 6.18+ required?

Can share full NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH output and the topology XML override if it helps.