NCCL bandwidth capped at 3 GB/s, GPU PCIe topology reports Gen1 x1 on DGX Spark FE

itstexmex · April 11, 2026, 9:32am

NCCL caps at ~3 GB/s bus bandwidth between two separate DGX Spark FE units, roughly 8x below the 22-24 GB/s others report on similar hardware. Tracked it to sysfs reporting the GB10 GPU’s PCIe link as Gen1 x1, which NCCL’s topology cost model uses as its ring bandwidth ceiling. Raw RDMA (ib_write_bw) shows 109 Gbps on the same link, so the hardware is fine.

Hardware / software:

2x DGX Spark FE, both on SoC FW 0x0200941a, EC 0x02004e18
Kernel 6.17.0-1014-nvidia, CUDA 13.0
Driver: Spark 1 = 580.142, Spark 2 = 580.126.09 (results are identical on both, so the driver version isn’t the variable)
ConnectX-7 FW 28.45.4028, MTU 9000 on all 4 logical interfaces
Tested with NCCL 2.28.9 and 2.29.7, same result

Symptom:

ib_write_bw -d roceP2p1s0f1 -a --report_gbits → 109 Gbps OK
all_gather_perf (1-4 GiB) → 3.02 GB/s busbw
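
To compare the two figures in the same unit: ib_write_bw reports gigabits per second, while nccl-tests reports gigabytes per second. A minimal sketch of the conversion follows; the helper names `gbps_to_gbytes` and `ratio` are hypothetical, and protocol overhead is ignored.

```shell
# Convert a line rate in Gbit/s to GB/s (decimal units, 8 bits per byte).
# Hypothetical helper; ignores RoCE/Ethernet header overhead.
gbps_to_gbytes() {
    awk -v gbps="$1" 'BEGIN { printf "%.3f\n", gbps / 8 }'
}

# Ratio between two throughput figures given in the same unit.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

gbps_to_gbytes 109      # raw RDMA write bandwidth from this post -> 13.625
ratio 13.625 3.02       # RDMA GB/s vs. NCCL busbw GB/s -> 4.5
```

So the raw RDMA path moves roughly 13.6 GB/s, about 4.5x what NCCL achieves on the same link.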

NCCL’s own topology detection on the 2.29.7 baseline run:

=== System : maxBw 12.0 totalBw 3.0 ===
Pattern 4, crossNic 0, nChannels 8, bw 3.000000/3.000000, type LOC/P2C
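
For anyone wanting to pull the same two lines out of their own run, a small filter over the NCCL debug log is enough. This is a sketch: `nccl_graph_summary` is a hypothetical helper, the line shapes are taken from the output quoted above, and the launch command is left as a placeholder because it depends on your setup.

```shell
# Keep only the topology-search summary lines from an NCCL debug log on stdin.
# The patterns mirror the lines quoted in this post.
nccl_graph_summary() {
    grep -E 'System : maxBw|Pattern [0-9]+, crossNic' || true
}

# Example (substitute your usual nccl-tests launch command):
#   NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH <your all_gather_perf launch> 2>&1 | nccl_graph_summary
```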

The GPU’s PCIe link shows as Gen1 x1 in sysfs:

$ cat /sys/bus/pci/devices/000f:01:00.0/{current,max}_link_{speed,width}
2.5 GT/s PCIe
2.5 GT/s PCIe
1
16

$ nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
1, 1, 1, 16
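
Because the raw `cat` output is unlabeled, it is easy to mix up which value belongs to which attribute. A sketch that prints each sysfs attribute next to its name (`pcie_link_report` is a hypothetical helper; the attribute names are the standard PCI sysfs ones):

```shell
# Print each PCIe link attribute with its name so the values cannot be mixed up.
# $1 is a device directory such as /sys/bus/pci/devices/000f:01:00.0
pcie_link_report() {
    dev="$1"
    for attr in current_link_speed max_link_speed current_link_width max_link_width; do
        if [ -r "$dev/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dev/$attr")"
        else
            printf '%s=<unreadable>\n' "$attr"
        fi
    done
}

# Usage on the affected machine:
#   pcie_link_report /sys/bus/pci/devices/000f:01:00.0
# The upstream port is the parent directory of the resolved device path:
#   pcie_link_report "$(dirname "$(readlink -f /sys/bus/pci/devices/000f:01:00.0)")"
```

Checking the upstream port as well shows whether the Gen1 report comes from the GPU endpoint, the bridge above it, or both.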

pcie.link.gen.max is reported as 1 as well, so the maximum is affected, not just the current link state. NCCL reads the link speed from sysfs, and its topology ceiling of totalBw 3.0 matches the measured 3.02 GB/s busbw almost exactly.
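
The 3.0 figure can be reproduced with a simple model. The sketch below assumes NCCL computes a PCI link's bandwidth as width x per-lane rate / 80, with the per-lane rate taken from a lookup table in units of 100 Mbit/s. Both the formula and the table values are recalled from NCCL's topology code and have not been checked against 2.29.7, so treat them as assumptions; the function names are hypothetical.

```shell
# Per-lane rate in units of 100 Mbit/s for a sysfs link-speed string.
# ASSUMPTION: values recalled from NCCL's topology code, not verified for 2.29.7.
pci_lane_rate() {
    case "$1" in
        "2.5 GT/s"*)              echo 15 ;;
        "5 GT/s"*|"5.0 GT/s"*)    echo 30 ;;
        "8 GT/s"*|"8.0 GT/s"*)    echo 60 ;;
        "16 GT/s"*|"16.0 GT/s"*)  echo 120 ;;
        "32 GT/s"*|"32.0 GT/s"*)  echo 240 ;;
        *) echo "unknown link speed: $1" >&2; return 1 ;;
    esac
}

# Modeled PCI link bandwidth in GB/s (assumed formula: width * lane_rate / 80).
nccl_pci_bw() {
    rate="$(pci_lane_rate "$1")" || return 1
    awk -v w="$2" -v r="$rate" 'BEGIN { printf "%.1f\n", w * r / 80 }'
}

nccl_pci_bw "2.5 GT/s PCIe" 16    # 3.0
nccl_pci_bw "32.0 GT/s PCIe" 16   # 48.0
nccl_pci_bw "2.5 GT/s PCIe" 1     # 0.2
```

If this model is right, the totalBw 3.0 ceiling corresponds to the Gen1 speed combined with the maximum width of 16 rather than the current width of 1, and Gen5 at x16 gives the totalBw 48.0 that appears once link_speed is overridden.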

Tried overriding this with an NCCL_TOPO_FILE that sets link_speed="32.0 GT/s PCIe" on the GPU entry. NCCL's planning responds as expected:

=== System : maxBw 12.0 totalBw 48.0 ===
Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type LOC/P2C
Pattern 3, crossNic 0, nChannels 1, bw 24.000000/12.000000, type LOC/P2C

The override works, but actual throughput is still stuck around 1.5 GB/s per NIC regardless of channel count (NCCL_MIN_NCHANNELS=16 doesn’t help). Something below NCCL is also capped.
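
For reference, the override described above can be reproduced roughly as follows: dump the topology NCCL detected with NCCL_TOPO_DUMP_FILE, rewrite link_speed on the GPU's entry, and load the result through NCCL_TOPO_FILE. This is a sketch, not the exact file used here. `patch_topo_link_speed` is a hypothetical helper, the /tmp paths are placeholders, and it assumes each `<pci>` element sits on one line with `busid` and `link_speed` attributes.

```shell
# Rewrite link_speed on the <pci> element whose busid matches.
# ASSUMPTION: busid and link_speed are on the same line of the dumped XML.
# Usage: patch_topo_link_speed <in.xml> <busid> <new speed string> > <out.xml>
patch_topo_link_speed() {
    sed -e "/busid=\"$2\"/ s|link_speed=\"[^\"]*\"|link_speed=\"$3\"|" "$1"
}

# Example flow (substitute your usual nccl-tests launch command, and use the
# busid exactly as it appears in the dump):
#   NCCL_TOPO_DUMP_FILE=/tmp/topo.xml <your all_gather_perf launch>
#   patch_topo_link_speed /tmp/topo.xml 000f:01:00.0 "32.0 GT/s PCIe" > /tmp/topo-gen5.xml
#   NCCL_TOPO_FILE=/tmp/topo-gen5.xml <your all_gather_perf launch>
```

This only changes what NCCL's planner believes about the link. As the numbers above show, it does not lift whatever is limiting the actual transfers.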

Already ruled out:

SoC firmware regression (on current 0x0200941a, verified on both units)
MTU mismatch (9000 on all 4 logical interfaces)
GID index (tests pass with NCCL_IB_GID_INDEX set explicitly)
NCCL version (2.28.9 and 2.29.7 identical)
Driver version (different on each unit, identical results)

Separately: all four mlx5_core devices report “Write combining is not supported” in dmesg. I think that’s the Grace CPU issue the NEON-based mlx5 WC test patch addresses (submitted to LKML in Sep 2025). Not sure if that fix is in 6.17.0-1014-nvidia or if I need 6.18+.
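
A quick way to check the same thing on another unit is to count the warning in the kernel log and note the running kernel. A sketch; `count_wc_unsupported` is a hypothetical helper and matches only the message text quoted above.

```shell
# Count kernel-log lines on stdin that report missing write-combining support.
# Matches the message text quoted in this post, case-insensitively.
count_wc_unsupported() {
    grep -ci 'write combining is not supported' || true
}

# Usage on the affected machine:
#   sudo dmesg | count_wc_unsupported
#   uname -r
```

A count of 4 would match the four mlx5_core devices described here; 0 would suggest the running kernel no longer emits the warning.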

Questions:

Is it expected for the GB10 GPU's PCIe link to report max_link_speed as Gen1 (2.5 GT/s) with a current width of x1, or is this a kernel/driver bug? Can anything in userspace correct it?
For users getting 22-24 GB/s on similar hardware, what kernel / NCCL combo are you on?
Is the mlx5 NEON WC patch in any shipping kernel yet, or is 6.18+ required?

Can share full NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH output and the topology XML override if it helps.
