Dual DGX Spark: NCCL capped at 2.80 GB/s, ib_write_bw crashes at 128 KB with syndrome 0x88 — matches thread 366266 with additional RoCE degradation

TL;DR

Dual DGX Spark FE cluster over a 200GbE QSFP56 DAC: NCCL busbw is capped at 2.80 GB/s versus the 22-24 GB/s target. PCIe sysfs reports Gen1 x1 (matches thread https://forums.developer.nvidia.com/t/nccl-bandwidth-capped-at-3-gb-s-gpu-pcie-topology-reports-gen1-x1-on-dgx-spark-fe/366266). In addition, my ib_write_bw peaks at only 13.5 Gb/s and crashes at 128 KB with syndrome 0x88, significantly worse than the 109 Gb/s reported by @itstexmex on the same stack.

Purchased 2x Spark specifically for 405B distributed inference. Current performance makes this use case non-viable.

HARDWARE AND SOFTWARE

  • 2x DGX Spark Founders Edition (hostnames spark-4bf5 and spark-8569)

  • QSFP56 200GbE DAC, single cable, interface enp1s0f1np1 on both nodes, MTU 9000

  • DGX OS 7.5.0, kernel 6.17.0-1014-nvidia (aarch64)

  • NCCL 2.28.9 built from source with NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"

  • CUDA 13.0

  • OpenMPI system package (libopenmpi-dev)

SYMPTOM 1 — NCCL 2.80 GB/s (8x below target)

Following the official NCCL Stacked Sparks playbook (https://build.nvidia.com/spark/nccl/stacked-sparks):

mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=enp1s0f1np1 \
  -x OMPI_MCA_btl_tcp_if_include=enp1s0f1np1 \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2

Result: algbw 5.61 GB/s, busbw 2.80 GB/s. Reproduced twice, identical.
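The two reported numbers are self-consistent: nccl-tests derives all_gather busbw from algbw as algbw * (n-1)/n (per its performance doc), which with two ranks is exactly half. A quick check:

```shell
# nccl-tests: busbw = algbw * (n-1)/n for all_gather; with n=2 ranks this
# halves algbw, so 5.61 GB/s algbw -> 2.805 GB/s busbw, matching the
# reported 2.80 GB/s. The bottleneck is therefore upstream of NCCL's math.
awk 'BEGIN { algbw = 5.61; n = 2; printf "%.3f\n", algbw * (n - 1) / n }'
```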

The debug log confirms the NET/IB (RoCE) transport is used (not TCP sockets). GPUDirect RDMA is correctly disabled, per NVIDIA policy for the UMA architecture.

SYMPTOM 2 — PCIe sysfs Gen1 x1 misreporting

$ nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
pcie.link.gen.current, pcie.link.width.current, pcie.link.gen.max, pcie.link.width.max
1, 1, 1, 16

Identical to thread 366266. Both the current and max link generation report Gen1, and the current width is x1 (max x16). Expected: Gen5 x16 for GB10 Blackwell. NCCL uses this value in its topology cost model, which caps ring bandwidth at roughly 3 GB/s, matching my measured busbw.
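The same values can be cross-checked straight from PCIe sysfs, independent of nvidia-smi. `pcie_link_report` is just a helper name of mine, and the BDF below is a guess; substitute the GPU's address from `lspci | grep -i nvidia`:

```shell
# Reads the standard PCI sysfs link attributes for one device directory.
# Argument: a sysfs device dir such as /sys/bus/pci/devices/0000:01:00.0
pcie_link_report() {
  for f in current_link_speed max_link_speed current_link_width max_link_width; do
    [ -r "$1/$f" ] && printf '%s: %s\n' "$f" "$(cat "$1/$f")"
  done
  return 0
}

# 0000:01:00.0 is an assumed BDF; adjust to your GPU.
pcie_link_report /sys/bus/pci/devices/0000:01:00.0
```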

SYMPTOM 3 — ib_write_bw underperforming and crashing

$ ib_write_bw -d rocep1s0f1 192.168.100.11 -a --report_gbits
…
    512 bytes:  11.25 Gb/s avg
   1024 bytes:  12.68 Gb/s avg
   4096 bytes:  13.37 Gb/s avg
  16384 bytes:  13.47 Gb/s avg
  65536 bytes:  13.47 Gb/s avg
 131072 bytes:  CRASH
Completion with error at client
Failed status 10: wr_id 0 syndrom 0x88
scnt=128, ccnt=0

Observed peak: 13.50 Gb/s. Thread 366266 reports 109 Gb/s on the same hardware and software stack, so my RoCE path appears to have additional degradation beyond the documented PCIe sysfs bug.
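Putting both measurements against the 200 Gb/s line rate of the QSFP56 DAC makes the gap concrete:

```shell
# Simple arithmetic: fraction of 200 Gb/s line rate achieved by each result.
awk 'BEGIN {
  line = 200.0
  printf "this cluster:  13.5 Gb/s = %.1f%% of line rate\n", 13.5 / line * 100
  printf "thread 366266: 109 Gb/s  = %.1f%% of line rate\n", 109.0 / line * 100
}'
```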

Retry with rdma_cm + 4 QPs + 1MB fixed messages:

ib_write_bw -d rocep1s0f1 192.168.100.11 --report_gbits -q 4 --connection=RC -R -s 1048576 -D 10

Immediate crash with the same syndrome 0x88; zero data transferred.
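For reference, "Failed status 10" in the perftest output is an ibv_wc_status value from <infiniband/verbs.h>; a small lookup (names and values taken from that header) decodes it. The 0x88 syndrome is a vendor-specific mlx5 CQE field and is left for NVIDIA to interpret. `wc_status_name` is a helper name of my own:

```shell
# Maps selected ibv_wc_status enum values (per <infiniband/verbs.h>) to names.
wc_status_name() {
  case "$1" in
    0)  echo IBV_WC_SUCCESS ;;
    5)  echo IBV_WC_WR_FLUSH_ERR ;;
    9)  echo IBV_WC_REM_INV_REQ_ERR ;;
    10) echo IBV_WC_REM_ACCESS_ERR ;;
    12) echo IBV_WC_RETRY_EXC_ERR ;;
    13) echo IBV_WC_RNR_RETRY_EXC_ERR ;;
    *)  echo "other ($1)" ;;
  esac
}

wc_status_name 10   # -> IBV_WC_REM_ACCESS_ERR
```

Status 10 being a remote access error suggests the responder rejected the RDMA write (a memory-registration/permissions-class failure) rather than a raw link fault, which would be consistent with a driver-side issue rather than cabling.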

ALREADY RULED OUT

  • Physical link healthy: Speed 200000Mb/s, Duplex Full, Link detected yes

  • Zero CRC errors (rx_crc_errors_phy=0)

  • Zero discards on any priority (rx_prio*_discards=0)

  • Zero pause frame events

  • PFC tuned per Mellanox best practice: priority 3 enabled=1, buffer=1, ring 8192/8192, global pause disabled, trust mode DSCP — no improvement

  • MTU 9000 confirmed on both nodes

  • NCCL correctly uses NET/IB (RoCE) transport, not TCP sockets

  • SSH passwordless bidirectional verified before testing
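The counter checks above boil down to filtering `ethtool -S` output for non-zero error counters. A sketch of how I verified them (`flag_nonzero` is a helper name of my own; the sample input below is synthetic):

```shell
# Prints any non-zero CRC/discard/pause counter from `ethtool -S <ifc>` output;
# empty output means the link-level checks pass.
flag_nonzero() {
  grep -E 'crc_errors|discards|pause' | awk -F': ' '$2 + 0 != 0 { print "NONZERO:", $0 }'
}

# Demo on fabricated counters; a real run is: ethtool -S enp1s0f1np1 | flag_nonzero
printf 'rx_crc_errors_phy: 0\nrx_prio3_discards: 0\nrx_pause_ctrl_phy: 7\n' | flag_nonzero
```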

QUESTIONS FOR NVIDIA

  1. Is the PCIe sysfs Gen1 x1 misreporting a known regression in kernel 6.17.0-1014-nvidia / DGX OS 7.5.0? ETA for fix?

  2. For users achieving 22-24 GB/s busbw on the same hardware class reported in earlier threads — what kernel/NCCL/firmware combination works today? Is there a supported downgrade path?

  3. Is the mlx5 NEON Write Combining patch (submitted to LKML September 2025) included in any shipping DGX OS release? If not, which release/kernel version will include it?

  4. My ib_write_bw result (13.5 Gb/s peak, crash at 128 KB with syndrome 0x88) is far worse than the 109 Gb/s reported in thread 366266 on the same stack. Same root cause, or an additional issue requiring separate investigation?

  5. Given that dual-Spark clustering for 405B inference was the primary purchase justification (per NVIDIA marketing), what is the escalation path if a software fix does not arrive in a reasonable timeframe?

EVIDENCE PACKAGE

Attached nvidia-ticket-evidence-20260418.tar.gz contains:

  • system-info.txt, link-info.txt, link-stats.txt

  • pause-config.txt, qos-config.txt, ring-config.txt

  • pcie-current-speed.txt, pcie-max-speed.txt, pcie-current-width.txt, pcie-max-width.txt

  • rdma-device.txt (ibv_devinfo -v full output)

  • dmesg-network.txt (mlx5, roce, nvidia kernel messages)

  • nccl-test-result.txt (reproduction commands and full output)


  • ib-write-bw-result.txt (test results including crash evidence)

  • public-thread-reference.txt (correlation with thread 366266)

Happy to provide additional diagnostics, rerun with verbose NCCL/dmesg logging, or participate in remote debugging sessions.

Have you read the thread that you linked? The user was able to solve their issue, so please try their steps first.