Dual DGX Spark: NCCL capped at 2.80 GB/s, ib_write_bw crashes at 128 KB with syndrome 0x88 — matches thread 366266 with additional RoCE degradation

TL;DR

Dual DGX Spark FE cluster over a 200GbE QSFP56 DAC: NCCL busbw is capped at 2.80 GB/s versus the 22-24 GB/s target. PCIe sysfs reports Gen1 x1 (matches thread https://forums.developer.nvidia.com/t/nccl-bandwidth-capped-at-3-gb-s-gpu-pcie-topology-reports-gen1-x1-on-dgx-spark-fe/366266). In addition, my ib_write_bw peaks at only 13.5 Gb/s and crashes at 128 KB with syndrome 0x88, significantly worse than the 109 Gb/s reported by @itstexmex on the same stack.

Purchased 2x Spark specifically for 405B distributed inference. Current performance makes this use case non-viable.

HARDWARE AND SOFTWARE

  • 2x DGX Spark Founders Edition (hostnames spark-4bf5 and spark-8569)

  • QSFP56 200GbE DAC, single cable, interface enp1s0f1np1 on both nodes, MTU 9000

  • DGX OS 7.5.0, kernel 6.17.0-1014-nvidia (aarch64)

  • NCCL 2.28.9 built from source with NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"

  • CUDA 13.0

  • OpenMPI system package (libopenmpi-dev)

SYMPTOM 1 — NCCL 2.80 GB/s (8x below target)

Following the official NCCL Stacked Sparks playbook (https://build.nvidia.com/spark/nccl/stacked-sparks):

mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=enp1s0f1np1 \
  -x OMPI_MCA_btl_tcp_if_include=enp1s0f1np1 \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2

Result: algbw 5.61 GB/s, busbw 2.80 GB/s. Reproduced twice, identical.
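The two reported numbers are self-consistent: nccl-tests derives all_gather busbw from algbw as algbw * (n-1)/n (per its performance doc), which with two ranks is exactly half. A quick check:

```shell
# nccl-tests: busbw = algbw * (n-1)/n for all_gather; with n=2 ranks this
# halves algbw, so 5.61 GB/s algbw -> 2.805 GB/s busbw, matching the
# reported 2.80 GB/s. The bottleneck is therefore upstream of NCCL's math.
awk 'BEGIN { algbw = 5.61; n = 2; printf "%.3f\n", algbw * (n - 1) / n }'
```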

The debug log confirms the NET/IB (RoCE) transport is used (not TCP sockets). GPUDirect RDMA is correctly disabled, per NVIDIA policy for the UMA architecture.

SYMPTOM 2 — PCIe sysfs Gen1 x1 misreporting

$ nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
pcie.link.gen.current, pcie.link.width.current, pcie.link.gen.max, pcie.link.width.max
1, 1, 1, 16

Identical to thread 366266. Both the current and max link generation report Gen1, and the current width is x1 (max x16). Expected: Gen5 x16 for GB10 Blackwell. NCCL uses this value in its topology cost model, which caps ring bandwidth at roughly 3 GB/s, matching my measured busbw.
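The same values can be cross-checked straight from PCIe sysfs, independent of nvidia-smi. `pcie_link_report` is just a helper name of mine, and the BDF below is a guess; substitute the GPU's address from `lspci | grep -i nvidia`:

```shell
# Reads the standard PCI sysfs link attributes for one device directory.
# Argument: a sysfs device dir such as /sys/bus/pci/devices/0000:01:00.0
pcie_link_report() {
  for f in current_link_speed max_link_speed current_link_width max_link_width; do
    [ -r "$1/$f" ] && printf '%s: %s\n' "$f" "$(cat "$1/$f")"
  done
  return 0
}

# 0000:01:00.0 is an assumed BDF; adjust to your GPU.
pcie_link_report /sys/bus/pci/devices/0000:01:00.0
```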

SYMPTOM 3 — ib_write_bw underperforming and crashing

$ ib_write_bw -d rocep1s0f1 192.168.100.11 -a --report_gbits
…
    512 bytes:  11.25 Gb/s avg
   1024 bytes:  12.68 Gb/s avg
   4096 bytes:  13.37 Gb/s avg
  16384 bytes:  13.47 Gb/s avg
  65536 bytes:  13.47 Gb/s avg
 131072 bytes:  CRASH
Completion with error at client
Failed status 10: wr_id 0 syndrom 0x88
scnt=128, ccnt=0

Observed peak: 13.50 Gb/s. Thread 366266 reports 109 Gb/s on the same hardware and software stack, so my RoCE path appears to have additional degradation beyond the documented PCIe sysfs bug.
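Putting both measurements against the 200 Gb/s line rate of the QSFP56 DAC makes the gap concrete:

```shell
# Simple arithmetic: fraction of 200 Gb/s line rate achieved by each result.
awk 'BEGIN {
  line = 200.0
  printf "this cluster:  13.5 Gb/s = %.1f%% of line rate\n", 13.5 / line * 100
  printf "thread 366266: 109 Gb/s  = %.1f%% of line rate\n", 109.0 / line * 100
}'
```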

Retry with rdma_cm + 4 QPs + 1MB fixed messages:

ib_write_bw -d rocep1s0f1 192.168.100.11 --report_gbits -q 4 --connection=RC -R -s 1048576 -D 10

Immediate crash with the same syndrome 0x88; zero data transferred.
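For reference, "Failed status 10" in the perftest output is an ibv_wc_status value from <infiniband/verbs.h>; a small lookup (names and values taken from that header) decodes it. The 0x88 syndrome is a vendor-specific mlx5 CQE field and is left for NVIDIA to interpret. `wc_status_name` is a helper name of my own:

```shell
# Maps selected ibv_wc_status enum values (per <infiniband/verbs.h>) to names.
wc_status_name() {
  case "$1" in
    0)  echo IBV_WC_SUCCESS ;;
    5)  echo IBV_WC_WR_FLUSH_ERR ;;
    9)  echo IBV_WC_REM_INV_REQ_ERR ;;
    10) echo IBV_WC_REM_ACCESS_ERR ;;
    12) echo IBV_WC_RETRY_EXC_ERR ;;
    13) echo IBV_WC_RNR_RETRY_EXC_ERR ;;
    *)  echo "other ($1)" ;;
  esac
}

wc_status_name 10   # -> IBV_WC_REM_ACCESS_ERR
```

Status 10 being a remote access error suggests the responder rejected the RDMA write (a memory-registration/permissions-class failure) rather than a raw link fault, which would be consistent with a driver-side issue rather than cabling.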

ALREADY RULED OUT

  • Physical link healthy: Speed 200000Mb/s, Duplex Full, Link detected yes

  • Zero CRC errors (rx_crc_errors_phy=0)

  • Zero discards on any priority (rx_prio*_discards=0)

  • Zero pause frame events

  • PFC tuned per Mellanox best practice: priority 3 enabled=1, buffer=1, ring 8192/8192, global pause disabled, trust mode DSCP — no improvement

  • MTU 9000 confirmed on both nodes

  • NCCL correctly uses NET/IB (RoCE) transport, not TCP sockets

  • SSH passwordless bidirectional verified before testing
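The counter checks above boil down to filtering `ethtool -S` output for non-zero error counters. A sketch of how I verified them (`flag_nonzero` is a helper name of my own; the sample input below is synthetic):

```shell
# Prints any non-zero CRC/discard/pause counter from `ethtool -S <ifc>` output;
# empty output means the link-level checks pass.
flag_nonzero() {
  grep -E 'crc_errors|discards|pause' | awk -F': ' '$2 + 0 != 0 { print "NONZERO:", $0 }'
}

# Demo on fabricated counters; a real run is: ethtool -S enp1s0f1np1 | flag_nonzero
printf 'rx_crc_errors_phy: 0\nrx_prio3_discards: 0\nrx_pause_ctrl_phy: 7\n' | flag_nonzero
```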

QUESTIONS FOR NVIDIA

  1. Is the PCIe sysfs Gen1 x1 misreporting a known regression in kernel 6.17.0-1014-nvidia / DGX OS 7.5.0? ETA for fix?

  2. For users achieving 22-24 GB/s busbw on the same hardware class reported in earlier threads — what kernel/NCCL/firmware combination works today? Is there a supported downgrade path?

  3. Is the mlx5 NEON Write Combining patch (submitted to LKML September 2025) included in any shipping DGX OS release? If not, which release/kernel version will include it?

  4. My ib_write_bw result (13.5 Gb/s peak, crash at 128 KB with syndrome 0x88) is far worse than the 109 Gb/s reported in thread 366266 on the same stack. Same root cause, or an additional issue requiring separate investigation?

  5. Given that dual-Spark clustering for 405B inference was the primary purchase justification (per NVIDIA marketing), what is the escalation path if a software fix does not arrive in a reasonable timeframe?

EVIDENCE PACKAGE

Attached nvidia-ticket-evidence-20260418.tar.gz contains:

  • system-info.txt, link-info.txt, link-stats.txt

  • pause-config.txt, qos-config.txt, ring-config.txt

  • pcie-current-speed.txt, pcie-max-speed.txt, pcie-current-width.txt, pcie-max-width.txt

  • rdma-device.txt (ibv_devinfo -v full output)

  • dmesg-network.txt (mlx5, roce, nvidia kernel messages)

  • nccl-test-result.txt (reproduction commands and full output)


  • ib-write-bw-result.txt (test results including crash evidence)

  • public-thread-reference.txt (correlation with thread 366266)

Happy to provide additional diagnostics, rerun with verbose NCCL/dmesg logging, or participate in remote debugging sessions.

Have you read the thread that you linked? The user was able to solve their issue, so please try their steps first.