NCCL single-cable test caps at 100Gbps

Hi,

I’m connecting two Sparks with a 200GbE QSFP56 cable, following the playbooks. The Connect Two Sparks playbook works and I can see both ports are Up, but I’m having trouble with the NCCL for Two Sparks playbook.

After setting the environment variables and running the command below:
mpirun -np 2 -H 169.254.6.140:1,169.254.221.162:1 --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x UCX_NET_DEVICES=enp1s0f0np0 -x NCCL_SOCKET_IFNAME=enp1s0f0np0 -x NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2

The best avg. bus bandwidth I get is 15.9 GB/s, but I thought I should be getting 20+. So far I’ve tried:

  • applying a higher/lower MTU in the netplan config (setting both to 9000 instead of 1500 makes the bandwidth worse)
  • adding NCCL_IB_GID_INDEX=3 (which fails because there are only two GID indexes)
  • adding NCCL_IB_QPS_PER_CONNECTION=8
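For context, the MTU change above was applied through netplan. This is a sketch of that kind of config, assuming Ubuntu netplan and the Up interface names from the ibdev2netdev output further down; the file path is hypothetical:

```yaml
# /etc/netplan/<your-config>.yaml (hypothetical file name)
# Sets jumbo frames (MTU 9000) on the two Up RoCE ports; revert to 1500 to undo.
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      mtu: 9000
    enP2p1s0f0np0:
      mtu: 9000
```

Applied with `sudo netplan apply` on both machines; both ends of the link need the same MTU.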

I also tested with ib_write_bw and hit the same issue: the highest speed is 100Gbps.

My ibdev2netdev output seems fine:

ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

It’s as if whatever allows 200Gbps through a single cable isn’t working properly. How could I fix this?

Thank you!

You only need one cable to reach 200Gbps. Running the playbook with a second cable will limit your bandwidth to 100Gbps.

The current firmware has a performance regression in ConnectX-7 workloads. We are all seeing a drop from ~24 GB/s to ~16 GB/s on NCCL tests. NVIDIA is aware and is working on a fix.
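The numbers line up with the arithmetic, for what it’s worth. A quick sketch of the line-rate ceilings (real NCCL bus bandwidth always lands somewhat below these due to protocol overhead):

```shell
# Convert link line rate (Gb/s) to byte rate (GB/s): divide by 8.
awk 'BEGIN{print 200/8}'   # 200GbE ceiling: 25 GB/s
awk 'BEGIN{print 100/8}'   # 100GbE ceiling: 12.5 GB/s
```

So ~24 GB/s is close to the 200GbE ceiling, and the ~16 GB/s you and I are both seeing sits between the two, i.e. above what a single 100GbE link could deliver, which points at the firmware regression rather than a link stuck at 100Gbps.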


Yup, I’m using just one cable.

I see, thanks so much! Would that apply to ib_write_bw as well, though?

I believe so. I can test on my Sparks later today.