DGX Spark NCCL test: 10 GB/s, not the expected 200 Gbps (= 25 GB/s)

Hi everyone,

I followed the playbook in the NVIDIA guide (Try NVIDIA NIM APIs), but I am seeing noticeably lower bandwidth than expected. I am currently at about 41% utilization (roughly 10 GB/s out of the 25 GB/s line rate).

export PORT_NAME=enp1s0f0np0
export UCX_NET_DEVICES=$PORT_NAME
export NCCL_SOCKET_IFNAME=$PORT_NAME
export OMPI_MCA_btl_tcp_if_include=$PORT_NAME

export DEVICE_1_IP=169.254.155.221
export DEVICE_2_IP=169.254.174.230

mpirun -np 2 -H $DEVICE_1_IP:1,$DEVICE_2_IP:1 --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2

Results:

# nccl-tests version 2.17.6 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  96122 on prior-node device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  28364 on posterior-node device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   839441   20.47   10.23       0   836266   20.54   10.27       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 10.2523
#
# Collective test concluded: all_gather_perf

When I run with NCCL_DEBUG=INFO, I see:

NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'rocep1s0f0'
NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'rocep1s0f1'
NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'roceP2p1s0f0'
NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'roceP2p1s0f1'

The output of ibdev2netdev:

rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)

How do I increase my bandwidth utilization? My first thought was to enable GPUDirect RDMA (GDR), but with the new unified CPU-GPU memory architecture I am not sure that is required.
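
In case it helps, here is what I was planning to try next. This is just a guess on my part; I am not sure the nvidia-peermem module or the NCCL_NET_GDR_LEVEL override even apply to GB10's unified memory:

# Check whether the GPUDirect RDMA kernel module is loaded (may not exist/apply on GB10)
lsmod | grep nvidia_peermem

# Ask NCCL to attempt GDR even across longer PCI paths (standard NCCL env var)
export NCCL_NET_GDR_LEVEL=SYS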

Thank you

Hi,
It looks like you have both CX-7 ports connected via cable, but the playbook is designed for one port only. Having both ports connected drops the speed to 100 Gbps. The bandwidth you are seeing is around 80 Gbps, so this is consistent.
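
For a quick sanity check, busbw is reported in GB/s, so multiplying by 8 converts it to a line rate you can compare against the port speed:

# 10.25 GB/s * 8 bits/byte ≈ 82 Gbit/s, i.e. ~82% of a single 100 Gbps port
echo '10.25 * 8' | bc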

Ah great, disconnecting one of the cables gave me a much more expected result:

Warning: Permanently added '169.254.174.230' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

# nccl-tests version 2.17.6 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 108332 on prior-node device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  42734 on posterior-node device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   391017   43.94   21.97       0   386256   44.48   22.24       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 22.1036
#
# Collective test concluded: all_gather_perf

Is there a way to configure the second port as a backup/failover link, rather than having it split the bandwidth in half by default?
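
For example, would something like an active-backup bond of the two ports work? This is just a guess on my part, not from the playbook, and I am not sure how RoCE behaves over a bond:

# Hypothetical: bond the two CX-7 ports in active-backup mode so the second
# port only carries traffic if the first one fails (names from ibdev2netdev above)
sudo nmcli con add type bond ifname bond0 bond.options "mode=active-backup,miimon=100"
sudo nmcli con add type ethernet ifname enp1s0f0np0 master bond0
sudo nmcli con add type ethernet ifname enp1s0f1np1 master bond0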

Alternatively, is there a way to use both connections simultaneously for an NCCL test? From a reliability standpoint, dual connections would be preferable to a single one.
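
Or, if striping across both links is possible, maybe pointing NCCL and UCX at both devices, something like the following? NCCL_IB_HCA is a documented NCCL variable, but I have not verified that this works on Spark:

# Hypothetical: expose both RoCE HCAs / netdevs so traffic can use both 100 Gbps links
export NCCL_IB_HCA=rocep1s0f0,rocep1s0f1
export NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1
export UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1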

We are updating our documentation on how to do this. Stay tuned.
