DGX Spark NCCL test: 10 GB/s, not the expected 200 Gbps (= 25 GB/s)

Hi everyone,

I followed the playbook in the NVIDIA guide (Try NVIDIA NIM APIs), but I am seeing noticeably lower bandwidth than expected. I am currently at about 41% utilization (roughly 10 GB/s out of the 25 GB/s line rate).

export PORT_NAME=enp1s0f0np0
export UCX_NET_DEVICES=$PORT_NAME
export NCCL_SOCKET_IFNAME=$PORT_NAME
export OMPI_MCA_btl_tcp_if_include=$PORT_NAME

export DEVICE_1_IP=169.254.155.221
export DEVICE_2_IP=169.254.174.230

mpirun -np 2 -H $DEVICE_1_IP:1,$DEVICE_2_IP:1 --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2

Results:

# nccl-tests version 2.17.6 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  96122 on prior-node device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  28364 on posterior-node device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   839441   20.47   10.23       0   836266   20.54   10.27       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 10.2523
#
# Collective test concluded: all_gather_perf

When I run with NCCL_DEBUG=INFO, I see:

NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'rocep1s0f0'
NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'rocep1s0f1'
NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'roceP2p1s0f0'
NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'roceP2p1s0f1'

The output of ibdev2netdev:

rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)

How do I increase my bandwidth utilization? My first thought was to enable GPUDirect RDMA (GDR), but with the new unified CPU-GPU memory architecture I am not sure that is required.
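
In case it helps, here is what I was planning to try next. This is just a guess on my part; I am not sure the nvidia-peermem module or the NCCL_NET_GDR_LEVEL override even apply to GB10's unified memory:

# Check whether the GPUDirect RDMA kernel module is loaded (may not exist/apply on GB10)
lsmod | grep nvidia_peermem

# Ask NCCL to attempt GDR even across longer PCI paths (standard NCCL env var)
export NCCL_NET_GDR_LEVEL=SYS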

Thank you

Hi,
It looks like you have both CX-7 ports connected via cable, but the playbook is designed for one port only. Having both ports connected drops the speed to 100 Gbps. The bandwidth you are seeing is around 80 Gbps, so this is consistent.
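
For a quick sanity check, busbw is reported in GB/s, so multiplying by 8 converts it to a line rate you can compare against the port speed:

# 10.25 GB/s * 8 bits/byte ≈ 82 Gbit/s, i.e. ~82% of a single 100 Gbps port
echo '10.25 * 8' | bc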

Ah great, disconnecting one of the cables gave me a much more expected result:

Warning: Permanently added '169.254.174.230' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

# nccl-tests version 2.17.6 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 108332 on prior-node device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  42734 on posterior-node device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   391017   43.94   21.97       0   386256   44.48   22.24       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 22.1036
#
# Collective test concluded: all_gather_perf

Is there a way to configure the second port as a backup/failover link, rather than having it split the bandwidth in half by default?
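
For example, would something like an active-backup bond of the two ports work? This is just a guess on my part, not from the playbook, and I am not sure how RoCE behaves over a bond:

# Hypothetical: bond the two CX-7 ports in active-backup mode so the second
# port only carries traffic if the first one fails (names from ibdev2netdev above)
sudo nmcli con add type bond ifname bond0 bond.options "mode=active-backup,miimon=100"
sudo nmcli con add type ethernet ifname enp1s0f0np0 master bond0
sudo nmcli con add type ethernet ifname enp1s0f1np1 master bond0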

Alternatively, is there a way to use both connections simultaneously for an NCCL test? From a reliability standpoint, dual connections would be preferable to a single one.
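
Or, if striping across both links is possible, maybe pointing NCCL and UCX at both devices, something like the following? NCCL_IB_HCA is a documented NCCL variable, but I have not verified that this works on Spark:

# Hypothetical: expose both RoCE HCAs / netdevs so traffic can use both 100 Gbps links
export NCCL_IB_HCA=rocep1s0f0,rocep1s0f1
export NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1
export UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1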

We are updating our documentation on how to do this. Stay tuned.
