NCCL Test Bandwidth is only 3GB/s between 2 DGX Spark using QSFP cable

Got this QSFP cable: https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/qsfp-cable-0-4m-for-dgx-spark/?utm_source=nvidia
to connect two MSI EdgeXpert GB10 devices with Gen 4 SSDs.

Looking from the back, the first (leftmost) of the two ports is connected. The second (rightmost) port is open.

Netplan File 1

network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.10/24
      dhcp4: no
      mtu: 9000
    enP2p1s0f0np0:
      addresses:
        - 192.168.101.14/24
      dhcp4: no
      mtu: 9000

Netplan File 2

network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.11/24
      dhcp4: no
      dhcp6: no
      mtu: 9000
    enP2p1s0f0np0:
      addresses:
        - 192.168.101.15/24
      dhcp4: no
      mtu: 9000
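
Since both netplan files set mtu 9000, one thing that can be checked is whether jumbo frames actually pass end to end on each subnet. The following is a minimal sketch, not something taken from the thread: the helper names icmp_payload_for_mtu and check_jumbo are hypothetical, and it assumes the iputils ping that supports -M do (don't fragment). The payload size is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.

```shell
#!/usr/bin/env bash
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 20 - 8 ))
}

# Hypothetical helper: ping a peer with the don't-fragment bit set, so the
# ping fails if the path cannot carry a full-MTU packet.
check_jumbo() {
  local peer="$1" mtu="${2:-9000}"
  ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}

# Usage, run on 192.168.100.10 after `sudo netplan apply`:
#   check_jumbo 192.168.100.11
#   check_jumbo 192.168.101.15
```

If either ping reports that the message is too long, the MTU on one side of that link is probably not the 9000 configured above.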

ibdev2netdev on both machines

ssharlemin@edgexpert-4245:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
ssharlemin@edgexpert-3a77:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
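
The NCCL_IB_HCA and NCCL_SOCKET_IFNAME values used later in this thread are typed by hand from this output. As a rough sketch, the same lists could be derived from ibdev2netdev directly. The function names active_hcas and active_netdevs are hypothetical, and the parsing assumes the line format shown above (device name first, netdev fifth, state last).

```shell
#!/usr/bin/env bash
# Read `ibdev2netdev` output on stdin and print a comma-separated list of
# RDMA devices whose port is reported as (Up).
active_hcas() {
  awk '$NF == "(Up)" { print $1 }' | paste -sd, -
}

# Same idea, but print the matching Ethernet interface names instead.
active_netdevs() {
  awk '$NF == "(Up)" { print $5 }' | paste -sd, -
}

# Usage:
#   export NCCL_IB_HCA="$(ibdev2netdev | active_hcas)"
#   export NCCL_SOCKET_IFNAME="$(ibdev2netdev | active_netdevs)"
```

With the output above, this would give rocep1s0f0,roceP2p1s0f0 and enp1s0f0np0,enP2p1s0f0np0 on both machines.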

ibstat

ssharlemin@edgexpert-4245:~$ ibstat rocep1s0f0
CA 'rocep1s0f0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.45.4028
        Hardware version: 0
        Node GUID: 0xfc9d050300134246
        System image GUID: 0xfc9d050300134246
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xfe9d05fffe134246
                Link layer: Ethernet
ssharlemin@edgexpert-3a77:~$ ibstat rocep1s0f0
CA 'rocep1s0f0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.45.4028
        Hardware version: 0
        Node GUID: 0xfc9d050300133a78
        System image GUID: 0xfc9d050300133a78
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xfe9d05fffe133a78
                Link layer: Ethernet

ifconfig

ssharlemin@edgexpert-4245:~$ ifconfig enp1s0f0np0
enp1s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.100.10  netmask 255.255.255.0  broadcast 192.168.100.255
        inet6 fe80::fe9d:5ff:fe13:4246  prefixlen 64  scopeid 0x20<link>
        ether fc:9d:05:13:42:46  txqueuelen 1000  (Ethernet)
        RX packets 2677  bytes 664210 (664.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1253  bytes 142836 (142.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
ssharlemin@edgexpert-3a77:~$ ifconfig enp1s0f0np0
enp1s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.100.11  netmask 255.255.255.0  broadcast 192.168.100.255
        inet6 fe80::fe9d:5ff:fe13:3a78  prefixlen 64  scopeid 0x20<link>
        ether fc:9d:05:13:3a:78  txqueuelen 1000  (Ethernet)
        RX packets 9080  bytes 1474546 (1.4 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3132  bytes 513812 (513.8 KB)
        TX errors 0  dropped 4 overruns 0  carrier 0  collisions 0

NCCL Test

ssharlemin@edgexpert-4245:~$ export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
ssharlemin@edgexpert-4245:~$ export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
ssharlemin@edgexpert-4245:~$ export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
export NCCL_IB_DISABLE=0
ssharlemin@edgexpert-4245:~$ mpirun -np 2 -H 192.168.100.11:1,192.168.100.10:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Warning: Permanently added '192.168.100.11' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
#  Rank  0 Group  0 Pid  15669 on edgexpert-4245 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid 220793 on edgexpert-3a77 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1  2855968    6.02    3.01       0  2852334    6.02    3.01       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.00963
#
# Collective test concluded: all_gather_perf
#

The average bus bandwidth is only 3 GB/s, which is very low compared to what others are getting.

What am I missing? What else should I check? Is the cable the limiting factor?
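
One possible check at this point is which network transport NCCL actually selected, since a run that falls back to plain sockets would be much slower than one using RoCE. The sketch below is only an illustration: nccl_transports is a hypothetical helper, and it assumes that NCCL_DEBUG=INFO output contains lines of the form "Using network <name>", which may differ between NCCL versions. Note also that the mpirun command above forwards only LD_LIBRARY_PATH with -x, so variables exported in the local shell may not reach the remote rank unless they are forwarded the same way. Whether that was the cause here is not established in this thread.

```shell
#!/usr/bin/env bash
# Read NCCL_DEBUG=INFO output on stdin and print the distinct transports
# NCCL reported, one per line (for example "IB" or "Socket").
# Assumption: the log contains lines like "... NCCL INFO Using network IB".
nccl_transports() {
  sed -n 's/.*Using network \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Usage (same test as above, with the NCCL variables forwarded via -x):
#   mpirun -np 2 -H 192.168.100.11:1,192.168.100.10:1 \
#     -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO \
#     -x NCCL_IB_HCA -x NCCL_SOCKET_IFNAME -x NCCL_IB_DISABLE \
#     $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2 2>&1 \
#     | tee nccl.log | nccl_transports
```

If this prints Socket rather than IB, the RDMA devices are probably not being picked up on at least one rank.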

Both devices were added to Tailscale using NVIDIA Sync, and I have accessed the devices both directly and through NVIDIA Sync on the local network.

RDMA_Write BW Test

ssharlemin@edgexpert-3a77:~/dgx-spark-playbooks/nvidia/connect-two-sparks/assets$ ib_write_bw -d rocep1s0f0

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : rocep1s0f0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0128 PSN 0x5aba51 RKey 0x1843ec VAddr 0x00e3b193bd5000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:100:11
 remote address: LID 0000 QPN 0x0128 PSN 0xb07842 RKey 0x1843ec VAddr 0x00f8cb36d9d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:100:10
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             12552.57            12550.68                  0.200811
---------------------------------------------------------------------------------------

I get it: the single 200G port is physically bifurcated into two 100G lanes.

ssharlemin@edgexpert-4245:~/dgx-spark-playbooks/nvidia/connect-two-sparks/assets$ ib_write_bw -d roceP2p1s0f0 192.168.101.15
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : roceP2p1s0f0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x01a8 PSN 0x98d0e6 RKey 0x1a03ec VAddr 0x00f18fb215d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:14
 remote address: LID 0000 QPN 0x01a8 PSN 0x564716 RKey 0x1a03ec VAddr 0x00e9c29233d000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:15
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      5000             12555.05            12554.23                  0.200868
---------------------------------------------------------------------------------------

OK, got 16 GB/s now.

ssharlemin@edgexpert-3a77:~/dgx-spark-playbooks/nvidia/connect-two-sparks/assets$ mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Warning: Permanently added '192.168.100.10' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
#  Rank  0 Group  0 Pid   6614 on edgexpert-3a77 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  29228 on edgexpert-4245 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   532698   32.25   16.13       0   528880   32.48   16.24       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 16.1835
#
# Collective test concluded: all_gather_perf
#

I guess an update with the fix is incoming. Thanks, I was able to use the other forum posts to make progress.

A busbw of 16 GB/s is still not full speed.
You should do this:

sudo fwupdmgr enable-remote lvfs-testing
sudo fwupdmgr refresh --force
sudo fwupdmgr update

After updating the firmware,
you should get about 22 GB/s.
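
For context on these numbers, here is a small worked calculation. nccl-tests defines busbw for all_gather as algbw * (n - 1) / n, so with two ranks busbw is half of algbw. The comparison against 200 Gbit/s (25 GB/s) takes the "Rate: 200" reported by ibstat earlier in the thread as the ceiling, which is an assumption about what this link can really deliver. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# busbw for all_gather as defined by nccl-tests: algbw * (n - 1) / n.
allgather_busbw() {   # args: algbw in GB/s, number of ranks
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * (n - 1) / n }'
}

# Convert a link rate in Gbit/s to GB/s (8 bits per byte).
gbit_to_gbyte() {     # args: link rate in Gbit/s
  awk -v g="$1" 'BEGIN { printf "%.2f\n", g / 8 }'
}

# Percentage of the link's line rate that a measured busbw reaches.
pct_of_line_rate() {  # args: busbw in GB/s, link rate in Gbit/s
  awk -v b="$1" -v g="$2" 'BEGIN { printf "%.0f\n", 100 * b / (g / 8) }'
}

# allgather_busbw 6.02 2      -> 3.01
# gbit_to_gbyte 200           -> 25.00
# pct_of_line_rate 16.18 200  -> 65
```

By this measure the first run (3.01 GB/s) reached about 12% of the nominal line rate and the second run (16.18 GB/s) about 65%.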

Is this safe to use:

lvfs-testing

cc: @eugr

As the name suggests, I understand that the lvfs-testing channel is where firmware is uploaded for testing purposes. All uploaded firmware is posted directly by the OEM vendor, so we cannot guarantee 100% safety. However, since the uploads are made directly by the OEM vendor, I understand that you should contact them directly if any issues arise. For reference, I am currently using one NVIDIA FE and three EdgeXpert units without any problems.

I think it’s generally safe, but not officially supported by vendors.
I’m surprised some OEMs haven’t released the firmware with CX7 fixes yet.

Cool, are you using it? :)

Nope :) I want to keep my Sparks on official firmware, so my builds can be tested on the configuration that is used by a majority of users.

I was able to get 24 GB/s with the firmware updates. Thanks @s0ne @vgoklani @eugr

ssharlemin@edgexpert-4245:~/dgx-spark-playbooks/nvidia/connect-two-sparks/assets$ mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Warning: Permanently added '192.168.100.11' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

/home/ssharlemin/nccl-tests/build/all_gather_perf: error while loading shared libraries: libnccl.so.2: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/home/ssharlemin/nccl-tests/build/all_gather_perf: error while loading shared libraries: libnccl.so.2: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[46420,1],0]
  Exit code:    127
--------------------------------------------------------------------------
ssharlemin@edgexpert-4245:~/dgx-spark-playbooks/nvidia/connect-two-sparks/assets$ export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"

# Point out-of-band communication to the active Ethernet interfaces
export UCX_NET_DEVICES=enp1s0f0np0,enP2p1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0,enP2p1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0

# Bind NCCL to the bifurcated 100G logical rails for full 200G multi-rail throughput
export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
export NCCL_IB_DISABLE=0
ssharlemin@edgexpert-4245:~/dgx-spark-playbooks/nvidia/connect-two-sparks/assets$  mpirun -np 2 -H 192.168.100.10:1,192.168.100.11:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Warning: Permanently added '192.168.100.11' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5270 on edgexpert-4245 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   4060 on edgexpert-3a77 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   354669   48.44   24.22       0   353648   48.58   24.29       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.2545
#
# Collective test concluded: all_gather_perf
#
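
The libnccl.so.2 error in the first attempt above went away once LD_LIBRARY_PATH was exported again in the new shell. A quick pre-flight check along these lines might catch that before launching mpirun. This is only a sketch: lib_on_ld_path is a hypothetical helper, and it looks only at the directories in LD_LIBRARY_PATH, not at the system linker cache.

```shell
#!/usr/bin/env bash
# Return 0 if the named shared library exists in any directory listed in
# LD_LIBRARY_PATH, 1 otherwise. Does not consult the system linker cache.
lib_on_ld_path() {
  local lib="$1" dir
  local IFS=':'
  for dir in ${LD_LIBRARY_PATH:-}; do
    if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
      return 0
    fi
  done
  return 1
}

# Usage:
#   lib_on_ld_path libnccl.so.2 \
#     || echo "libnccl.so.2 not on LD_LIBRARY_PATH; re-export it first"
```

Running ldd on $HOME/nccl-tests/build/all_gather_perf and looking for "not found" entries is another way to see the same problem.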

I disabled the channel after the update, though:

sudo fwupdmgr disable-remote lvfs-testing
sudo fwupdmgr refresh --force
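
To confirm that the update actually changed the NIC firmware on both machines, the version reported by ibstat can be compared before and after. The ibstat output earlier in this thread shows 28.45.4028, captured before the update; the post-update version is not given here. The helper name ibstat_fw_version is hypothetical and the parsing assumes the "Firmware version: ..." line format shown above.

```shell
#!/usr/bin/env bash
# Extract the firmware version from `ibstat <device>` output on stdin.
ibstat_fw_version() {
  awk -F': *' '/Firmware version/ { print $2; exit }'
}

# Usage (run on each machine, before and after the update):
#   ibstat rocep1s0f0 | ibstat_fw_version
```

fwupdmgr get-remotes can be used afterwards to check that lvfs-testing is listed as disabled again.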

Thanks for all the responses :)

I was able to run a local model with ./run-recipe.sh qwen3.5-397b-int4-autoround.yaml --no-ray, using @eugr's eugr/spark-vllm-docker repository on GitHub (Docker configuration for running VLLM on dual DGX Sparks), before the update.

I plan to run QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ (from Hugging Face) in the next few hours and see how that goes :)

If I run into problems, I will create a new thread.