I got this QSFP cable: https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/qsfp-cable-0-4m-for-dgx-spark/?utm_source=nvidia
to connect two MSI EdgeXpert GB10 devices, each with a Gen 4 SSD.
Looking from the back, the first (leftmost) of the two QSFP ports is connected. The rightmost (second) port is left open.
Netplan File 1
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.10/24
      dhcp4: no
      mtu: 9000
    enP2p1s0f0np0:
      addresses:
        - 192.168.101.14/24
      dhcp4: no
      mtu: 9000
Netplan File 2
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.11/24
      dhcp4: no
      dhcp6: no
      mtu: 9000
    enP2p1s0f0np0:
      addresses:
        - 192.168.101.15/24
      dhcp4: no
      mtu: 9000
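With mtu: 9000 set on both sides, one quick sanity check is whether jumbo frames actually pass end to end on each subnet. The following is a minimal sketch, assuming the addresses from the two netplan files; the helper name icmp_payload is made up for illustration. It computes the largest ICMP payload that fits in one frame (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header) and prints ping commands with fragmentation prohibited, rather than running them.

```shell
# Largest ICMP payload that fits in a single frame for a given MTU:
# MTU - 20 (IPv4 header) - 8 (ICMP header). Hypothetical helper name.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Print (not run) one ping per peer address with the "don't fragment" option.
# The peer addresses are the ones from Netplan File 2; swap them for the
# File 1 addresses when testing from the other machine.
for peer in 192.168.100.11 192.168.101.15; do
    echo "ping -M do -s $(icmp_payload 9000) -c 3 $peer"
done
```

If one of the printed pings fails while a plain ping to the same address works, the 9000-byte MTU is probably not in effect somewhere on that path.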
ibdev2netdev on both machines
ssharlemin@edgexpert-4245:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
ssharlemin@edgexpert-3a77:~$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
ibstat
ssharlemin@edgexpert-4245:~$ ibstat rocep1s0f0
CA 'rocep1s0f0'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.45.4028
    Hardware version: 0
    Node GUID: 0xfc9d050300134246
    System image GUID: 0xfc9d050300134246
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xfe9d05fffe134246
        Link layer: Ethernet
ssharlemin@edgexpert-3a77:~$ ibstat rocep1s0f0
CA 'rocep1s0f0'
    CA type: MT4129
    Number of ports: 1
    Firmware version: 28.45.4028
    Hardware version: 0
    Node GUID: 0xfc9d050300133a78
    System image GUID: 0xfc9d050300133a78
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0xfe9d05fffe133a78
        Link layer: Ethernet
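ibstat reports an Active link at Rate: 200 on both sides, but that is only the negotiated link rate. A rough way to see what RDMA actually delivers, independent of NCCL and MPI, would be a point-to-point run with ib_write_bw from the perftest package. This is a sketch under assumptions: perftest is assumed to be installed (it is not shown anywhere in this post), and the helper name ib_bw_cmd is hypothetical. The script only prints the commands. On RoCE an explicit GID index may also be needed, which is left out here.

```shell
# Hypothetical helper: prints an ib_write_bw command line for one RDMA device.
# With no peer argument it is the server side; with a peer IP it is the client side.
ib_bw_cmd() {
    dev=$1
    peer=${2:-}
    echo "ib_write_bw -d $dev --report_gbits${peer:+ $peer}"
}

# On edgexpert-4245 (server side):
ib_bw_cmd rocep1s0f0
# On edgexpert-3a77 (client side), pointing at the server's address on the same link:
ib_bw_cmd rocep1s0f0 192.168.100.10
```

The same pair of commands could be repeated for roceP2p1s0f0 with the 192.168.101.x addresses to compare the two devices that report Up.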
ifconfig
ssharlemin@edgexpert-4245:~$ ifconfig enp1s0f0np0
enp1s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
        inet 192.168.100.10 netmask 255.255.255.0 broadcast 192.168.100.255
        inet6 fe80::fe9d:5ff:fe13:4246 prefixlen 64 scopeid 0x20<link>
        ether fc:9d:05:13:42:46 txqueuelen 1000 (Ethernet)
        RX packets 2677 bytes 664210 (664.2 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 1253 bytes 142836 (142.8 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ssharlemin@edgexpert-3a77:~$ ifconfig enp1s0f0np0
enp1s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
        inet 192.168.100.11 netmask 255.255.255.0 broadcast 192.168.100.255
        inet6 fe80::fe9d:5ff:fe13:3a78 prefixlen 64 scopeid 0x20<link>
        ether fc:9d:05:13:3a:78 txqueuelen 1000 (Ethernet)
        RX packets 9080 bytes 1474546 (1.4 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 3132 bytes 513812 (513.8 KB)
        TX errors 0 dropped 4 overruns 0 carrier 0 collisions 0
NCCL Test
ssharlemin@edgexpert-4245:~$ export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
ssharlemin@edgexpert-4245:~$ export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
ssharlemin@edgexpert-4245:~$ export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
export NCCL_IB_DISABLE=0
ssharlemin@edgexpert-4245:~$ mpirun -np 2 -H 192.168.100.11:1,192.168.100.10:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Warning: Permanently added '192.168.100.11' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified
# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
# Rank 0 Group 0 Pid 15669 on edgexpert-4245 device 0 [000f:01:00] NVIDIA GB10
# Rank 1 Group 0 Pid 220793 on edgexpert-3a77 device 0 [000f:01:00] NVIDIA GB10
#
#                                                             out-of-place                      in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1  2855968    6.02    3.01      0  2852334    6.02    3.01      0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 3.00963
#
# Collective test concluded: all_gather_perf
#
The average bus bandwidth is only about 3 GB/s, which is very low compared to what others are getting.
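To put the 3 GB/s in context, the reported figures can be recomputed from the raw numbers and compared against the link rate. This is a small sketch, assuming the nccl-tests convention that all_gather bus bandwidth is algbw scaled by (n - 1) / n, and that Rate: 200 from ibstat means 200 Gb/s; the function name bw_report is made up for illustration.

```shell
# Values copied from the all_gather_perf output line and from ibstat.
size_bytes=17179869184   # message size in bytes
time_us=2855968          # out-of-place time in microseconds
nranks=2                 # one GPU per machine
link_gbps=200            # "Rate: 200" from ibstat, assumed to be Gb/s

# Hypothetical helper: recompute algbw and busbw and compare with the link rate.
bw_report() {
    awk -v s="$1" -v t="$2" -v n="$3" -v r="$4" 'BEGIN {
        algbw = s / t / 1000          # bytes per microsecond -> GB/s
        busbw = algbw * (n - 1) / n   # all_gather correction factor
        line  = r / 8                 # Gb/s -> GB/s
        printf "algbw=%.2f busbw=%.2f line_rate=%.2f busbw_share=%.0f%%\n",
               algbw, busbw, line, 100 * busbw / line
    }'
}

bw_report "$size_bytes" "$time_us" "$nranks" "$link_gbps"
```

The recomputed algbw and busbw match the 6.02 and 3.01 in the test output, and the bus bandwidth comes out at roughly 12% of the 25 GB/s that a 200 Gb/s link could carry in theory.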
What am I missing? What else should I check? Is the cable the limiting factor?
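One thing that could be checked is which transport NCCL actually selects. In the mpirun command above only LD_LIBRARY_PATH is passed with -x, and as far as I understand Open MPI, other exported variables such as NCCL_IB_HCA are not necessarily forwarded to the remote rank. The sketch below builds (but does not run) a variant of the same command that forwards the NCCL settings and adds NCCL_DEBUG=INFO. The helper name build_cmd and the 1G message size are assumptions chosen for a short debug run, and the plm_rsh_agent option is omitted for brevity.

```shell
# Environment variables to forward to both ranks with mpirun's -x option.
# NCCL_DEBUG=INFO is the only addition; the rest are the values exported earlier.
FORWARD_VARS="LD_LIBRARY_PATH NCCL_DEBUG=INFO NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 NCCL_SOCKET_IFNAME=enp1s0f0np0 NCCL_IB_DISABLE=0 UCX_NET_DEVICES=enp1s0f0np0"

# Hypothetical helper: builds the mpirun command line as a string (dry run only).
build_cmd() {
    cmd="mpirun -np 2 -H 192.168.100.11:1,192.168.100.10:1"
    for v in $FORWARD_VARS; do
        cmd="$cmd -x $v"
    done
    # A smaller message size keeps the debug run short.
    echo "$cmd \$HOME/nccl-tests/build/all_gather_perf -b 1G -e 1G -f 2"
}

build_cmd
```

In the resulting log, lines such as "Using network IB" or "Using network Socket" should indicate whether the traffic goes over RoCE or falls back to TCP sockets.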
Both devices were added to Tailscale using NVIDIA Sync. I have accessed the devices directly and also through NVIDIA Sync on the local network.