Need some expert help.
I’ve been playing with NCCL. Everything compiled and built correctly, but the results I’m getting don’t seem right. I followed the playbook instructions, but I must be missing something. First, I’m not sure why I’m getting all these authorization messages, and more importantly, the avg bus bandwidth can’t be right. The playbook doesn’t say what the expected result should be.
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Warning: Permanently added '169.254.207.164' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
# nccl-tests version 2.17.9 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 4776 on spark device 0 [000f:01:00] NVIDIA GB10
# Rank 1 Group 0 Pid 4442 on spark device 0 [000f:01:00] NVIDIA GB10
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 4194304 float none -1 971.46 34.54 17.27 0 932.82 35.97 17.99 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 17.6278
#
# Collective test concluded: all_gather_perf
#
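On what to expect: going by the nccl-tests performance notes, busbw for all_gather is algbw * (n-1)/n, so with two ranks it is simply half of algbw. The ceiling is the link itself; ibstat further down this thread reports Rate: 200 (Gb/s), which is 25 GB/s before any protocol overhead. A quick sketch of that arithmetic (both function names are made up for illustration):

```shell
#!/usr/bin/env bash
# busbw for all_gather per the nccl-tests performance notes: algbw * (n-1)/n.
# busbw_all_gather is a hypothetical helper name, not part of nccl-tests.
busbw_all_gather() {
  local algbw="$1" nranks="$2"
  awk -v a="$algbw" -v n="$nranks" 'BEGIN { printf "%.2f\n", a * (n - 1) / n }'
}

# Line rate in GB/s for a link speed given in Gb/s (8 bits per byte).
line_rate_gbytes() {
  awk -v g="$1" 'BEGIN { printf "%.2f\n", g / 8 }'
}

busbw_all_gather 34.54 2   # out-of-place algbw from the run above -> 17.27
line_rate_gbytes 200       # 200 Gb/s link -> 25.00
```

So the 17.27 GB/s in the table is internally consistent, but it is only about 69% of what a single 200 Gb/s link could carry.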
I followed the instructions for the NCCL setup, and I see you included two additional environment variables. Unfortunately, I am still getting the same issue. Below is the command I’m running.
Can you run a quick test on your setup to see if you’re getting a similar result?
Networking is still a new topic for me, so any help would be much appreciated.
Please, set up networking properly, preferably with static IPs. You can follow my guide to do that.
Your IP addresses seem to be autoconfigured and are from different subnets.
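To spot this quickly: anything in 169.254.0.0/16 is an IPv4 link-local (autoconfigured) address, like the 169.254.207.164 in the log above. A rough shell check, with hypothetical helper names, that only handles the simple /24 case (the 192.168.177.x pair is the static addressing used later in this thread):

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a quick sanity check; /24 only, no general CIDR math.

# Succeeds if the address is IPv4 link-local (169.254.0.0/16).
is_link_local() {
  case "$1" in
    169.254.*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Succeeds if two dotted-quad addresses share the same first three octets.
same_slash24() {
  [ "${1%.*}" = "${2%.*}" ]
}

is_link_local 169.254.207.164 && echo "169.254.207.164 is autoconfigured"
same_slash24 192.168.177.11 192.168.177.12 && echo "static pair shares a /24"
```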
I followed the instructions from the playbook, Connect Two Sparks | DGX Spark, but chose the automatic IP assignment. Let me try setting up static IPs and report back. I’m not a network expert; the two Sparks can communicate with one another, but as far as I understand IB, the performance is lower than it should be.
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
Here’s the 40-cx7.yaml on the first Spark; the second Spark uses 192.168.177.12 instead.
$ sudo cat 40-cx7.yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no # Explicitly disable DHCPv6
      link-local: [ ipv4 ]
      mtu: 9000
      addresses: [192.168.177.11/24]
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: [ ipv4 ]
      mtu: 9000
I set NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 on each Spark and got the error below when running the ib_write_bw test:
IB device enp1s0f0np0 not found
Unable to find the Infiniband/RoCE device
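That error suggests ib_write_bw was handed the Ethernet interface name; its -d option takes the RDMA device name (the left-hand column of ibdev2netdev). A small sketch for looking up one from the other, fed here with the mapping pasted above so it runs anywhere (rdma_dev_for_netdev is a made-up name):

```shell
#!/usr/bin/env bash
# Print the RDMA device paired with a netdev, reading ibdev2netdev-style lines
# ("<rdma_dev> port <n> ==> <netdev> (<state>)") from stdin.
rdma_dev_for_netdev() {
  awk -v nd="$1" '$4 == "==>" && $5 == nd { print $1 }'
}

# Sample input copied from this thread.
sample='rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)'

printf '%s\n' "$sample" | rdma_dev_for_netdev enp1s0f0np0   # prints rocep1s0f0
```

On the Spark itself that would be `ibdev2netdev | rdma_dev_for_netdev enp1s0f0np0`, and the result is what goes after -d.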
Next I tried the mpirun test anyway and got the following. Performance is now actually worse than before, with an average of 16.23 GB/s compared to 17.62 GB/s.
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
export NCCL_IB_DISABLE=0
$ mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Warning: Permanently added '192.168.177.12' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
# nccl-tests version 2.17.9 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 8518 on spark device 0 [000f:01:00] NVIDIA GB10
# Rank 1 Group 0 Pid 6844 on spark device 0 [000f:01:00] NVIDIA GB10
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
17179869184 2147483648 float none -1 526694 32.62 16.31 0 531836 32.30 16.15 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 16.2303
#
# Collective test concluded: all_gather_perf
#
Can you post the output of these commands on both Sparks?
ibstat rocep1s0f0
ifconfig enp1s0f0np0
I assume you use the same port on both Sparks? What cable are you using?
Also, you can try plugging into the outer port on BOTH and reconfiguring the interfaces.
These interfaces will be up:
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
Here is what I have:
ibstat
eugr@spark:~/nccl-tests$ ibstat rocep1s0f1
CA 'rocep1s0f1'
CA type: MT4129
Number of ports: 1
Firmware version: 28.45.4028
Hardware version: 0
Node GUID: 0x4cbb4703002c5e2e
System image GUID: 0x4cbb4703002c5e2d
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x4ebb47fffe2c5e2e
Link layer: Ethernet
These are two different settings, and the OP sets both. Actually, looking at OP’s post, I don’t see NCCL_IB_HCA being set before running the test.
NCCL_SOCKET_IFNAME is used only for the control channel in IB mode.
NCCL_IB_HCA specifies the RoCE interface(s) to use for the actual RDMA traffic; without it, NCCL will fall back to Ethernet, which massively increases latency. And if both RoCE twins are not listed there, the speed will be lower too.
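One way to avoid typos in that list is to build NCCL_IB_HCA from whatever ibdev2netdev reports as Up. A sketch, assuming the usual ibdev2netdev line format and a hypothetical helper name (up_hcas); the sample text is the outer-port mapping from this thread:

```shell
#!/usr/bin/env bash
# Join the RDMA devices whose port is (Up) into a comma-separated list.
# Reads ibdev2netdev-style lines from stdin.
up_hcas() {
  awk '$4 == "==>" && $6 == "(Up)" { print $1 }' | paste -s -d, -
}

# Sample input copied from this thread (cable in the outer port).
sample='rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)'

NCCL_IB_HCA="$(printf '%s\n' "$sample" | up_hcas)"
export NCCL_IB_HCA
echo "$NCCL_IB_HCA"   # rocep1s0f1,roceP2p1s0f1
```

With mpirun the variable also has to reach the remote rank; Open MPI's -x flag, already used in this thread for LD_LIBRARY_PATH, does that (-x NCCL_IB_HCA).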
I see that your NCCL test is also throwing “Authorization required, but no authorization protocol specified” errors. Is that normal?
And your avg bus bandwidth (GB/s) is way higher than what I’m getting.
Here are the results of ibstat and ifconfig for both Sparks. They appear similar to yours.
$ ibstat rocep1s0f0
CA 'rocep1s0f0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.45.4028
Hardware version: 0
Node GUID: 0x4cbb4703007cf4e2
System image GUID: 0x4cbb4703007cf4e2
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x4ebb47fffe7cf4e2
Link layer: Ethernet
$ ifconfig enp1s0f0np0
enp1s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.177.11 netmask 255.255.255.0 broadcast 192.168.177.255
inet6 fe80::4ebb:47ff:fe7c:f4e2 prefixlen 64 scopeid 0x20<link>
ether 4c:bb:47:7c:f4:e2 txqueuelen 1000 (Ethernet)
RX packets 29034 bytes 5145920 (5.1 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 41558 bytes 235352789 (235.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
$ ibstat rocep1s0f0
CA 'rocep1s0f0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.45.4028
Hardware version: 0
Node GUID: 0x4cbb4703007e74f7
System image GUID: 0x4cbb4703007e74f7
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x4ebb47fffe7e74f7
Link layer: Ethernet
$ ifconfig enp1s0f0np0
enp1s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.177.12 netmask 255.255.255.0 broadcast 192.168.177.255
inet6 fe80::4ebb:47ff:fe7e:74f7 prefixlen 64 scopeid 0x20<link>
ether 4c:bb:47:7e:74:f7 txqueuelen 1000 (Ethernet)
RX packets 57707 bytes 239142878 (239.1 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 13044 bytes 1383836 (1.3 MB)
TX errors 0 dropped 4 overruns 0 carrier 0 collisions 0
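For comparing the two nodes it can be easier to boil each ibstat dump down to the three fields that matter here. A sketch with a made-up helper, ibstat_summary, that reads ibstat text on stdin; the sample is trimmed from the output above:

```shell
#!/usr/bin/env bash
# Print "<State> <Rate> <Link layer>" from ibstat-style "Key: value" lines.
ibstat_summary() {
  awk -F': *' '
    { sub(/^[ \t]+/, "", $1) }          # real ibstat output indents its fields
    $1 == "State"      { state = $2 }   # "Physical state" does not match this
    $1 == "Rate"       { rate  = $2 }
    $1 == "Link layer" { layer = $2 }
    END { print state, rate, layer }
  '
}

sample='CA type: MT4129
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Link layer: Ethernet'

printf '%s\n' "$sample" | ibstat_summary   # prints: Active 200 Ethernet
```

On each Spark this would be run as `ibstat rocep1s0f0 | ibstat_summary`; both nodes in this thread come out as Active, 200, Ethernet.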
I see your configs are the same as in eugr’s instructions, though I’ve connected the cable to the left port on both Sparks (when facing them from the back).
I followed the playbook, Connect Two Sparks | DGX Spark, and used the automatic IP assignment method, where they set enp1s0f0np0 and enp1s0f1np1, both in lowercase. So now I’m scratching my head.
Even with the messed-up addresses, it was sort of working. I could use llama.cpp with RPC, but not at the speed I was expecting.
I’m going to roll back and start again using the static IP method, as it seems this is what most people here are doing. Fingers crossed.
If you are just trying to get the left port connected (enp1s0f0np0/enP2p1s0f0np0), then don’t set the right port in the configuration file (enp1s0f1np1/enP2p1s0f1np1).
The playbook is a little confusing because it lists them in a weird order. It also sets the twins (enP2) to their own IP, which I don’t get. When you do it that way, you get duplicates on discovery.
enp1s0f0np0 = Left Port
enp1s0f1np1 = Right Port
If you were troubleshooting and happened to set any IP routes, make sure to either reboot or clear those out as well.
So I rebooted the Sparks and followed the instructions, but used the manual config with adjustments to 40-cx7.yaml from eugr and Keyper-AI. Below are the steps in sequence.
node 1
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no # Explicitly disable DHCPv6
      link-local: [ ipv4 ]
      mtu: 9000
      addresses: [192.168.177.11/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: [ ipv4 ]
      mtu: 9000
node 2
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no # Explicitly disable DHCPv6
      link-local: [ ipv4 ]
      mtu: 9000
      addresses: [192.168.177.12/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: [ ipv4 ]
      mtu: 9000
For NCCL, set up the following environment variables:
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
NCCL test - now getting similar results to eugr, though I’m not sure why I’m still getting the “no authorization protocol specified” messages (errors?)
mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
--mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
$HOME/nccl-tests/build/all_gather_perf
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Warning: Permanently added '192.168.177.11' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
# nccl-tests version 2.17.9 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 8727 on node1 device 0 [000f:01:00] NVIDIA GB10
# Rank 1 Group 0 Pid 10473 on node2 device 0 [000f:01:00] NVIDIA GB10
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
17179869184 2147483648 float none -1 360035 47.72 23.86 0 354416 48.47 24.24 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 24.0478
#
# Collective test concluded: all_gather_perf
#
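To compare runs without eyeballing the log, the summary line can be pulled out and set against the 25 GB/s that a 200 Gb/s link can carry at most. A sketch with hypothetical helper names; the three values are the averages posted in this thread:

```shell
#!/usr/bin/env bash
# Extract the number from the "# Avg bus bandwidth : X" line of nccl-tests output.
avg_busbw() {
  awk -F': *' '/Avg bus bandwidth/ { print $2 }'
}

# Percentage of line rate; link speed is given in Gb/s, busbw in GB/s.
pct_of_line_rate() {
  awk -v bw="$1" -v gbits="$2" 'BEGIN { printf "%.1f\n", 100 * bw / (gbits / 8) }'
}

for bw in 17.6278 16.2303 24.0478; do
  echo "$bw GB/s -> $(pct_of_line_rate "$bw" 200)% of a 200 Gb/s link"
done
```

A saved log can be fed in as `avg_busbw < run.log` (run.log being whatever file the mpirun output was redirected to).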
Note to self: don’t run the mpirun test at the same time on both Sparks.
IB test - not resolved, more research needed
$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0
$ ib_write_bw 192.168.177.12 -d rocep1s0f0 --report_gbits -q 4 -R --force-link IB
Unexpected CM event bl blka 8
Unable to perform rdma_client function
Unable to init the socket connection