Why is my NCCL broken?

Need some expert help.
I’ve been playing with NCCL. Everything compiled and built correctly, but the results I received don’t seem right. I followed the playbook instructions, but I must be missing something. Firstly, I’m not sure why I am getting all these authorisation problems, and more importantly, the avg bus bandwidth can’t be right. The playbook doesn’t say what the expected result should be.

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Warning: Permanently added '169.254.207.164' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

# nccl-tests version 2.17.9 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   4776 on spark device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   4442 on spark device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
    33554432       4194304     float    none      -1   971.46   34.54   17.27       0   932.82   35.97   17.99       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 17.6278 
#
# Collective test concluded: all_gather_perf
#
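
On the question of what result to expect: the nccl-tests documentation defines all_gather bus bandwidth as algbw * (n - 1) / n, so with two ranks busbw is half of algbw. A rough ceiling can be worked out from the link rate. The sketch below assumes a single 200 Gb/s link (the Rate that ibstat reports further down this thread); the function names are made up for illustration.

```shell
# busbw for all_gather as documented by nccl-tests: algbw * (n - 1) / n
# $1 = algbw in GB/s, $2 = number of ranks
allgather_busbw() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * (n - 1) / n }'
}

# Convert a link rate in Gb/s (the ibstat "Rate" field) to GB/s
gbits_to_gbytes() {
  awk -v r="$1" 'BEGIN { printf "%.2f\n", r / 8 }'
}

allgather_busbw 34.54 2   # prints 17.27, matching the out-of-place busbw above
gbits_to_gbytes 200       # prints 25.00, the assumed link ceiling in GB/s
```

Under that assumption, a busbw close to 25 GB/s would mean the link is nearly saturated, while 17.6 GB/s leaves noticeable headroom.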

Maybe this will help: spark-vllm-docker/docs/NETWORKING.md at main · eugr/spark-vllm-docker · GitHub

Thanks @eugr

I followed the instructions for the NCCL setup, and I see you included two additional environment variables. Unfortunately I am still getting the same issue. Below is the command I’m running.

mpirun -np 2 -H 169.254.207.164:1,169.254.38.240:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf

Can you run a quick test on your setup to see if you’re getting a similar result?
Networking is still a new topic for me, so any help will be much appreciated.

How did you set up your /etc/netplan/ files? The IP addresses in your mpirun command look suspicious.

Please set up networking properly, preferably with static IPs. You can follow my guide to do that.
Your IP addresses seem to be autoconfigured and are from different subnets.
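
A quick way to tell whether an address was autoconfigured: IPv4 link-local addresses all come from 169.254.0.0/16. The helper below is only an illustrative sketch (the function name is made up), and 192.168.177.11 is used as the static example because it appears later in this thread.

```shell
# Return success if the IPv4 address is in the link-local range 169.254.0.0/16
is_link_local() {
  case "$1" in
    169.254.*) return 0 ;;
    *) return 1 ;;
  esac
}

for ip in 169.254.207.164 169.254.38.240 192.168.177.11; do
  if is_link_local "$ip"; then
    echo "$ip link-local (autoconfigured)"
  else
    echo "$ip not link-local"
  fi
done
```

Both addresses from the mpirun command above are reported as link-local; the static one is not.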

Thanks Chris and Eugr,

I followed the instructions from the playbook, Connect Two Sparks | DGX Spark, though I chose the automatic IP assignment. Let me try setting up the static IPs and report back. I’m not a network expert, but the two Sparks can communicate with one another; the performance is just lower than what I understood to expect from IB.

Still no luck. I’m completely lost. Help!

ibdev2netdev on both sparks

rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

Here’s the 40-cx7.yaml on the first Spark; the second Spark has the same file with 192.168.177.12 as its IP.

$ sudo cat 40-cx7.yaml 
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no              # Explicitly disable DHCPv6
      link-local: [ ipv4 ]
      mtu: 9000
      addresses: [192.168.177.11/24]
    enp1s0f1np1:
      dhcp4: no
      dhcp6: no
      link-local: [ ipv4 ]
      mtu: 9000

I set NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 on both, then got the error below when running the ib_write_bw test:

IB device enp1s0f0np0 not found
Unable to find the Infiniband/RoCE device
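
The error names enp1s0f0np0, which is a network interface name, while the ib_write_bw command at the end of this thread passes the RoCE device name (rocep1s0f0) to -d. That suggests the interface name was given where the device name belongs. The sketch below looks up the device name from ibdev2netdev-style output; the function name is made up, and the sample text is copied from this thread so the snippet runs without the hardware.

```shell
# Sample ibdev2netdev output copied from this thread. On a real Spark, pipe
# the live command instead: ibdev2netdev | ibdev_for_netdev enp1s0f0np0
sample='rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)'

# $1 = network interface name; prints the RoCE device mapped to it
ibdev_for_netdev() {
  awk -v nd="$1" '$5 == nd { print $1 }'
}

printf '%s\n' "$sample" | ibdev_for_netdev enp1s0f0np0   # prints rocep1s0f0
```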

Next I tried the mpirun test anyway and got the following. I’m now actually getting worse performance than before, with an average of 16.23 GB/s compared to 17.63 GB/s.

export CUDA_HOME="/usr/local/cuda" 
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"

export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0
export NCCL_IB_DISABLE=0

$ mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2

Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Warning: Permanently added '192.168.177.12' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified

# nccl-tests version 2.17.9 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   8518 on spark device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid   6844 on spark device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
 17179869184    2147483648     float    none      -1   526694   32.62   16.31       0   531836   32.30   16.15       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 16.2303 
#
# Collective test concluded: all_gather_perf
#

Can you post the output of these commands on both Sparks?

  • ibstat rocep1s0f0
  • ifconfig enp1s0f0np0

I assume you use the same port on both Sparks? What cable are you using?

Also, you can try plugging into the outer port on BOTH and reconfiguring the interfaces.
These interfaces will be up:

rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)

Here is what I have:

ibstat

eugr@spark:~/nccl-tests$ ibstat rocep1s0f1
CA 'rocep1s0f1'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.45.4028
        Hardware version: 0
        Node GUID: 0x4cbb4703002c5e2e
        System image GUID: 0x4cbb4703002c5e2d
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x4ebb47fffe2c5e2e
                Link layer: Ethernet

ifconfig

eugr@spark:~/nccl-tests$ ifconfig enp1s0f1np1
enp1s0f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.177.11  netmask 255.255.255.0  broadcast 192.168.177.255
        inet6 fe80::4ebb:47ff:fe2c:5e2e  prefixlen 64  scopeid 0x20<link>
        ether 4c:bb:47:2c:5e:2e  txqueuelen 1000  (Ethernet)
        RX packets 38761131  bytes 317651620993 (317.6 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 7323821  bytes 815643586 (815.6 MB)
        TX errors 0  dropped 2 overruns 0  carrier 0  collisions 0

NCCL test

eugr@spark:~/nccl-tests$ # Set environment variables
export CUDA_HOME="/usr/local/cuda"
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi"
export NCCL_HOME="$HOME/nccl/build/"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"
# Set network interface environment variables (use your active interface)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0

# Run the all_gather performance test across both nodes
mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Warning: Permanently added '192.168.177.12' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

# nccl-tests version 2.17.6 nccl-headers=22803 nccl-library=22803
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 568670 on      spark device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  42502 on     spark2 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
 17179869184    2147483648     float    none      -1   353056   48.66   24.33       0   352422   48.75   24.37       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.3521
#
# Collective test concluded: all_gather_perf

@kim.dang

Your netplan setup looks incorrect. Assuming the physical port on each Spark is plugged into the left one…

Just change enp1s0f1np1 to enP2p1s0f0np0 on both sparks in the 40-cx7.yaml file.

@kim.dang check that your interfaces have transport set to InfiniBand and link_layer set to Ethernet.

For the interface with the assigned IP (the one used for testing), run ibv_devinfo -d rocep1s0f0, then post the output. It should look like this:

elsaco@spark2:~$ ibv_devinfo -d rocep1s0f0
hca_id:	rocep1s0f0
	transport:			InfiniBand (0)
	fw_ver:				28.45.4028
	node_guid:			4cbb:4703:002d:a85d
	sys_image_guid:		4cbb:4703:002d:a85d
	vendor_id:			0x02c9
	vendor_part_id:		4129
	hw_ver:				0x0
	board_id:			NVD0000000087
	phys_port_cnt:			1
		port:	1
			state:			PORT_DOWN (1)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
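
To compare the relevant fields across both Sparks without reading the full dump, something like the following could be used. It is a sketch only: the function name is made up, and the sample lines are taken from the output above. Note that the sample shows PORT_DOWN; a port with a live link would normally report PORT_ACTIVE.

```shell
# Summarise transport, port state and link layer from ibv_devinfo output.
# On a real Spark: ibv_devinfo -d rocep1s0f0 | devinfo_summary
devinfo_summary() {
  awk '
    $1 == "transport:"  { t = $2 }
    $1 == "state:"      { s = $2 }
    $1 == "link_layer:" { l = $2 }
    END { printf "transport=%s state=%s link_layer=%s\n", t, s, l }
  '
}

# Sample lines copied from the output above
sample='hca_id: rocep1s0f0
    transport:          InfiniBand (0)
        state:          PORT_DOWN (1)
        link_layer:     Ethernet'

printf '%s\n' "$sample" | devinfo_summary
# prints: transport=InfiniBand state=PORT_DOWN link_layer=Ethernet
```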

Also, why not use NCCL_SOCKET_IFNAME instead of NCCL_IB_HCA, like in the connection test playbook?

From Environment Variables — NCCL 2.29.1 documentation :

The NCCL_SOCKET_IFNAME variable specifies which IP interfaces to use for communication

These are two different settings, and the OP sets both. Actually, looking at OP’s post, I don’t see NCCL_IB_HCA being set before running the test.

NCCL_SOCKET_IFNAME is used only for the control channel in IB mode.
NCCL_IB_HCA specifies the RoCE interface(s) to use for the actual RDMA traffic; without it, NCCL will fall back to Ethernet, which massively increases latency. And if both RoCE twins are not listed there, the speed will be lower too.
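
As a sketch of how the NCCL_IB_HCA value could be derived rather than typed by hand, the snippet below collects every RoCE device whose port is Up in ibdev2netdev-style output, which picks up both twins of the connected port. The function name is made up, and the sample text is the OP’s ibdev2netdev output so the snippet runs without the hardware.

```shell
# Sample ibdev2netdev output from this thread. On a real Spark use:
#   export NCCL_IB_HCA="$(ibdev2netdev | up_hcas)"
sample='rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)'

# Print the RoCE devices whose port is Up as a comma-separated list
up_hcas() {
  awk '$NF == "(Up)" { print $1 }' | paste -s -d, -
}

NCCL_IB_HCA="$(printf '%s\n' "$sample" | up_hcas)"
export NCCL_IB_HCA
echo "$NCCL_IB_HCA"   # prints rocep1s0f0,roceP2p1s0f0
```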

@eugr

This is the cable I am using, at 0.5 m.

I see that your NCCL test is also throwing “Authorization required, but no authorization protocol specified” errors. Is that normal?
And your avg bus bandwidth in GB/s is way higher than what I’m getting.

Here are the results of ibstat and ifconfig for both Sparks. They appear similar to yours.

$ ibstat rocep1s0f0 
CA 'rocep1s0f0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.45.4028
        Hardware version: 0
        Node GUID: 0x4cbb4703007cf4e2
        System image GUID: 0x4cbb4703007cf4e2
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x4ebb47fffe7cf4e2
                Link layer: Ethernet
$ ifconfig enp1s0f0np0
enp1s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.177.11  netmask 255.255.255.0  broadcast 192.168.177.255
        inet6 fe80::4ebb:47ff:fe7c:f4e2  prefixlen 64  scopeid 0x20<link>
        ether 4c:bb:47:7c:f4:e2  txqueuelen 1000  (Ethernet)
        RX packets 29034  bytes 5145920 (5.1 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 41558  bytes 235352789 (235.3 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
$ ibstat rocep1s0f0 
CA 'rocep1s0f0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.45.4028
        Hardware version: 0
        Node GUID: 0x4cbb4703007e74f7
        System image GUID: 0x4cbb4703007e74f7
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x4ebb47fffe7e74f7
                Link layer: Ethernet
$ ifconfig enp1s0f0np0
enp1s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.177.12  netmask 255.255.255.0  broadcast 192.168.177.255
        inet6 fe80::4ebb:47ff:fe7e:74f7  prefixlen 64  scopeid 0x20<link>
        ether 4c:bb:47:7e:74:f7  txqueuelen 1000  (Ethernet)
        RX packets 57707  bytes 239142878 (239.1 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13044  bytes 1383836 (1.3 MB)
        TX errors 0  dropped 4 overruns 0  carrier 0  collisions 0

Hey Kim,

Once again, your netplan config is incorrect

@eugr correct me if I am wrong here.

Config looks normal so far.
However, in your previous test results I don’t see you setting NCCL_IB_HCA.

Can you make sure you set export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0

You are not correct. The OP is using a different port on Spark, that maps to enp1s0f0np0.

Yes, export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0

But don’t you need to make sure each physical port (and each port’s twin) are on the same subnet?

Here’s how I have mine (left port)

# /etc/netplan/40-cx7.yaml
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      addresses:
        - 192.168.100.11/24
      dhcp4: no
      dhcp6: no              
      link-local: [ ipv4 ]   
      mtu: 9000
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no              
      link-local: [ ipv4 ]   
      mtu: 9000

Or if you wanted on right port:

# /etc/netplan/40-cx7.yaml
network:
  version: 2
  ethernets:
    enp1s0f1np1:
      addresses:
        - 192.168.200.13/24
      dhcp4: no
      dhcp6: no              
      link-local: [ ipv4 ]   
      mtu: 9000
    enP2p1s0f1np1:
      dhcp4: no
      dhcp6: no              
      link-local: [ ipv4 ]   
      mtu: 9000

Thanks @Keyper-AI

I see your configs are the same as the ones in eugr’s instructions. However, I’ve connected the cable to the left port on both Sparks, as seen when facing the back.
I followed the playbook, Connect Two Sparks | DGX Spark, and used the automatic IP assignment method, where they set enp1s0f0np0 and enp1s0f1np1, both in lower case. So now I’m scratching my head.

Even with the messed up addresses, it was sort of working. I could use llama.cpp with RPC, but not at the speed I was expecting.

I’m going to roll back and start again using the static IP method, as it seems this is what most people here are doing. Fingers crossed.

If you are just trying to get the left port connected (enp1s0f0np0/enP2p1s0f0np0) , then don’t set the right port in the configuration file (enp1s0f1np1/enP2p1s0f1np1).

The playbook is a little confusing because it lists them in a weird order. It also sets the twins (enP2) to their own IP, which I don’t get. When you do it that way, you get duplicates on the discovery.

enp1s0f0np0 = Left Port

enp1s0f1np1 = Right Port

If you were troubleshooting and happened to set any IP routes, make sure to either reboot or clear those out as well.

So I rebooted the Sparks and followed the instructions, but used the manual config with the adjustments to 40-cx7.yaml from eugr and Keyper-AI. Below are the steps in sequence.

node 1

network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no              # Explicitly disable DHCPv6
      link-local: [ ipv4 ]
      mtu: 9000
      addresses: [192.168.177.11/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: [ ipv4 ]
      mtu: 9000

node 2

network:
  version: 2
  ethernets:
    enp1s0f0np0:
      dhcp4: no
      dhcp6: no              # Explicitly disable DHCPv6
      link-local: [ ipv4 ]
      mtu: 9000
      addresses: [192.168.177.12/24]
    enP2p1s0f0np0:
      dhcp4: no
      dhcp6: no
      link-local: [ ipv4 ]
      mtu: 9000
      

For NCCL, set up the following environment variables

export CUDA_HOME="/usr/local/cuda" 
export MPI_HOME="/usr/lib/aarch64-linux-gnu/openmpi" 
export NCCL_HOME="$HOME/nccl/build/" 
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$CUDA_HOME/lib64/:$MPI_HOME/lib:$LD_LIBRARY_PATH"

$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0

NCCL test - now getting similar results to eugr, though I’m not sure why I’m still getting the “no authorization protocol specified” messages (errors?)

mpirun -np 2 -H 192.168.177.11:1,192.168.177.12:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf
  
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Warning: Permanently added '192.168.177.11' (ED25519) to the list of known hosts.
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified

# nccl-tests version 2.17.9 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   8727 on node1 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  10473 on node2 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
 17179869184    2147483648     float    none      -1   360035   47.72   23.86       0   354416   48.47   24.24       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.0478 
#
# Collective test concluded: all_gather_perf
#  

Note to self: don’t run the mpirun test at the same time on both sparks

IB test - not resolved, more research needed

$ ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

export NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0

$ ib_write_bw 192.168.177.12 -d rocep1s0f0 --report_gbits -q 4 -R --force-link IB
Unexpected CM event bl blka 8
 Unable to perform rdma_client function
 Unable to init the socket connection
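
One hint for that research: the trailing 8 in “Unexpected CM event bl blka 8” looks like a raw rdma_cm event code (that is an assumption about how the tool prints it). In librdmacm’s rdma_cm_event_type enum, 8 is RDMA_CM_EVENT_REJECTED, meaning the remote side refused the connection. The lookup below is a small illustrative helper, not part of any tool.

```shell
# Map a numeric rdma_cm event code to its name in librdmacm's
# rdma_cm_event_type enum (values start at 0 in declaration order)
rdma_cm_event_name() {
  case "$1" in
    0)  echo RDMA_CM_EVENT_ADDR_RESOLVED ;;
    1)  echo RDMA_CM_EVENT_ADDR_ERROR ;;
    2)  echo RDMA_CM_EVENT_ROUTE_RESOLVED ;;
    3)  echo RDMA_CM_EVENT_ROUTE_ERROR ;;
    4)  echo RDMA_CM_EVENT_CONNECT_REQUEST ;;
    5)  echo RDMA_CM_EVENT_CONNECT_RESPONSE ;;
    6)  echo RDMA_CM_EVENT_CONNECT_ERROR ;;
    7)  echo RDMA_CM_EVENT_UNREACHABLE ;;
    8)  echo RDMA_CM_EVENT_REJECTED ;;
    9)  echo RDMA_CM_EVENT_ESTABLISHED ;;
    10) echo RDMA_CM_EVENT_DISCONNECTED ;;
    11) echo RDMA_CM_EVENT_DEVICE_REMOVAL ;;
    12) echo RDMA_CM_EVENT_MULTICAST_JOIN ;;
    13) echo RDMA_CM_EVENT_MULTICAST_ERROR ;;
    14) echo RDMA_CM_EVENT_ADDR_CHANGE ;;
    15) echo RDMA_CM_EVENT_TIMEWAIT_EXIT ;;
    *)  echo UNKNOWN ;;
  esac
}

rdma_cm_event_name 8   # prints RDMA_CM_EVENT_REJECTED
```

ib_write_bw is a client/server test, so one possible cause (not confirmed in this thread) is that no ib_write_bw server with matching options was already listening on 192.168.177.12 when the client command was run.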