NCCL For 2 Sparks Setup - Errors?

Hi All,

I have been trying to configure a pair of Sparks in a new installation to talk to each other and sync via NCCL. I have completed the first playbook, ‘Connect two Sparks’. I am now trying to complete the second playbook, ‘NCCL for Two Sparks’ ( NCCL for Two Sparks | DGX Spark ).

I have read the text there and also numerous threads in the forums, but no matter what I have changed, I always get the same or similar result. I am seeing the error output:

WARNING: An invalid value was given for btl_tcp_if_include. This
value will be ignored.

Local host: spark-8ad8
Value: enp1s0f0np0
Message: Unknown interface name

The instructions say to run the mpirun command on both Sparks, but they are not clear about which port identifiers should be used from which Spark when running the command on each unit.

In any case, I have now tried every possible variation, and I always get the same format of error. It is as if the system expects the same port name to be present on both boxes, when I have different port names on each box (which I think aligns with the instructions on build.nvidia ( NCCL for Two Sparks | DGX Spark )).

Spark 1:

ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)

Spark 2:

rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

So I take from this output that the device on Spark 1 is: enp1s0f1np1
and Spark 2 is: enP2p1s0f0np0

However, when I run the mpirun command, I always see the aforementioned error, complaining of an incorrect Ethernet device on one Spark or the other.

export PORT_NAME=enp1s0f0np0
export UCX_NET_DEVICES=$PORT_NAME
export NCCL_SOCKET_IFNAME=$PORT_NAME
export OMPI_MCA_btl_tcp_if_include=$PORT_NAME
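For reference, this is roughly how I am feeding those variables into the launch. The host names, SSH port, and nccl-tests binary path below are placeholders for my setup, not values from the playbook:

```shell
# Hypothetical two-node launch: spark-1/spark-2 and the binary path
# are placeholders; plm_rsh_args passes the non-default SSH port.
export PORT_NAME=enp1s0f0np0
mpirun -np 2 -H spark-1,spark-2 \
  --mca plm_rsh_args "-p 2222" \
  -x UCX_NET_DEVICES=$PORT_NAME \
  -x NCCL_SOCKET_IFNAME=$PORT_NAME \
  -x OMPI_MCA_btl_tcp_if_include=$PORT_NAME \
  ./build/all_reduce_perf -b 1G -e 4G -f 2 -g 1
```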

Running commands only on Spark 2:
If I use ‘enp1s0f0np0’ for PORT_NAME, I get an ‘unknown interface’ error relating to Spark 1. If I use the other port name, I get the same error for Spark 2. So essentially, I always see an error.

I have the added complexity of having changed the SSH port on Spark 2 so that I could access it remotely, separately from Spark 1, but I have been able to supply the port number to mpirun to overcome that where needed.

I tested unsetting the if_include variable completely and then got a different result: running the command on Spark 1 resulted in it saying it had identified a potential PID on Spark 2 but couldn’t sync. When I ran top on Spark 2, I could see the named PID consuming 100% CPU. When I killed that task, Spark 1 detected it and stopped its own processing.

So there is a connection of sorts - but not proper syncing.

I am not in the same physical location as the units, so I can’t easily change the connector cable. My colleague tells me that the cable is connected such that the two sockets closest to each other on the devices are joined. Could it be that one of the sockets needs to be swapped on one unit?

Any other ideas?

Thanks

The error is not that NCCL is broken. It’s Open MPI saying: “You told me to only use interface enp1s0f0np0, but on one of the nodes that interface is effectively not usable / not present in the way I expect.”

Your ibdev2netdev output shows the cable is plugged into port 1 on Spark 1 and port 0 on Spark 2; in other words, the units are cross-cabled. The NVIDIA docs and NVIDIA engineer comments all assume the cable connects the same physical port on both Sparks, so the same interface name is “up” on both boxes. Once you see that, all the weirdness suddenly makes sense.

The simplest correction is to fix the physical cabling: connect inner cage to inner cage, or outer to outer.
Then run ibdev2netdev again on both systems and confirm that the same interface name shows as Up on both.
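To make that comparison mechanical, a small helper like the one below (my own sketch, not from the playbook) pulls out just the Up netdev names from ibdev2netdev output, so you can eyeball whether the two Sparks agree:

```shell
# Extract the netdev name of each port that ibdev2netdev reports as Up.
# ibdev2netdev lines look like: "rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)"
up_ifaces() {
  awk '/\(Up\)/ { print $5 }'
}

# Example using the Spark 2 output quoted in this thread:
printf '%s\n' \
  'rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)' \
  'rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)' | up_ifaces
# prints: enp1s0f0np0
```

On the real machines you would run `ibdev2netdev | up_ifaces` on each Spark; after re-cabling, both should print the same interface names.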

If you really can’t touch the cable right now, you can hack around the cross-cabling, but it’s ugly:

You would need different env vars per node:

On Spark 1: MN_IF_NAME=enp1s0f1np1
On Spark 2: MN_IF_NAME=enp1s0f0np0

And you’d have to avoid any use of btl_tcp_if_include that enforces the same value on both nodes (because mpirun -x propagates one value everywhere).
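One way to get per-node values in a single launch is Open MPI’s MPMD syntax, where each colon-separated app context carries its own `-x` settings. The host names and binary path below are assumptions; the interface names are the Up ones from the ibdev2netdev output above:

```shell
# Hedged sketch of a cross-cabled workaround via MPMD app contexts:
# Spark 1 uses enp1s0f1np1, Spark 2 uses enp1s0f0np0.
# btl_tcp_if_include is deliberately left unset, since no single
# value would be valid on both nodes.
mpirun \
  -H spark-1 -np 1 \
    -x UCX_NET_DEVICES=enp1s0f1np1 \
    -x NCCL_SOCKET_IFNAME=enp1s0f1np1 \
    ./build/all_reduce_perf -b 1G -e 4G -f 2 -g 1 \
  : \
  -H spark-2 -np 1 \
    -x UCX_NET_DEVICES=enp1s0f0np0 \
    -x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
    ./build/all_reduce_perf -b 1G -e 4G -f 2 -g 1
```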

But honestly, given that the official docs, NVIDIA engineers’ comments, and community posts all assume same-slot cabling, the clean fix is to swap the cable so that both Sparks use the same physical port, then follow the standard playbook.


Aha, that’s great info, thanks. I had only seen the playbook pages on the topic of linking two Sparks, and I do not recall those pages mentioning that specific physical ports have to be used in the correct pairing. The cable/port has been changed, and now the NCCL test passes with a speed of 22.93 GB/s.

As far as I am aware, that is around 92% of the theoretical max speed for these ConnectX-7 ports, so I’ll tick this test as passed and move on to the next step of the journey.

Thanks so much for your help.

Glad to help @ura6, the speed looks good!
Also check @eugr’s comment below (NCCL For 2 Sparks Setup - Errors? - #7 by eugr) on how to improve speed a little and fix latency to get the most out of your cluster.


The playbook is incomplete, and if you set the variables the same way when using it for real workloads, you will lose a lot on latency, because it will be using Ethernet (TCP) instead of InfiniBand.

To get the most, you need to set these two in addition to the ones you already set:

export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0

Use BOTH roceXXX devices that show as Up in ibdev2netdev. You will see NCCL test speeds increase to ~24 GB/s, but most importantly, the latency will be way down, and you will see real improvements in real workflows (e.g. vLLM, torch distributed, etc.).
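Putting both posts together, the per-node environment for real workloads would look something like this. The interface and HCA names are the ones from this thread; substitute whatever ibdev2netdev reports as Up on your own boxes:

```shell
# Sketch of a combined per-node setup (names from this thread, not universal):
# TCP/bootstrap interface: the netdev that is Up on this node
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export UCX_NET_DEVICES=enp1s0f1np1
# RDMA data path: both RoCE devices that show Up in ibdev2netdev
export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0
```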

EDIT: to make sure it uses InfiniBand, you can set export NCCL_DEBUG=INFO and read the logs.
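If you capture that NCCL_DEBUG=INFO output to a file, a quick grep tells you which transport NCCL picked. The log line shapes below are what I have seen from NCCL and may vary between versions, so treat the patterns as a starting point:

```shell
# Report which transport NCCL chose, given a log captured with
# NCCL_DEBUG=INFO (e.g. `mpirun ... 2>&1 | tee nccl.log`).
check_transport() {
  if grep -q 'NET/IB' "$1"; then
    echo "using InfiniBand/RoCE"
  elif grep -q 'NET/Socket' "$1"; then
    echo "fell back to plain TCP sockets"
  else
    echo "transport not found in log"
  fi
}

# Usage: check_transport nccl.log
```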


Oh, that’s great info - thanks a lot - bookmarked.
