Hi All,
I have been trying to configure a pair of Sparks in a new installation to talk to each other and sync via NCCL. I have completed the first playbook, ‘Connect two Sparks’. I am now trying to complete the second playbook, ‘NCCL for Two Sparks’ (NCCL for Two Sparks | DGX Spark).
I have read the text there and also numerous threads in the forums, but no matter what I have changed, I always get the same or similar result. I am seeing the error output:
WARNING: An invalid value was given for btl_tcp_if_include. This
value will be ignored.
Local host: spark-8ad8
Value: enp1s0f0np0
Message: Unknown interface name
The instructions say to run the mpirun command on both Sparks, but it is not clear which interface name should be used on which Spark when running the command on each unit.
In any case, I have now tried every possible variation and always get the same form of error. It is as if the system expects the same port name to be present on both boxes, when I have different port names on each box (which I think aligns with the instructions on build.nvidia: NCCL for Two Sparks | DGX Spark).
Spark 1:
ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
Spark 2:
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
So I take from this output that the device to use on Spark 1 is enp1s0f1np1, and on Spark 2 it is enP2p1s0f0np0.
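Is this the right way to confirm which interface actually carries the point-to-point link on each unit, i.e. something like the following? (The 192.168.x.x address is just a placeholder for whatever the playbook assigned.)
# On each Spark, list interfaces with their state and addresses
ip -br addr show
# Confirm the link address sits on the interface I expect, e.g. on Spark 1:
ip addr show enp1s0f1np1
# Then ping the other Spark's link address over that interface (placeholder address)
ping -c 3 -I enp1s0f1np1 192.168.100.12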
However, when I run the mpirun command, I always see the aforementioned error, complaining about an unknown Ethernet interface on one or other of the Sparks.
export PORT_NAME=enp1s0f0np0
export UCX_NET_DEVICES=$PORT_NAME
export NCCL_SOCKET_IFNAME=$PORT_NAME
export OMPI_MCA_btl_tcp_if_include=$PORT_NAME
Running commands only on Spark 2:
If I use ‘enp1s0f0np0’ for PORT_NAME, then I get an error complaining about an unknown interface on Spark 1. If I use the other port name, then I get the same error for Spark 2. So essentially, I always see an error.
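Since the interface names differ between the two boxes, would it make sense to give the variables a comma-separated list covering both names, and use the subnet in CIDR form for the Open MPI include so the name mismatch stops mattering? Something like this (the 192.168.100.0/24 subnet is just a placeholder for whatever the link is actually using):
# List both Sparks' active interface names; NCCL treats these as prefixes to match
export NCCL_SOCKET_IFNAME=enp1s0f1np1,enP2p1s0f0np0
# UCX also accepts a comma-separated device list (not sure how it handles the absent one)
export UCX_NET_DEVICES=enp1s0f1np1,enP2p1s0f0np0
# Open MPI's btl_tcp_if_include can take a subnet instead of interface names
export OMPI_MCA_btl_tcp_if_include=192.168.100.0/24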
There is the added complexity that I changed the SSH port on Spark 2 so that I could access it remotely, separately from Spark 1, but I have been able to supply the port number to mpirun to overcome that where needed.
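In case it matters, this is roughly how I have been passing the non-standard port to mpirun (2222 is a placeholder for the real port); please correct me if there is a better way:
# Tell Open MPI's ssh launcher which port to use when reaching the remote Spark
mpirun --mca plm_rsh_args "-p 2222" ...
# (or equivalently, a Host entry with "Port 2222" in ~/.ssh/config on Spark 1)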
I tested unsetting the if_include variable completely and then got a different result: running the command on Spark 1 resulted in it saying it had identified a potential PID on Spark 2 but couldn’t sync. When I ran top on Spark 2, I could see the named PID was consuming 100% CPU. When I killed that task, Spark 1 detected it and stopped its own processing.
So there is a connection of sorts - but not proper syncing.
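Would enabling NCCL's debug output help show what each rank is actually selecting? I was thinking of adding something like this before running the mpirun command:
# Ask NCCL to log which network interface/transport it picks during init
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET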
I am not in the same physical location as the units, so I can’t easily change the connector cable. My colleague tells me that the cable connects the two sockets that are closest to each other on the devices. Could it be that the cable needs to be moved to the other socket on one unit?
Any other ideas?
Thanks