Error in "NCCL for Two Sparks" Playbook

Gentlefolk,

I’m trying to bring up my pair of DGX Spark boxes. I’ve succeeded in booting and installing updates on both machines. I’ve successfully executed the scripts to identify the QSFP/CX7 network. Now I am executing the recommended playbook: NCCL for Two Sparks.

On step 5,

# Set network interface environment variables (use your Up interface from the previous step)
export UCX_NET_DEVICES=enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f1np1
export OMPI_MCA_btl_tcp_if_include=enp1s0f1np1

# Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H <IP for Node 1>:1,<IP for Node 2>:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf

I receive the following error on both nodes:

-bash: IP: No such file or directory

As this is not the stack I intend to use on this machine, I am loath to debug the problem. That said, if there is an easy fix, such as creating a directory, please feel free to enlighten me, an NCCL noob.

Anon,
Andrew

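The error occurs because <IP for Node 1> and <IP for Node 2> are placeholders that you need to replace with the actual IP addresses. Left as they are, bash treats the < as input redirection and tries to open a file named IP, which is where "No such file or directory" comes from. Assuming you have completed the previous playbook, you can run this command on both Sparks: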

ibdev2netdev

This will give you something like this:

roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)
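
If you would rather not pick the Up lines out by eye, a small filter can do it. This is only a sketch that assumes the output format shown above; up_ifaces is a made-up helper name, not part of the playbook or of ibdev2netdev.

```shell
# Hypothetical helper: print the network interface name from each "(Up)"
# line of ibdev2netdev output read on standard input.
up_ifaces() {
  # The last field is the link state; the field before it is the interface.
  awk '$NF == "(Up)" { print $(NF - 1) }'
}

# On a Spark you would run:
#   ibdev2netdev | up_ifaces
```

For the sample output above, this prints enP2p1s0f1np1 and enp1s0f1np1, one per line.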

Note the interfaces marked Up, then run this on each Spark:

  ip addr show enp1s0f1np1

Run on each Spark, this gives you that node's IP address (the inet line of the output). Let’s say they are 192.168.100.1 and 192.168.100.2. Then you need to run the command like this:


# Run the all_gather performance test across both nodes (replace the IP addresses with the ones you found in the previous step)
mpirun -np 2 -H 192.168.100.1:1,192.168.100.2:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_gather_perf
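
One way to guard against a leftover placeholder is to keep the addresses in variables and check them before launching. The following is a sketch only, not part of the playbook: NODE1_IP, NODE2_IP, is_ipv4, and run_all_gather are names made up for illustration, and the default addresses are the example values assumed above.

```shell
# Hypothetical variables; substitute the addresses reported by `ip addr show`.
NODE1_IP="${NODE1_IP:-192.168.100.1}"
NODE2_IP="${NODE2_IP:-192.168.100.2}"

# Succeeds only for four dot-separated decimal numbers, each 0 to 255.
is_ipv4() {
  printf '%s\n' "$1" | awk -F. '
    NF != 4 { exit 1 }
    {
      for (i = 1; i <= 4; i++)
        if ($i !~ /^[0-9]+$/ || length($i) > 3 || $i + 0 > 255) exit 1
    }
  '
}

# Check both addresses, then launch the same mpirun command as above.
run_all_gather() {
  for ip in "$NODE1_IP" "$NODE2_IP"; do
    if ! is_ipv4 "$ip"; then
      echo "Not an IPv4 address: '$ip' (is a placeholder still in place?)" >&2
      return 1
    fi
  done
  mpirun -np 2 -H "$NODE1_IP:1,$NODE2_IP:1" \
    --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
    -x LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
    "$HOME/nccl-tests/build/all_gather_perf"
}
```

After setting NODE1_IP and NODE2_IP to your own addresses, calling run_all_gather either starts the test or stops with a readable message instead of the bare "IP: No such file or directory".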

Mr. Amorim,

Confirmed.

That said, I am disappointed in NVIDIA for such a primitive script and uninformative error message. Please pass this failure back to the documentation, Product Management, and development teams. Yes, supercomputing is hard. Don’t make it harder, and hamper adoption of your advanced technology, with poor scripting practices.

Again, thank you for your rapid and clear response.

Anon,
Andrew

I’m not part of the NVIDIA team, Andrew, just a regular forum member like you. The people there are regular people like you and me who happen to be working on a product we spent a lot of money to acquire, and they’re making progress and improving. Believe me, we’ve been through a lot since mid-October.
Things will work out.

Hi @nvidia3453, I understand that debugging unfamiliar solutions can be frustrating. However, the NCCL playbook does explain that you will need to find your specific NIC IPs, as these will differ for every unit if assigned automatically, and to use them for this part of the playbook.
If you see any other problems please don’t hesitate to reach out.
