NCCL For 2 Sparks Setup - Errors?

The playbook is incomplete. If you set the variables the same way for real workloads, you will lose a lot on latency, because NCCL will be using Ethernet instead of InfiniBand.

To get the most out of the link, you need to set these two variables in addition to the ones you already set:

export NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1
export NCCL_IB_DISABLE=0

Use BOTH roceXXX devices that show as Up in ibdev2netdev. You will see NCCL test speeds increase to ~24 GB/s, but more importantly, latency will drop significantly, and you will see real improvements in real workloads (e.g. vLLM, torch.distributed, etc.).
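If you'd rather not copy the device names by hand, something like the following could build the NCCL_IB_HCA value from ibdev2netdev output. This is only a sketch: the helper name `up_hcas` is made up, and it assumes the usual `<device> port <n> ==> <netdev> (Up)` line format and that the devices you want start with `roce`.

```shell
# Hypothetical helper: read ibdev2netdev-style lines from stdin and print
# a comma-separated list of roce* devices whose link state is Up.
# Assumed line format: "<device> port <n> ==> <netdev> (Up|Down)"
up_hcas() {
    awk '$1 ~ /^roce/ && $NF == "(Up)" { print $1 }' | paste -sd, -
}

# On a real node (requires the ibdev2netdev tool to be installed):
#   export NCCL_IB_HCA="$(ibdev2netdev | up_hcas)"
#   export NCCL_IB_DISABLE=0
```

Check the result with `echo "$NCCL_IB_HCA"` before launching anything. If it comes out empty, no roce device was reported as Up.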

EDIT: to make sure it uses InfiniBand, you can set export NCCL_DEBUG=INFO and check in the logs which network transport NCCL reports (IB rather than Socket).
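A rough way to check this without reading the whole log is to pull out the "Using network ..." line. This is a sketch: the helper name `nccl_network` is made up, and it assumes your NCCL version prints that line with NCCL_DEBUG=INFO (the exact wording may differ between versions).

```shell
# Hypothetical helper: read an NCCL_DEBUG=INFO log from stdin and print the
# distinct network names found in "Using network <name>" lines.
# Assumption: the NCCL version in use logs that line; wording may vary.
nccl_network() {
    grep -o 'Using network [A-Za-z0-9_]*' | awk '{ print $3 }' | sort -u | paste -sd, -
}

# Usage: save the output of your run to a file first, then:
#   nccl_network < nccl.log
```

If it prints IB you are on the fast path. If it prints Socket, NCCL fell back to plain Ethernet, and the NCCL_IB_HCA / NCCL_IB_DISABLE settings are worth rechecking.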
