How to configure switches for running NCCL tests


I’m trying to run the NCCL tests on a 2-node setup, but they fail with errors. I can’t find any documentation on how to configure the server NICs and the switches (2 LEAF and 1 SPINE, with 1 node connected to each LEAF). I can run pretest successfully if I configure all the NICs and switches in L3 mode (/31 point-to-point links). What I don’t understand is how NCCL sends traffic in L3 mode versus L2 mode. Can you please help me with this?


2 servers, each with 8 H100 GPUs and 8 ConnectX NICs
Each server connected to a LEAF
Both LEAF switches are connected to a SPINE
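
For reference, the /31 point-to-point addressing mentioned above can be sketched roughly as follows. This is only an illustration: the `10.0.0.0/24` base block, the `p2p_plan` helper, and the choice of which end of each link gets which address are all hypothetical, not the values actually configured.

```python
import ipaddress

# Hypothetical values: the 10.0.0.0/24 base block is illustrative only,
# not the addressing actually configured on the NICs and switches.
BASE_BLOCK = ipaddress.ip_network("10.0.0.0/24")
SERVERS = 2          # from the setup described above
NICS_PER_SERVER = 8  # from the setup described above


def p2p_plan(base, servers, nics_per_server):
    """Assign one /31 per NIC-to-LEAF link; a /31 holds exactly two addresses."""
    links = base.subnets(new_prefix=31)
    plan = []
    for server in range(servers):
        for nic in range(nics_per_server):
            net = next(links)
            plan.append({
                "server": server,
                "nic": nic,
                "subnet": str(net),
                # Which end gets which address is an arbitrary assumption.
                "nic_ip": str(net[0]),
                "leaf_ip": str(net[1]),
            })
    return plan


if __name__ == "__main__":
    for link in p2p_plan(BASE_BLOCK, SERVERS, NICS_PER_SERVER):
        print(link)
```

With this kind of layout every NIC sits in its own subnet, so traffic between the two servers has to be routed through the LEAF and SPINE switches rather than bridged.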

GPU Type: H100
