NCCL single-cable test caps at 100Gbps

Hi,

I’m connecting two Sparks with a 200GbE QSFP56 cable, following the playbooks. The Connect Two Sparks playbook works and I can see both ports are Up, but I’m having trouble with the NCCL for Two Sparks playbook.

After setting the environment variables and running the command below:
mpirun -np 2 -H 169.254.6.140:1,169.254.221.162:1 --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -x UCX_NET_DEVICES=enp1s0f0np0 -x NCCL_SOCKET_IFNAME=enp1s0f0np0 -x NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO $HOME/nccl-tests/build/all_gather_perf -b 16G -e 16G -f 2

The best avg. bus bandwidth I get is 15.9 GB/s, but I thought I should be getting 20+. So far I’ve tried:

  • applying a higher/lower MTU in the netplan config (setting both to 9000 instead of 1500 makes the bandwidth worse)
  • adding NCCL_IB_GID_INDEX=3 (which fails because there are only two GID indexes)
  • adding NCCL_IB_QPS_PER_CONNECTION=8
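For context, the MTU change above was applied through netplan. This is a sketch of that kind of config, assuming Ubuntu netplan and the Up interface names from the ibdev2netdev output further down; the file path is hypothetical:

```yaml
# /etc/netplan/<your-config>.yaml (hypothetical file name)
# Sets jumbo frames (MTU 9000) on the two Up RoCE ports; revert to 1500 to undo.
network:
  version: 2
  ethernets:
    enp1s0f0np0:
      mtu: 9000
    enP2p1s0f0np0:
      mtu: 9000
```

Applied with `sudo netplan apply` on both machines; both ends of the link need the same MTU.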

I also tested with ib_write_bw and hit the same issue: the highest speed is 100Gbps.

My ibdev2netdev output seems fine:

ibdev2netdev
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)

It’s as if whatever allows 200Gbps through a single cable isn’t working properly. How could I fix this?

Thank you!

You only need one cable to reach 200Gbps. Running the playbook with a second cable will limit your bandwidth to 100Gbps.

The current firmware has a performance regression in ConnectX-7 workloads. We are all seeing a drop from ~24 GB/s to ~16 GB/s on NCCL tests. NVIDIA is aware and is working on a fix.
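The numbers line up with the arithmetic, for what it’s worth. A quick sketch of the line-rate ceilings (real NCCL bus bandwidth always lands somewhat below these due to protocol overhead):

```shell
# Convert link line rate (Gb/s) to byte rate (GB/s): divide by 8.
awk 'BEGIN{print 200/8}'   # 200GbE ceiling: 25 GB/s
awk 'BEGIN{print 100/8}'   # 100GbE ceiling: 12.5 GB/s
```

So ~24 GB/s is close to the 200GbE ceiling, and the ~16 GB/s you and I are both seeing sits between the two, i.e. above what a single 100GbE link could deliver, which points at the firmware regression rather than a link stuck at 100Gbps.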


Yup, I’m using just one cable.

I see, thanks so much! Would that apply to ib_write_bw as well, though?

I believe so. I can test on my Sparks later today.