MPI only using one port on dual-port IB NIC

I have a test setup with two HGX A100 nodes.
Each node contains four MCX653106A-ECAT-SP cards connected via splitter cables, giving 8 links per node into 4 ports on an MQM8700. All 8 ports on both nodes are active at 100 Gb/s and have been measured at roughly line rate.

When launching MPI like this:
./mpirun -np 2 --host 10.0.99.245,10.0.99.246 -x NCCL_P2P_LEVEL=PXB singularity exec /env/nvidia_pytorch_22.08.sif ../nccl-tests2/build/all_reduce_perf -g 8 -b 32M -e 2048M -t 1 -n 200 -w 10 -f 2

The switch shows that only one port per card is being used. I then tried:
./mpirun -np 2 --host 10.0.99.245,10.0.99.246 -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 singularity exec /env/nvidia_pytorch_22.08.sif ../nccl-tests2/build/all_reduce_perf -g 8 -b 32M -e 2048M -t 1 -n 200 -w 10 -f 2

This still results in only mlx5_0, mlx5_2, mlx5_6, and mlx5_8 being used.
This command:
./mpirun -np 2 --host 10.0.99.245,10.0.99.246 -x NCCL_IB_HCA=mlx5_1,mlx5_3,mlx5_7,mlx5_9 singularity exec /env/nvidia_pytorch_22.08.sif ../all_reduce_perf -g 8 -b 32M -e 2048M -t 1 -n 200 -w 10 -f 2
It shows the same bandwidth and routes traffic over the other ports, indicating that all ports work correctly.

When running with verbose output enabled, it shows that GPU0 and GPU1 both find mlx5_0 to be the best option. Is there a way to set some affinity so that GPU0 runs over mlx5_0, GPU1 over mlx5_1, and so on?
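(Verbose output here means NCCL's own logging, enabled with something along the lines of -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH added to the mpirun line; the init and graph subsystems include the HCA selection in their output.)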

Hello Bryon,

Have you attempted using a preceding '=', as noted in the NCCL_IB_HCA section of the Environment Variables — NCCL 2.15.5 documentation? Without the '=' prefix, the entries are treated as name prefixes rather than exact names, so the match may not be performed as you expect.
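For example, reusing your command, the only change would be the extra '=' after the first one, so that the value passed to NCCL starts with '=' and the HCA list is matched exactly rather than by prefix. Something like:

./mpirun -np 2 --host 10.0.99.245,10.0.99.246 -x NCCL_IB_HCA==mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 singularity exec /env/nvidia_pytorch_22.08.sif ../nccl-tests2/build/all_reduce_perf -g 8 -b 32M -e 2048M -t 1 -n 200 -w 10 -f 2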

We would also recommend experimenting with the port specifier argument.
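The port specifier uses the device:port form described in the same documentation section, e.g. (a sketch only, assuming the links you want are on port 1 of each listed device):

-x NCCL_IB_HCA==mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1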

If this does not succeed, it may be best to open an issue on the NCCL GitHub, or engage our support team by creating a ticket via our portal at ESPCommunity for further assistance.

Best,
NVIDIA Technical Support
