I have a test setup with two HGX A100 nodes.
Each node contains 4 MCX653106A-ECAT-SP cards connected with splitter cables (8 links to 4 ports on an MQM8700). All 8 ports on both nodes are active at 100 Gb/s and have been measured at roughly line rate.
When launching MPI like this:
./mpirun -np 2 --host 10.0.99.245,10.0.99.246 -x NCCL_P2P_LEVEL=PXB singularity exec /env/nvidia_pytorch_22.08.sif ../nccl-tests2/build/all_reduce_perf -g 8 -b 32M -e 2048M -t 1 -n 200 -w 10 -f 2
the switch shows that only one port on each card gets used. I then tried:
./mpirun -np 2 --host 10.0.99.245,10.0.99.246 -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 singularity exec /env/nvidia_pytorch_22.08.sif ../nccl-tests2/build/all_reduce_perf -g 8 -b 32M -e 2048M -t 1 -n 200 -w 10 -f 2
This still results in only mlx5_0, mlx5_2, mlx5_6 and mlx5_8 being used.
This command:
./mpirun -np 2 --host 10.0.99.245,10.0.99.246 -x NCCL_IB_HCA=mlx5_1,mlx5_3,mlx5_7,mlx5_9 singularity exec /env/nvidia_pytorch_22.08.sif ../all_reduce_perf -g 8 -b 32M -e 2048M -t 1 -n 200 -w 10 -f 2
shows the same bandwidth and routes traffic over the other ports, indicating that all ports work correctly.
When running verbose, the output shows that GPU0 and GPU1 both pick mlx5_0 as the best option. Is there a way to set some affinity so that GPU0 runs over mlx5_0, GPU1 over mlx5_1, and so on?
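For anyone reproducing this, here is a minimal sketch of what a verbose run could look like. It assumes "verbose" means NCCL's own logging through the `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` environment variables (my assumption, not something stated above), and it reuses the hosts, image and benchmark arguments from the first command. The script only prints the command so it can be checked before launching.

```shell
#!/bin/sh
# Sketch only: builds a verbose variant of the first command and prints it.
# Hosts, image path and benchmark arguments are copied from the post.
HOSTS="10.0.99.245,10.0.99.246"
IMAGE="/env/nvidia_pytorch_22.08.sif"
BENCH="../nccl-tests2/build/all_reduce_perf -g 8 -b 32M -e 2048M -t 1 -n 200 -w 10 -f 2"

# Assumption: verbosity comes from NCCL's debug variables.
# NCCL_DEBUG=INFO turns on informational logging; NCCL_DEBUG_SUBSYS limits
# it to init, network and topology-graph messages.
NCCL_ENV="-x NCCL_P2P_LEVEL=PXB -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH"

CMD="./mpirun -np 2 --host $HOSTS $NCCL_ENV singularity exec $IMAGE $BENCH"

# Print instead of executing, so the command can be reviewed first.
echo "$CMD"

# To see how each GPU is connected to each NIC on a node, run there:
#   nvidia-smi topo -m
```

The NIC selection shows up in the INIT/NET/GRAPH messages of the resulting log, and comparing it with the `nvidia-smi topo -m` matrix should show whether the two ports of one card appear at the same PCIe distance from a given GPU.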