The above is the configuration of my server. Each server has 8 x 400G network cards connected to two 800G switches. When testing with ib_write_bw, I found that the speed slows down when two NICs share a NUMA node: with a single NIC the bandwidth is fine, but when the two NICs run at the same time it drops to 187Gb/s. I suspect this is caused by the shared NUMA node. What can I do to achieve 394Gb/s on each NIC? Here are my commands for ib_write_bw and nccl-test:
server: ib_write_bw -d mlx5_0 -q 5 --port 10002 --report_gbits --run_infinitely -F -n 2000
client: ib_write_bw -d mlx5_0 -q 5 --port 10002 --report_gbits --run_infinitely -F -n 2000 10.102.20.5
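I am also wondering whether pinning each ib_write_bw instance to the NIC's local NUMA node would help. A rough sketch of what I have in mind (the node number 0 below is only an assumption, the real value comes from sysfs):

# check which NUMA node the HCA is attached to
cat /sys/class/infiniband/mlx5_0/device/numa_node
# pin the server side to that node (assuming it printed 0)
numactl --cpunodebind=0 --membind=0 ib_write_bw -d mlx5_0 -q 5 --port 10002 --report_gbits --run_infinitely -F -n 2000
# pin the client side the same way on the other host
numactl --cpunodebind=0 --membind=0 ib_write_bw -d mlx5_0 -q 5 --port 10002 --report_gbits --run_infinitely -F -n 2000 10.102.20.5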
mpirun -np 16 -H 10.102.20.5:8,10.102.20.6:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 -x NCCL_NET_GDR_LEVEL=2 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x NCCL_IB_TIMEOUT=22 -x NCCL_PXN_DISABLE=0 -x NCCL_MIN_CTAS=4 -x LD_LIBRARY_PATH -x PATH -mca coll_hcoll_enable 0 -mca pml ob1 -mca btl_tcp_if_include eth0 -mca btl ^openib ./build/all_reduce_perf -b 1M -e 1G -n 1000 -g 1
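To confirm which NICs actually share a NUMA node before the NCCL run, this is how I planned to dump the mapping (just a quick check, output layout may differ on other systems):

# list the NUMA node of every mlx5 device
for d in /sys/class/infiniband/mlx5_*; do echo "$(basename "$d"): numa_node $(cat "$d"/device/numa_node)"; done
# nvidia-smi topo -m also shows the GPU/NIC/NUMA affinity matrix
nvidia-smi topo -m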