Following the instructions in the Mellanox documentation, I have successfully installed MOFED and Mellanox SHARP on 4 GPU servers. However, when I use Open MPI to run osu_allreduce with and without Mellanox SHARP, the performance is essentially the same. My commands and results are below.
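(For reference, the MOFED build on each node can be confirmed with the standard version query below; I am omitting my own output here.)
$ ofed_info -s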
Without SHARP:
$ mpirun -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 --allow-run-as-root --bind-to core --map-by node -x LD_LIBRARY_PATH -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_0:1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x HCOLL_MAIN_IB=mlx5_0:1 ~/hpcx-v2.8.0-gcc-MLNX_OFED_LINUX-5.2-1.0.4.0-redhat8.3-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/osu_allreduce -m 0:134217728
# OSU MPI Allreduce Latency Test v5.6.2
Size Avg Latency(us)
…
1048576 1441.46
2097152 2478.80
4194304 4136.40
8388608 9056.91
16777216 19946.72
33554432 46106.70
67108864 92746.40
134217728 185052.29
With SHARP:
$ mpirun -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 --allow-run-as-root -bind-to core --map-by node -x LD_LIBRARY_PATH -mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_0:1 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=ib0 -x HCOLL_MAIN_IB=mlx5_0:1 -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_LOG_LEVEL=3 -x HCOLL_ENABLE_SHARP=3 ~/hpcx-v2.8.0-gcc-MLNX_OFED_LINUX-5.2-1.0.4.0-redhat8.3-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2/osu_allreduce -m 0:134217728
# OSU MPI Allreduce Latency Test v5.6.2
Size Avg Latency(us)
…
1048576 1339.99
2097152 2293.31
4194304 4241.36
8388608 9110.42
16777216 19284.43
33554432 45360.29
67108864 90132.80
134217728 179329.17
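Is there a recommended way to confirm that SHARP trees are actually being built for these runs? The only check I know of is the basic connectivity test that ships with the SHARP package (the path assumes the HPC-X 2.8.0 layout used in the commands above, and the exact flags may differ between SHARP versions):
$ $HPCX_SHARP_DIR/bin/sharp_hello -d mlx5_0:1 -v 3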
Moreover, when I run distributed training across 4 GPUs on the ResNet50 benchmark with and without SHARP, the same situation occurs:
Without SHARP:
$ mpirun --allow-run-as-root --tag-output -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x HPCX_SHARP_DIR -x LD_LIBRARY_PATH ~/miniconda3/envs/tuantd/bin/python ~/horovod_repo/examples/pytorch/pytorch_synthetic_benchmark.py
[1,0]:Model: resnet50
[1,0]:Batch size: 32
[1,0]:Number of GPUs: 4
[1,0]:Running warmup…
[1,0]:Running benchmark…
[1,0]:Img/sec per GPU: 242.9 ±1.7
[1,0]:Total img/sec on 4 GPU(s): 971.7 ±6.7
With SHARP:
$ mpirun --allow-run-as-root --tag-output -np 4 -H 192.168.67.228:1,192.168.67.229:1,192.168.67.230:1,192.168.67.231:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include ib0 -x NCCL_SOCKET_IFNAME=ib0 -x ENABLE_SHARP_COLL=3 -x SHARP_COLL_ENABLE_SAT=3 -x NCCL_COLLNET_ENABLE=3 -x NCCL_ALGO=CollNet -x HPCX_SHARP_DIR -x LD_LIBRARY_PATH ~/miniconda3/envs/tuantd/bin/python ~/horovod_repo/examples/pytorch/pytorch_synthetic_benchmark.py
[1,0]:Model: resnet50
[1,0]:Batch size: 32
[1,0]:Number of GPUs: 4
[1,0]:Running warmup…
[1,0]:Running benchmark…
[1,0]:Img/sec per GPU: 237.0 ±21.2
[1,0]:Total img/sec on 4 GPU(s): 948.1 ±84.8
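For the Horovod run I can repeat the benchmark with NCCL debug output enabled and filter rank 0's log for CollNet/SHARP lines, roughly like this (sketch only, same mpirun options as above; I have not included that output here):
$ mpirun <same options as above> -x NCCL_DEBUG=INFO ~/miniconda3/envs/tuantd/bin/python ~/horovod_repo/examples/pytorch/pytorch_synthetic_benchmark.py 2>&1 | grep -iE 'collnet|sharp'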
Can you guys help me figure out why enabling SHARP makes no measurable difference?
Best regards,
Tuan