[SHARP] Error When Running nccl-tests with Multiple GPUs per Node using SHARP

I’m experiencing errors when attempting to run nccl-tests with SHARP enabled in multi-GPU per node configurations. I’m hoping to get some insights into the cause of these errors.

Environment

  • OS: Ubuntu 20.04.6 LTS on 4 servers, each with 2 × V100 GPUs and ConnectX-6 HCAs
  • HPC-X: hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64
  • SHARP
    • sharp_am: v3.8.0
    • plugin: hpcx-v2.20-gcc-mlnx_ofed-ubuntu20.04-cuda12-x86_64/nccl_rdma_sharp_plugin
  • nccl-tests: commit 9d26b8422ba76c098df996b96e13b8ddf3a71165

Summary and Question

I ran nccl-tests in three different cases and found that in Case 2, where multiple GPUs are used per node, SHARP initialization fails and Streaming Aggregation is disabled.

  • Case 1: 2 GPUs (2 nodes × 1 process/node × 1 GPU/process) … SHARP Available
  • Case 2: 4 GPUs (2 nodes × 1 process/node × 2 GPUs/process) … SHARP Error
  • Case 3: 4 GPUs (4 nodes × 1 process/node × 1 GPU/process) … SHARP Available

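Whether CollNet/SHARP actually gets picked up can also be confirmed with verbose NCCL logging, along these lines (a sketch; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL variables, and the grep is only a rough filter on the debug output):

# Sketch: re-run Case 2 with verbose NCCL logging to see whether CollNet/SHARP is selected
mpirun -n 2 --host snail01:1,snail02:1 \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH \
    -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH \
    /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2 2>&1 | grep -i -E 'collnet|sharp'
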
The error output from Case 2 is:

[snail01][Sep 24 16:15:29 847428][GENERAL][3745742][warn ] - Begin job id: 9370583446459220315 failed with status: No resource
[snail01:0:3745196 unique id 9370583446459220315][2024-09-24 16:15:29] ERROR Job error in sharp_get_job_data_len.

[snail01:0:3745196 - context.c:709][2024-09-24 16:15:29] ERROR sharp_get_job_data_len failed: Job error(-35)
[snail01:0:3745196 - context.c:718][2024-09-24 16:15:29] ERROR SHArP Job init error: No resource

Question: Why does SHARP get disabled with the No resource SHArP Job init error in Case 2, where multiple GPUs are used per node? I believe that with multiple GPUs per node, two SHARP jobs are created, and the failure occurs during the initialization of the second job, which might be the cause of the issue. Any insights would be appreciated!
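
One experiment I am considering to test this hypothesis (a sketch, not a verified fix): if the plugin opens one SHARP job per HCA rail, then restricting NCCL to a single HCA should bring it back down to a single job. NCCL_IB_HCA is a standard NCCL variable; mlx5_0 is only a placeholder device name here.

# Sketch: limit NCCL to one HCA so that (presumably) only one SHARP job is created
mpirun -n 2 --host snail01:1,snail02:1 \
    -x NCCL_COLLNET_ENABLE=1 -x NCCL_IB_HCA=mlx5_0 -x LD_LIBRARY_PATH \
    /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2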

Related: I believe this thread is related, but the information there did not resolve my issue.

Details

Case 1: 2 GPUs (2 nodes × 1 process/node × 1 GPU/process)

SHARP works without issues in this configuration:

mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 1 -b 64M -e 128M -f 2

Output (SHARP enabled):

# nThread 1 nGpus 1 minBytes 67108864 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3769265 on    snail01 device  0 [0x84] Tesla V100-PCIE-16GB
#  Rank  1 Group  0 Pid 2231269 on    snail02 device  0 [0x84] Tesla V100-PCIE-16GB
[snail01:0:3769265 - context.c:670][2024-09-24 16:17:01] INFO job (ID: 9370583448898977576) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail01:0:3769265 - context.c:867][2024-09-24 16:17:01] INFO sharp_job_id:1    resv_key: tree_type:LLT tree_idx:0  treeID:0 caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[snail01:0:3769265 - context.c:882][2024-09-24 16:17:01] INFO sharp_job_id:1    tree_type:SAT tree_idx:1  treeID:64 caps:0x16
[snail01:0:3769265 - comm.c:400][2024-09-24 16:17:01] INFO [group#:0] job_id:1 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[snail01:0:3769265 - comm.c:400][2024-09-24 16:17:01] INFO [group#:1] job_id:1 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108864      16777216     float     sum      -1   6891.2    9.74    9.74      0   6882.9    9.75    9.75      0
   134217728      33554432     float     sum      -1    13698    9.80    9.80      0    13689    9.80    9.80      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 9.77284 
#

Case 2: 4 GPUs (2 nodes × 1 process/node × 2 GPUs/process)

In this configuration, SHARP gets disabled with the following error:

mpirun -n 2 --host snail01:1,snail02:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2

Error:

[snail01][Sep 24 16:15:29 847428][GENERAL][3745742][warn ] - Begin job id: 9370583446459220315 failed with status: No resource
[snail01:0:3745196 unique id 9370583446459220315][2024-09-24 16:15:29] ERROR Job error in sharp_get_job_data_len.

[snail01:0:3745196 - context.c:709][2024-09-24 16:15:29] ERROR sharp_get_job_data_len failed: Job error(-35)
[snail01:0:3745196 - context.c:718][2024-09-24 16:15:29] ERROR SHArP Job init error: No resource

The full output is below.

# nThread 1 nGpus 2 minBytes 67108864 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3745196 on    snail01 device  0 [0x84] Tesla V100-PCIE-16GB
#  Rank  1 Group  0 Pid 3745196 on    snail01 device  1 [0x89] Tesla V100S-PCIE-32GB
#  Rank  2 Group  0 Pid 2230757 on    snail02 device  0 [0x84] Tesla V100-PCIE-16GB
#  Rank  3 Group  0 Pid 2230757 on    snail02 device  1 [0x89] Tesla V100S-PCIE-32GB
[snail01:0:3745196 - context.c:670][2024-09-24 16:15:29] INFO job (ID: 9370583448952299738) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail01:0:3745196 - context.c:867][2024-09-24 16:15:29] INFO sharp_job_id:1    resv_key: tree_type:LLT tree_idx:0  treeID:0 caps:0x6 quota:(osts:25 user_data_per_ost:1024 max_groups:25 max_qps:1 max_group_channels:1)
[snail01:0:3745196 - context.c:882][2024-09-24 16:15:29] INFO sharp_job_id:1    tree_type:SAT tree_idx:1  treeID:64 caps:0x16
[snail01:0:3745196 - comm.c:400][2024-09-24 16:15:29] INFO [group#:0] job_id:1 group id:0 tree idx:0 tree_type:LLT rail_idx:0 group size:2 quota: (osts:8 user_data_per_ost:1024) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[snail01:0:3745196 - comm.c:400][2024-09-24 16:15:29] INFO [group#:1] job_id:1 group id:0 tree idx:1 tree_type:SAT rail_idx:0 group size:2 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
[snail01:0:3745196 - context.c:670][2024-09-24 16:15:29] INFO job (ID: 9370583446459220315) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[snail01][Sep 24 16:15:29 847428][GENERAL][3745742][warn ] - Begin job id: 9370583446459220315 failed with status: No resource
[snail01:0:3745196 unique id 9370583446459220315][2024-09-24 16:15:29] ERROR Job error in sharp_get_job_data_len.

[snail01:0:3745196 - context.c:709][2024-09-24 16:15:29] ERROR sharp_get_job_data_len failed: Job error(-35)
[snail01:0:3745196 - context.c:718][2024-09-24 16:15:29] ERROR SHArP Job init error: No resource
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108864      16777216     float     sum      -1    13075    5.13    7.70      0    13175    5.09    7.64      0
   134217728      33554432     float     sum      -1    26096    5.14    7.71      0    26128    5.14    7.71      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.68984 
#

The performance is equivalent to when SHARP is not used.
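
For reference, a baseline without SHARP can be obtained by re-running the same command with CollNet disabled, for example (a sketch; NCCL_COLLNET_ENABLE=0 simply turns the CollNet/SHARP path off):

# Sketch: Case 2 with CollNet/SHARP explicitly disabled, used as the baseline
mpirun -n 2 --host snail01:1,snail02:1 \
    -x NCCL_COLLNET_ENABLE=0 -x LD_LIBRARY_PATH \
    /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2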

Case 3: 4 GPUs (4 nodes × 1 process/node × 1 GPU/process)

This setup works fine, with SHARP enabled:

mpirun -n 4 --host snail01:1,snail02:1,snail03:1,snail04:1 -x NCCL_COLLNET_ENABLE=1 -x LD_LIBRARY_PATH /data/nccl-tests/build/all_reduce_perf -t 1 -g 1 -b 64M -e 128M -f 2

SHARP requires specific hardware: at least an HDR IB switch and HDR IB HCAs. With HDR, only 1 SAT tree is available; to use multiple SAT trees, you need an NDR IB network.

I see you are using V100 GPUs; as far as I know, there is no NDR configuration that can be combined with V100 GPUs.
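
For reference, the negotiated link rate of each HCA can be checked with standard IB diagnostics; an HDR link reports a rate of 200 Gb/s (the grep below is only a rough filter on the ibstat output):

# Check the negotiated link rate on each node (HDR links show "Rate: 200")
ibstat | grep -E "^CA|Rate:"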

Thank you for the comment! Based on your explanation, it seems that when using 2 GPUs per node, multiple SATs are required, and the failure occurs due to the limitations of my environment. Is there any way to avoid the need for multiple SATs?

For example, I assume that if I set NCCL_ALGO=CollnetChain, SHARP would only be executed by the master rank of each node, so a single SAT should be sufficient. However, even when I set NCCL_ALGO=CollnetChain, the execution still fails.
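
For reference, this is roughly the invocation I tried (a sketch; the only change from the Case 2 command is the added NCCL_ALGO setting):

mpirun -n 2 --host snail01:1,snail02:1 \
    -x NCCL_COLLNET_ENABLE=1 -x NCCL_ALGO=CollnetChain -x LD_LIBRARY_PATH \
    /data/nccl-tests/build/all_reduce_perf -g 2 -b 64M -e 128M -f 2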

That depends on the NCCL topo graph and on how many communicators are created at init, not on the chain/direct selection. It also depends on how the HCA rails link to the switch (maybe your 2 GPU → HCA links land on the same switch chip).

We always suggest users leave NCCL_ALGO unset and let NCCL select the best algorithm.

Thank you for the clarification!

That depends on the NCCL topo graph and on how many communicators are created at init, not on the chain/direct selection.

I see. Is there any documentation available that provides more details on this? If possible, could you kindly point me to the relevant source code in the GitHub repository?

Also, my environment consists of 2 GPUs and 2 HCAs per server. Across the 4 nodes, all 8 GPUs are connected to a single NVIDIA Quantum Switch.

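For reference, which switch ports the HCAs land on can be checked with standard IB diagnostics such as iblinkinfo (the grep is only a rough filter on my hostnames):

# Sketch: list switch-to-HCA links and filter for the snail* nodes
iblinkinfo | grep -i snail
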
I plan to add another NVIDIA Quantum Switch to my environment soon. With that setup, I hope that being able to create multiple SATs will resolve the issue I am currently facing.