[SHARP] error in sharp_connect_tree

Hi,

I am encountering SHARP-related error when testing gromacs/21.3
System: AMD EPYC 7543 and 8 x A100-SXM-80GB
OS: CentOS 7.9.2009 with 3.10.0-1160 kernel
Env: gcc/10.2, cuda/11.4, openmpi 4.1.1 built against both UCX and HCOLL

$ ompi_info
Configure command line: 
    '--with-ucx=/apps/common/ucx/1.11.2'
    '--with-ucx-libdir=/apps/common/ucx/1.11.2/lib'
    '--with-hcoll=/opt/mellanox/hcoll'

[Problem description]

  • Error only appears when running gromacs on more than 3 nodes.

    Nevertheless, calculations proceeded to the end without crashing.

# [gpu30:0:19792 - context.c:763] INFO sharp_job_id:469  tree_type:LLT tree_idx:0  treeID:0x0   caps:0x6 quota:(osts:167 user_data_per_ost:1024 max_groups:167 max_qps:64 max_group_channels:1)
# [gpu30:0:19792 - context.c:767] INFO sharp_job_id:469  tree_type:SAT tree_idx:1  treeID:0x3f  caps:0x16
# [gpu30:0:19792 unique id 47] ERROR AN MAD error in sharp_connect_tree.
# [gpu30:0:19792 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)
# [gpu32:128:61292 unique id 12] ERROR AN MAD error in sharp_connect_tree.
# [gpu32:128:61292 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)
# [gpu31:64:5267 unique id 10] ERROR AN MAD error in sharp_connect_tree.
# [gpu31:64:5267 - comm.c:31] ERROR sharp_connect_tree failed: AN MAD error(-18)
  • mpirun options used during tests
mpirun  
    --map-by numa:PE=8                         
    --mca pml ucx                                     
    --mca coll_hcoll_enable 1                
    -x OMP_NUM_THREADS=8              
    -x HCOLL_ENABLE_SHARP=3        
    -x SHARP_COLL_ENABLE_SAT=1  
    -x SHARP_COLL_LOG_LEVEL=3     
    -x UCX_TLS=dc,sm,self                   
    gmx_mpi                                           
    ...
  • sharpd status (on gpu30)
sharpd.service - SHARP Daemon (sharpd). Version: 2.5.1.MLNX20210812.e3c2616
   Loaded: loaded (/etc/systemd/system/sharpd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sharpd.service.d
           └─Service.conf
   Active: active (running) since Fri 2022-03-25 13:37:01 KST; 2 days ago
 Main PID: 3930 (sharpd)
    Tasks: 6
   Memory: 61.3M
   CGroup: /system.slice/sharpd.service
           └─3930 /opt/mellanox/sharp/bin/sharpd -P -O -/etc/sharp/sharpd.cfg

  • sharpd log (on gpu30)
[Mar 28 09:35:40 863847][SD][3934][info ] - waiting for QPAlloc alloc QP response from AN
[Mar 28 09:35:46 877923][SD][3934][info ] - QPAlloc alloc QP response status: 0x6e, mad status: 0x0
[Mar 28 09:35:46 877969][SD][3934][error] - recv AM QPAlloc alloc QP MAD failed 0x6e
[Mar 28 09:35:46 877995][SD][3934][info ] - connect tree QPN slot 0 QPN 0x0
[Mar 28 09:35:46 878001][SD][3934][info ] - connect tree job ID 2884894721 tree ID 63 local QPN 0x11299 AN QPN 0x0 status 18
[Mar 28 09:35:46 878082][SD][3930][info ] - read 8 message length 32 read count 1 opcode 0x90 TID 0x21
[Mar 28 09:35:46 900397][SD][3930][info ] - receiving from client 4
[Mar 28 09:35:46 900417][SD][3930][info ] - client 4 read 40 message length 64 read count 1 opcode 0xc TID 0x22
[Mar 28 09:35:46 900421][SD][3930][info ] - SHARPD_OP_LEAVE_GROUP TID 0x22
[Mar 28 09:35:46 900441][SD][3934][info ] - leave group ID 0x21 tree ID 0 AN QPN 0x82d911
[Mar 28 09:35:46 900456][SD][3934][info ] - leave AN LID 21 group ID 0x21 PKey 0xffff MTU 4 rate 16 SL 0 PLL 18 from tree ID 0 PathRecord
[Mar 28 09:35:46 900463][SD][3934][info ] - AN GroupJoin leave request MAD TID 0x30b
  • Other observation:
    OSU benchmarks did not produce the above error.

The log message from sharp daemon is too cryptic. Suggestions to further diagnose the issue are much appreciated.

Regards.

What is the MOFED version?
how many HCA per server? is there any binding of the process to HCA? Can you check if you can run with 1 process per
server?

Hi,

  1. MOFED version:
    We are using MLNX_OFED_LINUX-5.4-3.1.0.0

  2. HCA per server:
    For HGX-A100, there are 10 HCA per server, i.e. mlx5_{0…9}

  3. Process binding to HCA:
    As I understand, the process binding to rail is automatically handled by UCX.
    We do not use any specific binding options.

  4. Test with one process per server:
    In case of GROMACS, we are using full 64 cores per node, i.e a CPU:GPU ratio of 8:1.
    When sharp_connect_tree error appears, the total number of processes is 64x4 = 256.
    In any case, we will reduce the number of process per node per your suggestion and see if the problem persists.

Regards.