[SHARP] Failed to initialize SHArP collectives

I am encountering a SHARP-related error when testing qe/6.8.
System: AMD EPYC 7543 and 8 x A100-SXM-80GB
OS: CentOS 7.9.2009 with 3.10.0-1160 kernel
Env: HPC-X from nvidia_hpc_sdk/21.9

The system is exactly the same as in my previous post, with the exception that we are now using HPC-X from SDK/21.9.

https://forums.developer.nvidia.com/t/sharp-error-in-sharp-connect-tree/209506

Since there is not much information on the impact of SHARP collectives on scientific computing software such as GROMACS/LAMMPS/QE, we are conducting a systematic investigation.

[Problem description]
QE immediately crashes when SHARP is enabled.

  • stderr
[LOG_CAT_SHARP] Failed to initialize SHArP collectives:Cannot connect to SHArPD(-8)  job ID:1648951297
[LOG_CAT_SHARP] Fallback is disabled. exiting ...
  • stdout
[gpu30:0:21648 unique id 0] DEBUG libsharp<->sharpd: abstract socket name [sharpd_hpcx_2.5.0]
[gpu30:0:21648 unique id 1648951297] ERROR Not connected in sharp_init_client_session.
[gpu30:0:21648 - context.c:276] ERROR failed to open sharp session with SHARPD daemon. please check daemon status
  • sharpd status on gpu30
sharpd.service - SHARP Daemon (sharpd). Version: 2.5.1.MLNX20210812.e3c2616
   Loaded: loaded (/etc/systemd/system/sharpd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sharpd.service.d
           └─Service.conf
   Active: active (running) since Fri 2022-03-25 13:37:01 KST; 2 days ago
 Main PID: 3930 (sharpd)
    Tasks: 6
   Memory: 61.8M
   CGroup: /system.slice/sharpd.service
           └─3930 /opt/mellanox/sharp/bin/sharpd -P -O -/etc/sharp/sharpd.cfg
  • We have SHARP from Mellanox installed at /opt/mellanox/sharp/
./sharp_hello -d mlx5_0:1
Test Passed.
  • SHARP from HPC-X
./sharp_hello -d mlx5_0:1
[gpu30:0:46020 unique id 11460005204950825110] ERROR Not connected in sharp_init_client_session.

[gpu30:0:46020 - context.c:276] ERROR failed to open sharp session with SHARPD daemon. please check daemon status
sharp_coll_init failed: Cannot connect to SHArPD
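
One detail in the logs above may be worth checking: the HPC-X client looks for an abstract socket named sharpd_hpcx_2.5.0, while the running daemon is the 2.5.1 build from Mellanox, which may listen under a different name. A minimal sketch for comparing the two, assuming `ss` (from iproute2) is available on the node; the log line is copied from the stdout above:

```shell
# Debug line copied from the QE stdout above.
log_line='[gpu30:0:21648 unique id 0] DEBUG libsharp<->sharpd: abstract socket name [sharpd_hpcx_2.5.0]'

# Extract the abstract socket name the client library tried to reach.
wanted=$(printf '%s\n' "$log_line" | sed -n 's/.*abstract socket name \[\(.*\)\].*/\1/p')
echo "client expects: @${wanted}"

# List listening UNIX sockets; abstract sockets are shown with a leading '@'.
if command -v ss >/dev/null 2>&1; then
    ss -xl | grep -i sharpd || echo "no listening socket matching 'sharpd'"
fi
```

If the name printed by `ss` differs from the one the client expects, that would be consistent with the "Not connected" error, although it does not prove it is the only cause.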

So clearly it's a daemon-related issue. My questions are:

  • How can I force HPC-X to use the sharpd from Mellanox?
  • Must I replace the sharpd from Mellanox with the one from HPC-X? This would be quite inconvenient, since we have other source-built OpenMPI installations that use the system-wide SHARP.

Regards.

Hi,
What is the current status?
This looks like an issue with sharpd.
Thanks,

Hi,

We solved the problem by pointing HPCX_SHARP_DIR to Mellanox’s OFED installation directory.
(https://docs.nvidia.com/networking/display/SHARPv261/Setting+up+NVIDIA+SHARP+Environment)
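
For anyone hitting the same error, the change can be sketched roughly as follows. This assumes the MLNX_OFED SHARP prefix is /opt/mellanox/sharp (the path mentioned in the original post) and that the HPC-X environment has already been loaded. Prepending lib/ to LD_LIBRARY_PATH and the bin/ location of sharp_hello are assumptions about the layout, not something confirmed in this thread:

```shell
# Assumption: /opt/mellanox/sharp is the system-wide SHARP prefix from MLNX_OFED.
export HPCX_SHARP_DIR=/opt/mellanox/sharp

# Assumption: the client libraries live under lib/ of that prefix.
export LD_LIBRARY_PATH="${HPCX_SHARP_DIR}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Re-run the connectivity check from the original post, if the binary is present.
if [ -x "${HPCX_SHARP_DIR}/bin/sharp_hello" ]; then
    "${HPCX_SHARP_DIR}/bin/sharp_hello" -d mlx5_0:1
else
    echo "sharp_hello not found under ${HPCX_SHARP_DIR}/bin" >&2
fi
```

These exports need to come after the HPC-X environment is loaded, otherwise HPCX_SHARP_DIR may be reset to the HPC-X copy.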

When checking the debug output, we encountered the following non-critical error:

5852.stderr:[INFO ] [gpu31:11:15124 - cuda_util.c:379] DEBUG cuda wrapper lib not found. CUDA is disabled. ret:28 /opt/mellanox/sharp/lib/libsharp_coll_cuda_wrapper.so: cannot open shared object file: No such file or directory
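
A quick way to see which SHARP installation ships the two wrapper libraries is sketched below. The function name check_sharp_wrappers is a hypothetical helper, and the directory passed to it is the one from the message above; point it at the lib directory of the HPC-X SHARP copy to compare:

```shell
# check_sharp_wrappers (hypothetical helper): report whether the CUDA and
# GDRCopy wrapper libraries exist in the given lib directory.
check_sharp_wrappers() {
    libdir=$1
    for lib in libsharp_coll_cuda_wrapper.so libsharp_coll_gdrcopy_wrapper.so; do
        if [ -e "${libdir}/${lib}" ]; then
            echo "found:   ${libdir}/${lib}"
        else
            echo "missing: ${libdir}/${lib}"
        fi
    done
}

# Directory taken from the debug message above.
check_sharp_wrappers /opt/mellanox/sharp/lib
```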

We would appreciate it if you could clarify the following differences between the SHARP binaries distributed with HPC-X and with MLNX_OFED:

  1. Collectives on GPU buffers are supported only by the SHARP bundled with HPC-X. Is my understanding correct?
  2. What are the roles of the following wrappers from HPC-X?
    • libsharp_coll_cuda_wrapper.so
    • libsharp_coll_gdrcopy_wrapper.so

Regards.