I am encountering a SHARP-related error when testing qe/6.8.
System: AMD EPYC 7543 and 8 x A100-SXM-80GB
OS: CentOS 7.9.2009 with 3.10.0-1160 kernel
Env: HPC-X from nvidia_hpc_sdk/21.9
The system is exactly the same as in my previous post, with the exception that we are now using HPC-X from SDK/21.9.
https://forums.developer.nvidia.com/t/sharp-error-in-sharp-connect-tree/209506
Since there is little information on the impact of SHARP collectives on scientific computing software such as GROMACS, LAMMPS, and QE, we are conducting a systematic investigation.
[Problem description]
QE immediately crashes when SHARP is enabled.
- stderr
[LOG_CAT_SHARP] Failed to initialize SHArP collectives:Cannot connect to SHArPD(-8) job ID:1648951297
[LOG_CAT_SHARP] Fallback is disabled. exiting ...
- stdout
[gpu30:0:21648 unique id 0] DEBUG libsharp<->sharpd: abstract socket name [sharpd_hpcx_2.5.0]
[gpu30:0:21648 unique id 1648951297] ERROR Not connected in sharp_init_client_session.
[gpu30:0:21648 - context.c:276] ERROR failed to open sharp session with SHARPD daemon. please check daemon status
- sharpd status on gpu30
sharpd.service - SHARP Daemon (sharpd). Version: 2.5.1.MLNX20210812.e3c2616
Loaded: loaded (/etc/systemd/system/sharpd.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/sharpd.service.d
└─Service.conf
Active: active (running) since Fri 2022-03-25 13:37:01 KST; 2 days ago
Main PID: 3930 (sharpd)
Tasks: 6
Memory: 61.8M
CGroup: /system.slice/sharpd.service
└─3930 /opt/mellanox/sharp/bin/sharpd -P -O -/etc/sharp/sharpd.cfg
- SHARP from Mellanox, installed at /opt/mellanox/sharp/
./sharp_hello -d mlx5_0:1
Test Passed.
- SHARP from HPC-X
./sharp_hello -d mlx5_0:1
[gpu30:0:46020 unique id 11460005204950825110] ERROR Not connected in sharp_init_client_session.
[gpu30:0:46020 - context.c:276] ERROR failed to open sharp session with SHARPD daemon. please check daemon status
sharp_coll_init failed: Cannot connect to SHArPD
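One thing that stands out in the debug output is that the HPC-X client looks for an abstract socket named sharpd_hpcx_2.5.0. A minimal sketch for checking which sharpd abstract sockets actually exist on the node is shown here, assuming the usual Linux /proc/net/unix layout in which abstract sockets are listed with a leading "@". The helper name abstract_sockets is hypothetical, and the idea that the system-wide daemon listens under a different socket name is only an assumption that I have not confirmed.

```python
from pathlib import Path


def abstract_sockets(proc_net_unix_text, needle="sharpd"):
    """Return sorted abstract socket names containing `needle`.

    Expects the text of /proc/net/unix, where the socket path is the
    8th column and abstract sockets are displayed with a leading '@'.
    """
    names = set()
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # unnamed socket: no path column
        path = fields[7]
        if path.startswith("@") and needle in path:
            names.add(path[1:])  # drop the '@' marker
    return sorted(names)


if __name__ == "__main__":
    # Socket name taken from the DEBUG line in the stdout log above.
    wanted = "sharpd_hpcx_2.5.0"
    proc_file = Path("/proc/net/unix")
    if proc_file.exists():
        found = abstract_sockets(proc_file.read_text(errors="replace"))
        print("sharpd abstract sockets:", found)
        print("socket expected by HPC-X client present:", wanted in found)
```

If the expected name is missing from the list while another sharpd socket is present, that would suggest the client and the running daemon simply disagree on the socket name, rather than the daemon being down.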
So clearly it is a daemon-related issue. My questions are:
- How can I force HPC-X to use the sharpd from Mellanox?
- Must I replace the sharpd from Mellanox with the one from HPC-X? This would be quite inconvenient, since we have other source-built OpenMPI installations that use the system-wide SHARP.
Regards.