Hi all,
I’m attempting to rebuild HPCX 2.19 with Slurm + PMIx support as described here. Specifically, I’m using the hpcx-v2.19-gcc-mlnx_ofed-redhat9-cuda12-x86_64 distribution.
I believe I’ve noticed a bug in utils/hpcx_rebuild.sh. Could you take a look?
The possible bug is around line 136 of utils/hpcx_rebuild.sh:
. "${HPCX_ROOT}/${base_init_script}"
hpcx_load
ucx_dir=${HPCX_UCX_DIR}
ucc_dir=${HPCX_UCC_DIR}
hcoll_dir=${HPCX_HCOLL_DIR}
sharp_dir=${HPCX_SHARP_DIR}
hpcx_unload
set -eE
if [ "${rebuild_ucx}" = "yes" ]; then
    name=$(basename "${ompi_prefix}")
    ucx_prefix="$HPCX_ROOT/ucx/$name"
    if [ -d "${ucx_prefix}" ]; then
        echo "ERROR: directory '${ucx_prefix}' already exists"
        exit 1
    fi
    # unpack sources
    cd "${HPCX_ROOT}/sources"
It looks like the issue is that the function hpcx_unload (which is defined in hpcx-init.sh) unsets the HPCX_ROOT variable. So when hpcx_rebuild.sh reaches the command cd "${HPCX_ROOT}/sources", HPCX_ROOT expands to an empty string and the script ends up trying to cd /sources, which does not exist.
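The failure mode can be reproduced in isolation with a minimal sketch. The names demo_load, demo_unload, DEMO_ROOT and DEMO_UCX_DIR are hypothetical stand-ins, not the real functions or variables from hpcx-init.sh; the only assumption carried over is that the unload function unsets the root variable.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for hpcx_load / hpcx_unload (the real ones are in hpcx-init.sh).
demo_load()   { export DEMO_UCX_DIR="${DEMO_ROOT}/ucx"; }
demo_unload() { unset DEMO_UCX_DIR DEMO_ROOT; }   # assumed behavior: the root is unset too

DEMO_ROOT="/opt/demo"

demo_load
ucx_dir="${DEMO_UCX_DIR}"   # copied out while the environment is loaded
demo_unload

# DEMO_ROOT no longer exists, so the expansion collapses to "/sources".
sources_path="${DEMO_ROOT}/sources"
echo "path after unload: ${sources_path}"
```

The copied-out value (ucx_dir) survives the unload, but any later path built from the root variable itself does not.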
For now, it seems to work for me if I do something like this and run the patched script instead:
sed 's/HPCX_ROOT/_HPCX_ROOT/g' hpcx_rebuild.sh > hpcx_rebuild_patched.sh
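A narrower change that might also work is to save the root before loading and restore it after unloading. This is only a sketch and has not been tried against the real hpcx_rebuild.sh: demo_load, demo_unload, DEMO_ROOT, DEMO_UCX_DIR and saved_root are hypothetical names used for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for hpcx_load / hpcx_unload (the real ones are in hpcx-init.sh).
demo_load()   { export DEMO_UCX_DIR="${DEMO_ROOT}/ucx"; }
demo_unload() { unset DEMO_UCX_DIR DEMO_ROOT; }   # assumed behavior: the root is unset too

DEMO_ROOT="/opt/demo"
saved_root="${DEMO_ROOT}"   # private copy that the unload function does not touch

demo_load
ucx_dir="${DEMO_UCX_DIR}"
demo_unload

DEMO_ROOT="${saved_root}"   # restore before any later path is built from it

# ${VAR:?message} aborts the script if VAR is unset or empty,
# instead of silently expanding to "/sources".
sources_path="${DEMO_ROOT:?DEMO_ROOT is unset}/sources"
echo "path after restore: ${sources_path}"
```

The `:?` guard is optional, but it turns a silent wrong path into an immediate, readable error if the variable is ever lost again.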
Thanks,
Ron
Ron Rahaman
Research Scientist II, Research Software Engineer
Partnership for an Advanced Computing Environment (PACE)
Open Source Programming Office (OSPO)
Georgia Institute of Technology