Bug in HPCX 2.19 `utils/hpcx_rebuild.sh`?

Hi all,

I’m attempting to rebuild HPCX 2.19 with Slurm + PMIx support as described here. Specifically, I’m using the hpcx-v2.19-gcc-mlnx_ofed-redhat9-cuda12-x86_64 distribution.

I believe I’ve noticed a bug in utils/hpcx_rebuild.sh. Could you take a look?

The possible bug is around line 136 of utils/hpcx_rebuild.sh:

. "${HPCX_ROOT}/${base_init_script}"
hpcx_load
ucx_dir=${HPCX_UCX_DIR}
ucc_dir=${HPCX_UCC_DIR}
hcoll_dir=${HPCX_HCOLL_DIR}
sharp_dir=${HPCX_SHARP_DIR}
hpcx_unload

set -eE

if [ "${rebuild_ucx}" = "yes" ]; then
    name=$(basename "${ompi_prefix}")
    ucx_prefix="$HPCX_ROOT/ucx/$name"
    if [ -d "${ucx_prefix}" ]; then
        echo "ERROR: directory '${ucx_prefix}' already exists"
        exit 1
    fi

    # unpack sources
    cd "${HPCX_ROOT}/sources"

It looks like the issue is that the function hpcx_unload (defined in hpcx-init.sh) unsets the HPCX_ROOT variable. So when hpcx_rebuild.sh reaches the command cd "${HPCX_ROOT}/sources", it fails because the path expands to /sources, which does not exist.
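To illustrate the mechanism in isolation, here is a minimal sketch. The two functions are hypothetical stand-ins for the ones in hpcx-init.sh (the real ones do much more), and /opt/hpcx is an assumed install path used only for the example:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for hpcx_load/hpcx_unload from hpcx-init.sh.
# They only mimic the one behavior relevant here: unload clears HPCX_ROOT.
hpcx_load()   { export HPCX_UCX_DIR="${HPCX_ROOT}/ucx"; }
hpcx_unload() { unset HPCX_UCX_DIR HPCX_ROOT; }

HPCX_ROOT="/opt/hpcx"   # assumed install path, for illustration only

hpcx_load
ucx_dir=${HPCX_UCX_DIR}
hpcx_unload

# HPCX_ROOT is now unset, so the expansion collapses to "/sources".
sources_dir="${HPCX_ROOT}/sources"
echo "would cd to: ${sources_dir}"
```

The value captured in ucx_dir before the unload survives, but anything that expands HPCX_ROOT afterwards sees an empty string.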

For now, it seems to work for me if I do something like this and run the patched script instead:

sed 's/HPCX_ROOT/_HPCX_ROOT/g' hpcx_rebuild.sh > hpcx_rebuild_patched.sh
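A different approach that might also work is to save HPCX_ROOT before calling hpcx_unload and restore it afterwards. I have not tried this against the real script, so treat it as a sketch: the functions are hypothetical stand-ins for the ones in hpcx-init.sh, and saved_hpcx_root and /opt/hpcx are names I made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for hpcx_load/hpcx_unload from hpcx-init.sh.
hpcx_load()   { export HPCX_UCX_DIR="${HPCX_ROOT}/ucx"; }
hpcx_unload() { unset HPCX_UCX_DIR HPCX_ROOT; }

HPCX_ROOT="/opt/hpcx"              # assumed install path, for illustration only
saved_hpcx_root="${HPCX_ROOT}"     # keep a copy before the unload clears it

hpcx_load
ucx_dir=${HPCX_UCX_DIR}
hpcx_unload

HPCX_ROOT="${saved_hpcx_root}"     # restore so later paths resolve correctly

sources_dir="${HPCX_ROOT}/sources"
echo "would cd to: ${sources_dir}"
```

Compared with the sed rename, this touches only the lines around hpcx_unload and leaves the rest of the script unchanged.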

Thanks,
Ron


Ron Rahaman
Research Scientist II, Research Software Engineer
Partnership for an Advanced Computing Environment (PACE)
Open Source Programming Office (OSPO)
Georgia Institute of Technology

Thank you for sharing.

If this is indeed a bug, it will need to be logged, reproduced, and passed to our engineering team.
If you have a support contract with NVIDIA, you can open a support case at Networking-support@nvidia.com and we will address it from there.
