HPC SDK 21.09 OpenMPI + lmod + Slurm

Hi,

I’m trying to run nccl-tests, compiled against the software from the lmod module nvhpc/21.9, via Slurm on two DGX A100 machines. Slurm is configured with “MpiDefault: pmix”. The Slurm job fails with the following error:

[xxx:1090775] OPAL ERROR: Error in file …/…/…/…/…/opal/mca/pmix/pmix3x/pmix3x_client.c at line 112

The application appears to have been direct launched using “srun”,
but OMPI was not built with SLURM’s PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

Running nccl-tests through Slurm with OpenMPI 4.1.2 from the NVIDIA deb package works.

“/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info --parsable | grep -i pmi” gives:
mca:pmix:isolated:version:"mca:2.1.0"
mca:pmix:isolated:version:"api:2.0.0"
mca:pmix:isolated:version:"component:4.1.2"
mca:pmix:flux:version:"mca:2.1.0"
mca:pmix:flux:version:"api:2.0.0"
mca:pmix:flux:version:"component:4.1.2"
mca:pmix:pmix3x:version:"mca:2.1.0"
mca:pmix:pmix3x:version:"api:2.0.0"
mca:pmix:pmix3x:version:"component:4.1.2"
mca:ess:pmi:version:"mca:2.1.0"
mca:ess:pmi:version:"api:3.0.0"
mca:ess:pmi:version:"component:4.1.2"

“/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info --parsable | grep -i pmi” gives:

mca:pmix:isolated:version:"mca:2.1.0"
mca:pmix:isolated:version:"api:2.0.0"
mca:pmix:isolated:version:"component:4.0.5"
mca:pmix:pmix3x:version:"mca:2.1.0"
mca:pmix:pmix3x:version:"api:2.0.0"
mca:pmix:pmix3x:version:"component:4.0.5"
mca:ess:pmi:version:"mca:2.1.0"
mca:ess:pmi:version:"api:3.0.0"
mca:ess:pmi:version:"component:4.0.5"
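
In case it helps with comparing the two builds, here is a minimal sketch that reduces each listing to its unique framework:component pairs and diffs them. The function names pmi_components and compare_pmi_components are my own (hypothetical helpers, not part of Open MPI), and the sketch assumes the version lines always have the form mca:FRAMEWORK:COMPONENT:version:... as in the output above.

```shell
#!/bin/sh
# Reduce `ompi_info --parsable` output (read from stdin) to the unique
# framework:component pairs on lines that mention PMI, e.g. "pmix:pmix3x".
# Assumes lines of the form mca:FRAMEWORK:COMPONENT:version:... as shown above.
pmi_components() {
    grep -i pmi \
        | awk -F: '$1 == "mca" && $4 == "version" { print $2 ":" $3 }' \
        | LC_ALL=C sort -u
}

# Hypothetical helper: diff the PMI component lists of two ompi_info binaries.
# $1 and $2 are paths to the two ompi_info executables to compare.
compare_pmi_components() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    "$1" --parsable | pmi_components > "$a"
    "$2" --parsable | pmi_components > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Called with the deb package's ompi_info as the first argument and the HPC SDK's as the second, and given the listings above, the only line diff should report is pmix:flux, which is present in the 4.1.2 build and absent from the 4.0.5 one.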

Both builds list the pmix3x component, so what is missing in the HPC SDK version of OpenMPI?

Thanks,
Matthias