MPI install and behaviour from nvidia-hpc-sdk

Greetings,

I’ve been trying the new HPC SDK for a while now, both locally on my PC and on a POWER9 cluster. Although it’s mostly working fine, we have had an issue with the OpenMPI package provided: if I compile a program with it, I cannot run the resulting binary without the “mpirun -np *” command; it complains:

[lucas-Precision-7730:12307] [[INVALID],INVALID] ORTE_ERROR_LOG: A system-required executable either could not be found or was not executable by this user in file …/…/…/…/…/orte/mca/ess/singleton/ess_singleton_module.c at line 572
[lucas-Precision-7730:12307] [[INVALID],INVALID] ORTE_ERROR_LOG: A system-required executable either could not be found or was not executable by this user in file …/…/…/…/…/orte/mca/ess/singleton/ess_singleton_module.c at line 172

Sorry! You were supposed to get help about:
orte_init:startup:internal-failure
But I couldn’t open the help file:
/proj/nv/libraries/Linux_x86_64/openmpi4/2020/195106-rel/share/openmpi/help-orte-runtime: No such file or directory. Sorry!


Sorry! You were supposed to get help about:
mpi_init:startup:internal-failure
But I couldn’t open the help file:
/proj/nv/libraries/Linux_x86_64/openmpi4/2020/195106-rel/share/openmpi/help-mpi-runtime.txt: No such file or directory. Sorry!

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[lucas-Precision-7730:12307] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Now, I’ve installed the package as recommended in the documentation and set the proper environment variables as specified there, at least on my own PC. Is it possible that I am missing some PATH, or is the provided MPI package poorly configured? I’ve tried with both 3.1.5 and 4.0.2, with identical results. Again, if I run the binary with mpirun -np 1 it runs perfectly, but for our code we need to make sure it can be run without it. Any help would be appreciated!
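
For reference, this is roughly how I reproduce it (the source file name is just a placeholder, our real code is larger):

# build a trivial MPI program with the SDK’s OpenMPI wrapper
mpicc hello_mpi.c -o hello_mpi
# launching it directly as a singleton fails with the ORTE errors above
./hello_mpi
# launching it through the launcher works fine
mpirun -np 1 ./hello_mpi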

In any case, thank you very much!

Hi Lucas,

What compiler version are you using?

This issue typically occurs when the OpenMPI “OPAL_PREFIX” environment variable doesn’t get set correctly. In the 20.7 release we missed setting this in the module files we generate, but were able to fix the issue in 20.9.

If you’re not using our module files, then you may need to set OPAL_PREFIX to the base OpenMPI directory in your environment. Something like

export OPAL_PREFIX=/opt/nvhpc/Linux_x86_64/20.11/comm_libs/mpi/
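
To sanity check the setting, you can confirm that the help file the runtime was complaining about actually exists under that prefix (adjust the path to match your install location):

# should list the file rather than report "No such file or directory"
ls $OPAL_PREFIX/share/openmpi/help-mpi-runtime.txt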

-Mat

Hi Mat!

Thanks for your quick response on both threads. Both on my PC and on the POWER9 I’m using the 20.9 version of the compilers. Our team did notice something off with 20.7, so we are exclusively testing 20.9.
On my local machine I can safely say it’s the OPAL issue: it’s not mentioned in the install procedures, so I never set this variable. I’ll talk to our support team about trying it on P9; it might be the case there as well. I’ll get back to you tomorrow with the results if we have them.

In any case, thank you very much for the assist! Best!