I installed the NVIDIA HPC SDK 24.5 on my Ubuntu 22.04 server a while ago. At first it worked just fine. Here are my environment variables:
export NVARCH=`uname -s`_`uname -m`
export NVCOMPILERS=/opt/nvidia/hpc_sdk
export PATH=$NVCOMPILERS/$NVARCH/24.5/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/24.5/comm_libs/mpi/man
export PATH=$NVCOMPILERS/$NVARCH/24.5/compilers/bin:$PATH
Let’s say I have an MPI Fortran program named hello.f90, like this:
program main
  use mpi
  implicit none
  integer :: rank, size, mpierr
  call MPI_Init(mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, mpierr)
  print *, "rank", rank
  call MPI_Finalize(mpierr)
end program main
At first I could run the executable directly, without mpiexec, and it would print “rank 0” on the screen. Recently, though, running it the same way produces these errors:
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
opal_init:startup:internal-failure
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/24.5/hpcx-12/254129-rel-1/comm_libs/12.4/hpcx/hpcx-2.19/ompi/share/openmpi/help-opal-runtime.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
orte_init:startup:internal-failure
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/24.5/hpcx-12/254129-rel-1/comm_libs/12.4/hpcx/hpcx-2.19/ompi/share/openmpi/help-orte-runtime: No such file or directory. Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
mpi_init:startup:internal-failure
But I couldn't open the help file:
/proj/nv/libraries/Linux_x86_64/24.5/hpcx-12/254129-rel-1/comm_libs/12.4/hpcx/hpcx-2.19/ompi/share/openmpi/help-mpi-runtime.txt: No such file or directory. Sorry!
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[v100gpu44:712455] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
When I googled the issue, I was told it happens because OPAL_PREFIX is not set in my environment. So I set OPAL_PREFIX and tried again. This time the process seems to hang: it prints nothing and keeps occupying the shell session. I don’t know how to fix this.
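For reference, here is a minimal sketch of what setting OPAL_PREFIX could look like. The directory is an assumption: it takes the tail of the build path shown in the error messages (comm_libs/12.4/hpcx/hpcx-2.19/ompi) and places it under the install root from the environment variables above. It is not confirmed as the correct location for every install, so the sketch only reports whether the directory exists.

```shell
# Same base variables as in the environment setup above.
NVARCH="$(uname -s)_$(uname -m)"
NVCOMPILERS=/opt/nvidia/hpc_sdk

# ASSUMPTION: the installed Open MPI tree mirrors the tail of the build path
# seen in the error messages. Verify this directory on your own system.
export OPAL_PREFIX="$NVCOMPILERS/$NVARCH/24.5/comm_libs/12.4/hpcx/hpcx-2.19/ompi"

# Report whether the assumed prefix actually exists before relying on it.
if [ -d "$OPAL_PREFIX" ]; then
    echo "OPAL_PREFIX points to an existing directory: $OPAL_PREFIX"
else
    echo "OPAL_PREFIX directory not found: $OPAL_PREFIX"
fi
```

If the directory is reported as missing, the prefix guess is wrong for that install and the help files referenced in the errors would still not be found.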
Any help would be appreciated. In any case, thanks very much for helping.