Cannot compile mpihello.c after installing HPC SDK

Hi there. I installed nvhpc_sdk_20.7 on CentOS 7.9 with CUDA driver 11.0. After installation, I ran nvcc --version and mpirun --version, and both work correctly. But when I tried to build and run the official example (…/examples/MPI/samples/mpihello), I ran into the following error:

make
mpif90  -fast  -o mpihello.out mpihello.f
mpif90  -fast  -o mpihello_f90.out mpihello.f90
mpicc  -fast -Bdynamic  -o myname.out myname.c
"myname.c", line 16: warning: function "printf" declared implicitly
         printf("My name is %s\n",hname);
         ^

"myname.c", line 11: warning: variable "ierr" was declared but never referenced
         int len,ierr;
                 ^

--------------- Executing mpihello.out ----------------------
mpirun -np 2 ./mpihello.out
[localhost:120913] *** Process received signal ***
[localhost:120913] Signal: Floating point exception (8)
[localhost:120913] Signal code: Integer divide-by-zero (1)
[localhost:120913] Failing at address: 0x7f1ba84b41a9
[localhost:120913] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f1ba7257630]
[localhost:120913] [ 1] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-pal.so.40(+0x1151a9)[0x7f1ba84b41a9]
[localhost:120913] [ 2] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-pal.so.40(+0x116522)[0x7f1ba84b5522]
[localhost:120913] [ 3] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-pal.so.40(+0x1142a8)[0x7f1ba84b32a8]
[localhost:120913] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-pal.so.40(+0x113d78)[0x7f1ba84b2d78]
[localhost:120913] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-pal.so.40(+0x120abe)[0x7f1ba84bfabe]
[localhost:120913] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-pal.so.40(opal_hwloc1117_hwloc_topology_load+0x19b)[0x7f1ba84be01b]
[localhost:120913] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-pal.so.40(opal_hwloc_base_get_topology+0x2fe)[0x7f1ba8494c6e]
[localhost:120913] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-rte.so.40(+0x75ae9)[0x7f1ba88caae9]
[localhost:120913] [ 9] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-rte.so.40(orte_init+0x296)[0x7f1ba8940ef6]
[localhost:120913] [10] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/../../lib/libopen-rte.so.40(orte_submit_init+0xb50)[0x7f1ba8941be0]
[localhost:120913] [11] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpirun[0x4013f7]
[localhost:120913] [12] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpirun[0x401302]
[localhost:120913] [13] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1ba681e555]
[localhost:120913] [14] /opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpirun[0x401219]
[localhost:120913] *** End of error message ***
/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/mpi/bin/mpirun: line 15: 120913 Floating point exception(core dumped) $MY_DIR/.bin/$EXE "$@"
make: *** [run] Error 136

I noticed that it says “Signal code: Integer divide-by-zero (1)”,
but the code in mpihello.c is simply the following:

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include<stdio.h>
#include<Winsock2.h>
#pragma comment(lib, "Ws2_32.lib")
#else
#include <unistd.h>
#endif
#include "mpi.h"
        main(int argc, char **argv){
         int len,ierr;
         char hname[32];
         len = 32;
         MPI_Init( &argc, &argv );
         gethostname(hname,len);
         printf("My name is %s\n",hname);
         MPI_Finalize( );
        }

There is nothing about an integer divide-by-zero in it.
I have no idea what these errors are about, and I searched the forums but found no similar topic.

Your mpihello.out is apparently an executable built from a Fortran program. All of them appear to compile; the issue is when you run.

Hi HPCNewBee,

As noted above, the error is coming from the run of “mpihello.out”, which is the F77 version. Would you mind posting the source for this file so we can help determine the issue?

-Mat

It was the example installed with the SDK; I did not modify any code :(

Does the crash happen with 1 rank as well? If so, add debugging flags to the compile, run it in gdb, and paste the backtrace.
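
For example, a quick sketch (assuming the 20.7 compiler wrappers are on your PATH; adjust the source/executable names for whichever sample is crashing):

# rebuild the crashing sample with debug info and no optimization
mpif90 -g -O0 -o mpihello.out mpihello.f
# run a single rank under gdb; "bt" prints the backtrace after the crash
mpirun -np 1 gdb -ex run -ex bt --args ./mpihello.out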

FYI, I tested the sample program on a CentOS 7.9 system here using the Open MPI 3.1.5 that shipped with 20.7 and saw no issues. The problem is likely specific to your system.

Could you be compiling with one version of MPI but using a different version of mpirun? Or do you have LD_LIBRARY_PATH set so that a different set of MPI runtime libraries is being picked up?

What is the output from the following commands:

  • which mpif90
  • mpif90 -V
  • which mpirun
  • echo $LD_LIBRARY_PATH
  • ldd mpihello.out
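
If a mismatched mpirun or runtime does show up, one way to confirm is to pin everything to the Open MPI bundled with 20.7 (a sketch; the paths assume the default install location under /opt/nvidia/hpc_sdk):

export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/mpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/comm_libs/mpi/lib:$LD_LIBRARY_PATH
which mpirun    # should now point into comm_libs/mpi/bin
mpirun -np 2 ./mpihello.out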

-Mat