runtime segfault: mpich2-1.3.2 with pgi v11.5 on rhel5.6 sys

Hi Everybody,

I need some help with mpich2-1.3.2 built with pgi v11.5 on a rhel5.6 system.

Previously I compiled and ran mpich2-1.3.2 successfully with pgi
v10.0 on rhel5.6. Recently I upgraded the pgi compiler from v10.0 to
v11.5 and followed exactly the same steps to compile mpich2-1.3.2.
The compilation was fine, but I get a segfault when I try to run
mpiexec or mpirun.

The steps are below:

[root@flatline mpich2-1.3.2]# env CC=pgcc FC=pgf90 F77=pgf77 CXX=pgCC ./configure --prefix=/home/lgu/mpich2_install --enable-shared
[root@flatline mpich2-1.3.2]# make
[root@flatline mpich2-1.3.2]# make install
[root@flatline mpich2-1.3.2]#
[root@flatline mpich2-1.3.2]# which mpiexec
/home/lgu/mpich2_install/bin/mpiexec
[root@flatline mpich2-1.3.2]# which mpicc
/home/lgu/mpich2_install/bin/mpicc
[root@flatline mpich2-1.3.2]# which pgcc
/usr/pgi/linux86-64/11.5/bin/pgcc
[root@flatline mpich2-1.3.2]#
[root@flatline mpich2-1.3.2]# cd examples/
[root@flatline examples]# mpicc -o cpi cpi.c
[root@flatline examples]# mpiexec -hosts master -np 1 ./cpi
Segmentation fault
[root@flatline examples]# mpiexec
Segmentation fault
[root@flatline examples]#

When I ran “strace mpiexec”, it showed mmap trying to allocate a
huge amount of memory and failing:

open("/sys/devices/system/node", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
brk(0) = 0x19125000
brk(0x1914e000) = 0x1914e000
getdents(3, /* 4 entries */, 32768) = 112
getdents(3, /* 0 entries */, 32768) = 0
close(3) = 0
mmap(NULL, 18446744073223036928, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 18446744073223168000, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2b4e88607000
munmap(0x2b4e88607000, 60788736) = 0
munmap(0x2b4e90000000, 6320128) = 0
mprotect(0x2b4e8c000000, 135168, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 18446744073223036928, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++
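
A side note on the numbers in the trace: both failing mmap lengths sit just below 2^64, which is what a negative size looks like once it has been converted to an unsigned 64-bit value such as size_t. The short Python sketch below reinterprets the two lengths as signed 64-bit integers. It is only a reading of the strace output, not a confirmed diagnosis, and the helper name as_signed_64 is made up for illustration.

```python
# Reinterpret the length argument of the failing mmap calls as a signed
# 64-bit integer (two's complement). A size computed from a negative
# value would show up in strace as a huge unsigned number like these.

def as_signed_64(value: int) -> int:
    """Return the two's-complement signed reading of an unsigned 64-bit value."""
    if not 0 <= value < 2**64:
        raise ValueError("expected an unsigned 64-bit value")
    return value - 2**64 if value >= 2**63 else value

# Lengths copied from the strace output of mpiexec.
mmap_lengths = [18446744073223036928, 18446744073223168000]

for length in mmap_lengths:
    print(f"{length} -> {as_signed_64(length)}")
```

Read this way, the two requests are for -486514688 and -486383616 bytes, which would suggest that a size or offset went negative before the allocation, rather than that the machine is genuinely short of memory.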




I have also tried configuring with “CC=pgcc FC=pgfortran
F77=pgfortran CXX=pgcpp CFLAGS=-fast FCFLAGS=-fast FFLAGS=-fast
CXXFLAGS=-fast”, as the pgi guide suggests, and I got the same segfault.

Does anybody know what could be causing this problem?

BTW, this happens on rhel5.6. I have successfully built and run
mpich2-1.3.2 with the new pgi v11.5 on rhel4.9.


Thank you very much for any help on this!

Limin

Hello,

The first thing to check is whether the multi-threaded runtime libs,
which are linked by default with the 11.* compilers but not with earlier
ones, could be causing this.


Try adding LDFLAGS=-nomp

and rebuilding, and see what happens.



dave

You might also want to link any application that uses MPICH2 with -nomp,
to determine if that changes its behavior.

dave

Thanks Dave!

I re-ran ./configure with “LDFLAGS=-nomp” added and rebuilt mpich2, but mpiexec still segfaults. I am not even running any other program: just “mpiexec” or “mpirun” without any arguments segfaults.

I also get a similar segfault with openmpi, but mvapich2 works fine.

I also tried different installations of pgi 11.5 on different rhel5.6 systems and got the same segfault for both the mpich2 and openmpi builds.

My previous pgi version was 10.0. Maybe I should try another version of pgi; which one is better, 10.9 or 11.4?

Limin

I think 10.9 would be the better choice if 11.* releases are failing for
this case.

Have you tried building the MPI library with 10.9, and then building the application
with 11.5, linking in the 10.9 MPI lib?

Thanks Dave!

I tried v11.4: same failure, an immediate segfault. I also just tried 10.9 and have seen no segfault so far; I’ll do more testing.

No, I haven’t tried mixing different versions of pgi for compiling and linking.