Seg-fault in hybrid (OpenMPI + OpenMP) application

I have a problem with my hybrid (OpenMPI + OpenMP) application written in Fortran 90 (PGI Fortran 8.0-6, the latest version).

The error is as follows:
[server:02656] *** Process received signal ***
[server:02656] Signal: Segmentation fault (11)
[server:02656] Signal code: Address not mapped (1)
[server:02656] Failing at address: 0x283a3b70
[server:02656] *** End of error message ***
Segmentation fault

$ldd Solver (my application):
libmpi_f90.so.0 => /opt/mpi/openmpi-pgi/lib/libmpi_f90.so.0 (0x00002ba7d7267000)
libmpi_f77.so.0 => /opt/mpi/openmpi-pgi/lib/libmpi_f77.so.0 (0x00002ba7d746a000)
libmpi.so.0 => /opt/mpi/openmpi-pgi/lib/libmpi.so.0 (0x00002ba7d7980000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003eeb800000)
libopen-rte.so.0 => /opt/mpi/openmpi-pgi/lib/libopen-rte.so.0 (0x00002ba7d8072000)
libopen-pal.so.0 => /opt/mpi/openmpi-pgi/lib/libopen-pal.so.0 (0x00002ba7d8354000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003eeb400000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003ef3800000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003ef8a00000)
libpgmp.so => /opt/pgi/linux86-64/8.0-6/libso/libpgmp.so (0x00002ba7d8631000)
libpgbind.so => /opt/pgi/linux86-64/8.0-6/libso/libpgbind.so (0x00002ba7d875b000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003eea000000)
libpgf90.so => /opt/pgi/linux86-64/8.0-6/libso/libpgf90.so (0x00002ba7d885d000)
libpgf90_rpm1.so => /opt/pgi/linux86-64/8.0-6/libso/libpgf90_rpm1.so (0x00002ba7d8c19000)
libpgf902.so => /opt/pgi/linux86-64/8.0-6/libso/libpgf902.so (0x00002ba7d8d1b000)
libpgf90rtl.so => /opt/pgi/linux86-64/8.0-6/libso/libpgf90rtl.so (0x00002ba7d8e2e000)
libpgftnrtl.so => /opt/pgi/linux86-64/8.0-6/libso/libpgftnrtl.so (0x00002ba7d8f51000)
libpgc.so => /opt/pgi/linux86-64/8.0-6/libso/libpgc.so (0x00002ba7d907f000)
librt.so.1 => /lib64/librt.so.1 (0x0000003eef400000)
libm.so.6 => /lib64/libm.so.6 (0x0000003eeb000000)
libc.so.6 => /lib64/libc.so.6 (0x0000003eeac00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ee9c00000)

Compile/linker flags are:
FFLAGS = -mcmodel=medium -fastsse -O3 -Mcache_align -mp
LOADOPTS= $(FFLAGS)
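
The two builds being compared below (with and without OpenMP) can be sketched as a pair of compile lines; the wrapper compiler name `mpif90` and the source file name `Solver.f90` are assumptions, and the commands are only echoed here since the toolchain may not be present:

```shell
# Flags from the Makefile above; "-mp" enables the PGI OpenMP runtime.
FFLAGS="-mcmodel=medium -fastsse -O3 -Mcache_align -mp"

# Hypothetical build with OpenMP enabled (names are assumptions):
echo mpif90 $FFLAGS -o Solver Solver.f90

# Same flags with only "-mp" stripped, to isolate the OpenMP runtime:
echo mpif90 ${FFLAGS% -mp} -o Solver Solver.f90
```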

However, if I remove “-mp”, it runs without faults.
Furthermore, if I use PGI Fortran 6.2, it works well with “-mp”.
If I change the order of libc and libopen-pal using a linker option, it also works with “-mp”; that is, a link order such as

libm.so.6 => /lib64/libm.so.6 (0x0000003eeb000000)
libc.so.6 => /lib64/libc.so.6 (0x0000003eeac00000)
libopen-rte.so.0 => /opt/mpi/openmpi-1.3.2-pgi6.2/lib/libopen-rte.so.0 (0x00002ad3e2764000)
libopen-pal.so.0 => /opt/mpi/openmpi-1.3.2-pgi6.2/lib/libopen-pal.so.0 (0x00002ad3e2a47000)

is OK.

I have seen the same trouble reported in this forum before, but it seems to be unsolved. I think the trouble may be caused by PGI Fortran 8.0, specifically a conflict between OpenMP and OpenMPI. Can anyone help me?

tmishima

Hi tmishima,

Sounds like a very difficult problem, and one I don’t have great insight into. What I would suggest is that you use the PGI CDK debugger, pgdbg, to determine the root cause of the seg fault. We could guess, but there are too many variables.

pgdbg does support debugging of hybrid applications. The one caveat is that you need to use the debug versions of MPICH or MPICH-2 that we include with the CDK product.

  • Mat

Hi Mat,

Thank you for your quick reply and suggestion to use PGI debugger.

My application was initially developed with MPICH/MPICH-2, and I recently switched to OpenMPI. Of course, it works well with MPICH/MPICH-2; I first encountered the problem right after switching to OpenMPI. Therefore, unfortunately, I cannot debug it with MPICH/MPICH-2.

I suspect that the PGI Fortran library (or the OpenMPI library) has some problem. Can I submit a trouble report to PGI user support so that it can be fixed?

tmishima

Hi tmishima,

Sure. Please send an email to PGI customer support (trs@pgroup.com) and ask them to forward it to me. Please include detailed directions on how to obtain, build, and run your code, as well as any data files. I will do my best to reproduce and diagnose the problem here, but I cannot guarantee success.

One question: does your program require “-mcmodel=medium”? I would try building without that flag, in case it is the issue.

  • Mat

Hi Mat,

OK, I will send an email with the required information to PGI support next week.

My application needs the libraries listed below.

  1. Basic library
    OpenMPI, ACML , BLACS, SCALAPACK
  2. Mathematical library
    METIS, MUMPS

I assume that you already have the basic ones (OpenMPI, …, SCALAPACK).
I will include .tar.gz files of METIS and MUMPS only.
If you need prebuilt METIS & MUMPS, please let me know.
Of course, I will also include the source code, makefile, and a sample data file for my application.

Finally, I have already checked without “-mcmodel=medium”, but it didn’t work…
(Anyway, my application doesn’t need this flag for the small sample data.)

Thank you for your cooperation!

tmishima

Hi Mat,

Additional information for you.

I checked the OpenMPI FAQ closely and finally found that
mpirun --mca mpi_leave_pinned 0 …
is effective: no seg-fault! This parameter means that Open MPI’s built-in ptmalloc2 is not used.

I guess that OpenMPI’s ptmalloc2 and the code PGI Fortran generates for OpenMP parallel regions conflict. I’m not sure whether the root cause lies in PGI or in OpenMPI, but I would be very glad if you could investigate the issue from this point of view.

For a while, I’d like to use this parameter as a workaround.
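
For reference, Open MPI also lets MCA parameters be set outside the mpirun command line, via its documented `OMPI_MCA_` environment-variable prefix or the per-user file `$HOME/.openmpi/mca-params.conf`, so the workaround does not have to be retyped on every run. A minimal sketch (the application name `Solver` is from this thread; the process count is an assumption, and the mpirun line is only echoed since no cluster is assumed here):

```shell
# 1. As an environment variable, picked up by mpirun at launch:
export OMPI_MCA_mpi_leave_pinned=0
echo mpirun -np 4 ./Solver   # equivalent to "mpirun --mca mpi_leave_pinned 0 ..."

# 2. Persistently, in the per-user MCA parameter file:
mkdir -p "$HOME/.openmpi"
echo "mpi_leave_pinned = 0" >> "$HOME/.openmpi/mca-params.conf"
```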

tmishima