HPL Segfault Problem

Hi,

I have a segfault problem with HPL-2.0 built with PGI 8.0-6, Open MPI 1.4, and ACML 4.3.0.
Smaller matrices run fine, but HPL segfaults on larger ones that use more than 2GB per process.
Is this a limitation of HPL-2.0?
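
For reference, the 2GB/process threshold can be related to the problem size N in HPL.dat with a rough estimate. The sketch below assumes an N x N double-precision matrix spread evenly over a P x Q process grid and ignores HPL's workspace and padding; the function names are illustrative and not part of HPL.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate bytes of matrix data held by each process, assuming an
 * N x N matrix of 8-byte doubles split evenly over a P x Q grid.
 * Workspace, padding, and block-cyclic imbalance are ignored. */
uint64_t hpl_bytes_per_process(uint64_t n, uint64_t p, uint64_t q)
{
    assert(p > 0 && q > 0);
    return (n * n * 8u) / (p * q);
}

/* Number of doubles per process under the same assumptions. */
uint64_t hpl_elems_per_process(uint64_t n, uint64_t p, uint64_t q)
{
    return hpl_bytes_per_process(n, p, q) / 8u;
}

/* Returns 1 when the local byte count no longer fits in a signed
 * 32-bit integer, 0 otherwise. */
int exceeds_int32_bytes(uint64_t n, uint64_t p, uint64_t q)
{
    return hpl_bytes_per_process(n, p, q) > (uint64_t)INT32_MAX;
}
```

Under these assumptions, a single process with N = 16000 holds 2,048,000,000 bytes, just under 2^31 - 1, while N = 16384 reaches 2^31 bytes exactly.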

Anyway, please help me solve this problem. I can send my Make.arch and HPL.dat later to an e-mail address you specify.

Thank you in advance.

Regards,
tmishima

P.S. for your information

The error is:
[node09:21940] *** Process received signal ***
[node09:21940] Signal: Segmentation fault (11)
[node09:21940] Signal code: Address not mapped (1)
[node09:21940] Failing at address: 0x2aaa35217088
[node09:21940] *** End of error message ***

ldd output is:
libmpi_f90.so.0 => /home/mishima/app/openmpi-pgi/lib/libmpi_f90.so.0 (0x00002b0dd4944000)
libmpi_f77.so.0 => /home/mishima/app/openmpi-pgi/lib/libmpi_f77.so.0 (0x00002b0dd4b47000)
libmpi.so.0 => /home/mishima/app/openmpi-pgi/lib/libmpi.so.0 (0x00002b0dd4d77000)
libopen-rte.so.0 => /home/mishima/app/openmpi-pgi/lib/libopen-rte.so.0 (0x00002b0dd5031000)
libopen-pal.so.0 => /home/mishima/app/openmpi-pgi/lib/libopen-pal.so.0 (0x00002b0dd527c000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003eeb400000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003ef3800000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003ef8a00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003eeb800000)
libpgmp.so => /opt/pgi/linux86-64/8.0-6/libso/libpgmp.so (0x00002b0dd551a000)
libpgbind.so => /opt/pgi/linux86-64/8.0-6/libso/libpgbind.so (0x00002b0dd5644000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003eea000000)
libpgf90.so => /opt/pgi/linux86-64/8.0-6/libso/libpgf90.so (0x00002b0dd5746000)
libpgf90_rpm1.so => /opt/pgi/linux86-64/8.0-6/libso/libpgf90_rpm1.so (0x00002b0dd5b02000)
libpgf902.so => /opt/pgi/linux86-64/8.0-6/libso/libpgf902.so (0x00002b0dd5c04000)
libpgf90rtl.so => /opt/pgi/linux86-64/8.0-6/libso/libpgf90rtl.so (0x00002b0dd5d17000)
libpgftnrtl.so => /opt/pgi/linux86-64/8.0-6/libso/libpgftnrtl.so (0x00002b0dd5e3a000)
libpgc.so => /opt/pgi/linux86-64/8.0-6/libso/libpgc.so (0x00002b0dd5f68000)
librt.so.1 => /lib64/librt.so.1 (0x0000003eef400000)
libm.so.6 => /lib64/libm.so.6 (0x0000003eeb000000)
libc.so.6 => /lib64/libc.so.6 (0x0000003eeac00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ee9c00000)

Hi tmishima,

Sorry, I don’t know enough about HPL to know why this would occur.

You can try adding the “-Mlarge_arrays” flag in case it’s a loop indexing size issue. Otherwise, please contact the authors of HPL for help.

- Mat

Hi Mat,

Thank you for your advice.

The “-Mlarge_arrays” flag did not fix the problem. I’m going to ask the authors of HPL for further help.

But please let me confirm one thing: is this due to my own environment, or is it an issue with the combination of the PGI compiler and HPL-2.0?

tmishima

Hi tmishima,

While I don’t know what the cause is, my best guess is that HPL is using the MPI-1 “GetAddress” function (32-bit pointers) rather than the MPI-2 “GetAddress64” function (64-bit pointers), or that you are encountering some other type of 32-bit integer overflow error.
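
As an illustration of the overflow possibility, the sketch below shows how an element or byte offset into a column-major matrix can exceed the range of a signed 32-bit integer once the local matrix gets large. The function names are hypothetical and nothing here is taken from the HPL source.

```c
#include <assert.h>
#include <stdint.h>

/* Offset (in elements) of entry (i, j) in a column-major matrix with
 * leading dimension lda, computed in 64-bit arithmetic so the product
 * j * lda cannot wrap. */
int64_t elem_offset64(int32_t i, int32_t j, int32_t lda)
{
    assert(i >= 0 && j >= 0 && lda > 0);
    return (int64_t)i + (int64_t)j * (int64_t)lda;
}

/* The same offset in bytes for a matrix of doubles. */
int64_t byte_offset64(int32_t i, int32_t j, int32_t lda)
{
    return elem_offset64(i, j, lda) * (int64_t)sizeof(double);
}

/* Returns 1 if the element offset would still fit in a 32-bit int. */
int elem_offset_fits_int32(int32_t i, int32_t j, int32_t lda)
{
    return elem_offset64(i, j, lda) <= (int64_t)INT32_MAX;
}

/* Returns 1 if the byte offset would still fit in a 32-bit int. */
int byte_offset_fits_int32(int32_t i, int32_t j, int32_t lda)
{
    return byte_offset64(i, j, lda) <= (int64_t)INT32_MAX;
}
```

For example, with lda = 40000 the last element of a 40000 x 40000 local matrix has an element offset that still fits in 32 bits, but its byte offset does not. Code that computes such offsets in a plain int would go wrong at exactly the sizes where the segfault appears.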

- Mat

Hi, Mat

Thank you for your suggestion.

In the end, I gave up on the PGI C compiler and changed Make.arch to use gcc.
Now everything works fine, with no segfault even for larger matrices.
What’s the difference between gcc and pgcc?

The main part of the modified Make.arch is as follows:

ARCH = Linux_ompi

MPdir = /home/mishima/app/openmpi-pgi
MPinc = -I$(MPdir)/include
MPlib =

# shared library does not work well...

LAdir = /data/app/acml4.3.0/pgi64_mp/lib
LAinc =
LAlib = $(LAdir)/libacml_mp.a

HPL_OPTS = -m64

CC = gcc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -O3 -fomit-frame-pointer

# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.

LINKER = mpif90
LINKFLAGS = -Mnomain -mp

Regards,
tmishima

Hi, Mat

I found the cause.

-Mvect=sse, which is included in -fastsse, causes the segfault in HPL_dlaswp00N.c with larger matrices that exceed 2GB.

I think this is an optimization issue in the PGI C compiler.
Could you check it?

Regards,
tmishima

P.S.: HPL_dlaswp00N.c is in /hpl/src/pauxil
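
For anyone trying to reproduce this outside HPL, a reduced test case along the following lines might help narrow it down. It is only a simplified sketch of a row interchange on a column-major matrix, loosely modeled on what a dlaswp-style routine does. It is not the code in HPL_dlaswp00N.c, and the function name and signature are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Swap rows r1 and r2 across n columns of a column-major matrix a
 * with leading dimension lda. All indices are size_t, so the column
 * offset j * lda is computed in 64 bits on x86-64 and does not wrap
 * for matrices larger than 2GB. */
void swap_rows(double *a, size_t lda, size_t n, size_t r1, size_t r2)
{
    size_t j;

    assert(a != NULL && r1 < lda && r2 < lda);
    for (j = 0; j < n; ++j) {
        double *col = a + j * lda;   /* start of column j */
        double tmp = col[r1];
        col[r1] = col[r2];
        col[r2] = tmp;
    }
}
```

Calling this on a matrix larger than 2GB, in builds with and without -Mvect=sse, could indicate whether the vectorized loop is where the bad address is generated.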