PGI 11.8 and OpenMPI 1.4.3

franzisko · November 9, 2011, 3:45pm

Hello,

I compiled OpenMPI 1.4.3 with PGI 11.8 using several sets of options (including the options suggested on the PGI website). I get a segfault when launching a hello world test program with any number of processors, although sometimes the run completes. It happens on Intel Westmere, while everything seems fine on AMD Barcelona cores.

Please help, because this is the only way for me to use CUDA Fortran!

thanks, Francesco

MatColgrove · November 9, 2011, 4:01pm

Hi Francesco,

Which system did you compile it on? Is the error actually an illegal instruction (sig 4)?

Can you run your program in the PGI debugger, pgdbg, and determine where the error occurs?

Mat

franzisko · November 9, 2011, 6:01pm

Hi Mat,

thanks for the fast reply.

The error is:

Signal: Segmentation fault (11). Signal code: Address not mapped (1). Failing at address (nil)

The operating system is Scientific Linux SL release 5.7 (the same as on the other node, which works fine).

Launching the debugger like this (if that is the correct way)

mpirun -np 4 pgdbg a.out

and starting the four processes gives:

Signalled SIGSEGV at 0x2B694A82B6EF, function ___vsnprintf_chk, file interp.c line 1217

It may be occurring in libnuma.so.1.

thanks,
Francesco

MatColgrove · November 9, 2011, 9:23pm

Hi Francesco,

Did you build OpenMPI on the Westmere system? If not, give that a try and see if there is some issue between the two systems.

I did have a similar issue with OpenMPI due to inconsistent compiler versions. The OpenMPI library I was using was built with PGI 10.9 while the application was built with 11.8. The differing runtime library version caused all applications to segv. Could a similar issue be occurring here? Run the “ldd” command on OpenMPI’s mpiexec and your test program. Are the runtime libraries consistent?
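
As a rough sketch of that check, the shell function below prints only the PGI runtime libraries that a binary resolves to, so the lists for mpiexec and the test program can be compared side by side. The "libpg" name pattern and the ./a.out path are assumptions, not details taken from a particular installation.

```shell
#!/bin/sh
# Print "name path" for every PGI runtime library a binary loads.
# Assumption: PGI runtime libraries have names starting with "libpg".
pgi_runtime_libs() {
    # ldd lines look like: "libname.so => /path/to/libname.so (0x...)"
    ldd "$1" 2>/dev/null | awk '/libpg/ {print $1, $3}' | sort
}

# Compare the OpenMPI launcher with the test program (paths are examples).
mpiexec_path=$(command -v mpiexec || true)
for bin in "$mpiexec_path" ./a.out; do
    if [ -x "$bin" ] && [ ! -d "$bin" ]; then
        echo "== $bin"
        pgi_runtime_libs "$bin"
    fi
done
```

If the two lists point into different PGI release directories, the OpenMPI build and the application are using different runtimes.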

Mat

franzisko · November 10, 2011, 12:09pm

Hello Mat,

Unfortunately, the segfault persists even when compiling on the same platform. Everything (or at least the hello world example) seems fine with PGI 11.3, and also when compiling with PGI 11.8 and then artificially using the dynamic libraries from the 11.3 release.

thanks, Francesco

MatColgrove · November 10, 2011, 11:33pm

Hi Francesco,

Do you mind digging into this a bit more to see if you can isolate where the segv occurs, and printing out the call stack? This might give us a few more clues.

Mat

franzisko · November 11, 2011, 2:17pm

Hi Mat,

I discovered that I get the segfault even when running plain ompi_info. I am sending you an image of the result of this run under a debugger.

[Attached image: debugger output from the ompi_info run]

thanks for any help
Francesco

MatColgrove · November 11, 2011, 10:22pm

Hi Francesco,

It looks like OpenMPI is seg faulting when trying to dynamically open a library. Since it’s being called from “opal_maffinity_base_open”, it’s most likely trying to open libnuma.so.

In looking through OpenMPI’s configure options, it looks like adding “--enable-mca-no-build=maffinity,btl-portals” will disable affinity and might work around the error.

Mat

franzisko · November 15, 2011, 9:40am

Hi Mat,

There is something I do not understand. Even when compiling with the option you suggested, the segfault still occurs, and ompi_info | grep numa suggests that libnuma is still used. I do not know why. I have also posted on the OpenMPI forum to get help from there.

thanks for your help
Francesco

franzisko · November 22, 2011, 2:33pm

Hi Mat,

It seems PGI 11.8 works with OpenMPI 1.4.4 on our cluster, but only when both the paffinity and maffinity components are disabled. That is not ideal, because affinity options (npersocket…) are used heavily in hybrid programming. I will try PGI 11.10 as soon as possible to see whether anything has changed.

best regards
Francesco

MatColgrove · November 22, 2011, 5:06pm

Hi Francesco,

I’m wondering which libnuma.so library the OpenMPI runtime is loading. Can you look in /usr/lib to see if libnuma is there and, if so, which version? If it’s not there, OpenMPI may be picking up the dummy libnuma that we install on non-NUMA systems.
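
One way to answer that question is to ask the dynamic loader directly. The sketch below assumes that OpenMPI 1.4.x installed its libnuma component as lib/openmpi/mca_maffinity_libnuma.so under the install prefix; OMPI_PREFIX is a placeholder for that prefix, and both should be adjusted to the actual installation.

```shell
#!/bin/sh
# Print the path of the libnuma that a binary or shared object resolves to.
libnuma_path() {
    ldd "$1" 2>/dev/null | awk '/libnuma/ {print $3}'
}

# Assumptions: OMPI_PREFIX is a placeholder for the OpenMPI install prefix, and
# the libnuma component sits at lib/openmpi/mca_maffinity_libnuma.so below it.
OMPI_PREFIX=${OMPI_PREFIX:-/usr/local}
component="$OMPI_PREFIX/lib/openmpi/mca_maffinity_libnuma.so"

if [ -e "$component" ]; then
    lib=$(libnuma_path "$component")
    echo "libnuma resolved to: ${lib:-<not found>}"
    # Follow symlinks to see the real file behind the resolved path.
    if [ -n "$lib" ]; then
        ls -lL "$lib"
    fi
else
    echo "component not found: $component"
fi
```

A path under /usr/lib or /usr/lib64 means the system library is in use; a path inside the PGI installation tree would point to the dummy library instead.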

Mat

franzisko · November 22, 2011, 7:03pm

Hi Mat,

There is a libnuma.so pointing to libnuma.so.1 in both /usr/lib and /usr/lib64. When configuring OpenMPI I specify --with-libnuma=/usr/. If I change this directory, the configure script fails.

Typing

rpm -qf /usr/lib64/libnuma.so.1

I get

numactl-0.9.8-12.el5_6.x86_64

Typing

yum info numactl.x86_64

I get

Name       : numactl
Arch       : x86_64
Version    : 0.9.8
Release    : 12.el5_6
Size       : 96 k
Repo       : installed
Summary    : library for improving the performance of machines with Non Uniform Memory Access.
URL        : ftp://ftp.suse.com/pub/people/ak/numa/
License    : LGPL/GPL
Description: Simple NUMA policy support. It consists of a numactl program that runs
           : other programs with a specific NUMA policy. It contains libnuma, used for
           : allocations in applications through a NUMA policy.

We are very grateful for all your help.

Francesco

franzisko · November 28, 2011, 5:03pm

Hi Mat,

To use maffinity without the segfault, I found it is possible to select a policy other than libnuma, namely first_use. I am not aware of any performance issues, and it seems a good way to avoid the segfault with maffinity.

mpirun --mca maffinity first_use

works fine for us. I put the option in the global OpenMPI MCA configuration file so that I do not have to specify it every time.
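
A minimal sketch of that setup follows. It assumes the usual OpenMPI locations for MCA parameter files: etc/openmpi-mca-params.conf under the install prefix for all users, and $HOME/.openmpi/mca-params.conf for a single user. The helper name set_mca_default and the OMPI_PREFIX variable are made up for illustration.

```shell
#!/bin/sh
# Add "name = value" to an MCA parameter file unless the parameter is already set.
# set_mca_default is a hypothetical helper, not part of OpenMPI.
set_mca_default() {
    file=$1
    name=$2
    value=$3
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # Skip the append when an uncommented line for this parameter exists.
    if ! grep -q "^[[:space:]]*$name[[:space:]]*=" "$file"; then
        printf '%s = %s\n' "$name" "$value" >> "$file"
    fi
}

# Example calls (OMPI_PREFIX is a placeholder for the install prefix):
#   set_mca_default "$OMPI_PREFIX/etc/openmpi-mca-params.conf" maffinity first_use
#   set_mca_default "$HOME/.openmpi/mca-params.conf" maffinity first_use
```

The same default can also be set for a single shell session by exporting OMPI_MCA_maffinity=first_use before calling mpirun.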

thanks
Francesco

David_Gunter · April 24, 2012, 2:52pm

We are hitting this exact problem with PGI 11.10, 12.1 and 12.4. We can either build a poor-performing OMPI or one that refuses to work at all.

We’ve reached the point where we can no longer offer PGI to our customer base, and we will most likely not renew our licenses this year.