I compiled OpenMPI 1.4.3 with different options (including those suggested on the PGI website) using PGI 11.8. I get a segfault when launching a hello-world test program with any number of processes; sometimes the run succeeds. It happens on Intel Westmere, while everything seems fine on AMD Barcelona cores.
Please help; this is the only way for me to use CUDA Fortran!
Which system did you compile it on? Is the error actually an illegal instruction (signal 4)?
Can you run your program in the PGI debugger, pgdbg, and determine where the error occurs?
Thanks for the fast reply.
The error is:
Signal: Segmentation fault (11). Signal code: Address not mapped (1). Failing at address (nil)
The operating system is Scientific Linux SL release 5.7 (the same as on the other node, which works fine).
Launching (if this is the correct way):
mpirun -np 4 pgdbg a.out
and starting the four processes, it gives:
Signalled SIGSEGV at 0x2B694A82B6EF, function ___vsnprintf_chk, file interp.c line 1217
Maybe it occurs in libnuma.so.1.
Did you build OpenMPI on the Westmere system? If not, give that a try and see if there is some issue between the two systems.
I did have a similar issue with OpenMPI due to inconsistent compiler versions. The OpenMPI library I was using was built with PGI 10.9 while the application was built with 11.8. The differing runtime library version caused all applications to segv. Could a similar issue be occurring here? Run the “ldd” command on OpenMPI’s mpiexec and your test program. Are the runtime libraries consistent?
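A quick consistency check along these lines (the binary names are examples; adjust them to your build) would be:

```shell
# Compare the PGI runtime libraries resolved for OpenMPI's launcher
# and for the test program; mismatched versions are a red flag.
ldd $(which mpiexec) | grep -i pgi
ldd ./a.out | grep -i pgi
```

If the two commands resolve different library paths or versions, the mixed-runtime scenario described above is the likely culprit.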
Unfortunately, even when compiling on the same platform, the segfault is still there. Everything (or at least the hello-world example) seems fine using PGI 11.3, and even when compiling with PGI 11.8 and then artificially using the dynamic libraries from the 11.3 release.
Do you mind digging into this a bit more to see if you can isolate where the segv occurs, as well as print out the call stack? That might give us a few more clues.
I discovered that I get the segfault even when running the simple ompi_info. I am attaching an image of the result of this run under a debugger.
Thanks for any help.
It looks like OpenMPI is seg faulting when trying to dynamically open a library. Since it’s being called from “opal_maffinity_base_open”, it’s most likely trying to open libnuma.so.
In looking through OpenMPI’s configure options, it looks like adding “--enable-mca-no-build=maffinity,btl-portals” will disable affinity and might work around the error.
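For instance, a rebuild with that option might look like this (the prefix and compiler driver names are placeholders; adjust them to your site):

```shell
# Hypothetical OpenMPI 1.4.3 build with the affinity components disabled.
./configure --prefix=$HOME/openmpi-1.4.3-pgi \
    CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 \
    --enable-mca-no-build=maffinity,btl-portals
make all install
```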
There is something I do not understand. Even compiling with the option you suggested, the segfault occurs, and from ompi_info | grep numa it seems that libnuma is still used. I do not know why. I posted on the OpenMPI forum to get help from there, too.
Thanks for your help.
It seems PGI 11.8 works with OpenMPI 1.4.4 on our cluster only when disabling both the paffinity and maffinity components. This is not ideal because of our heavy use of affinity options (npersocket…) in hybrid programming. I will try PGI 11.10 as soon as possible to see if anything has changed.
I’m wondering which libnuma.so library the OpenMPI runtime is loading. Can you look in /usr/lib to see if libnuma is there, and if so, which version? If it’s not there, OpenMPI may be picking up the dummy libnuma that we install on non-NUMA systems.
There is a libnuma.so pointing to libnuma.so.1 in /usr/lib and /usr/lib64. When configuring OpenMPI I specify --with-libnuma=/usr/. If I change this directory, the configure script fails.
rpm -qf /usr/lib64/libnuma.so.1
yum info numactl.x86_64
Name : numactl
Arch : x86_64
Version : 0.9.8
Release : 12.el5_6
Size : 96 k
Repo : installed
Summary : Library to improve performance on Non Uniform Memory Access machines.
URL : ftp://ftp.suse.com/pub/people/ak/numa/
License : LGPL/GPL
Description: Simple NUMA policy support. Consists of a numactl program that runs
           : other programs with a specific NUMA policy. Contains libnuma, used for
           : allocations within applications through a NUMA policy.
We are very thankful for all your help.
To use maffinity while avoiding the segfault, I found it is possible to select a policy other than libnuma, namely first_use. I do not know about any performance impact, but it seems a good way to avoid the segfault with maffinity.
mpirun --mca maffinity first_use
works fine for us. I added the option to the global OpenMPI MCA configuration file so I do not have to specify it every time.
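For reference, persisting the parameter in the system-wide MCA file might look like this ($OMPI_PREFIX is a placeholder for the OpenMPI installation prefix):

```shell
# Persist the setting so every mpirun picks it up without the --mca flag.
echo "maffinity = first_use" >> "$OMPI_PREFIX/etc/openmpi-mca-params.conf"
```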
We are hitting this exact problem with PGI 11.10, 12.1, and 12.4. We can either build a poorly performing OMPI or one that refuses to work at all.
At this point we can no longer offer PGI to our customer base and will most likely not renew our licenses this year.