I compiled OpenMPI 1.4.3 with different options (including those suggested on the PGI website) using PGI 11.8. I get a segfault when launching a hello-world test program with any number of processes; sometimes the run succeeds. It happens on Intel Westmere, while everything seems fine on AMD Barcelona cores.
Please help; this is the only way for me to use CUDA Fortran!
Which system did you compile it on? Is the error actually an illegal instruction (signal 4)?
Can you run your program in the PGI debugger, pgdbg, and determine where the error occurs?
Thanks for the fast reply.
The error is:
Signal: Segmentation fault (11). Signal code: Address not mapped (1). Failing at address (nil)
The operating system is Scientific Linux SL release 5.7 (the same as on the other node, which works fine).
Launching (if this is the correct way):
mpirun -np 4 pgdbg a.out
and starting the four processes, it gives:
Signalled SIGSEGV at 0x2B694A82B6EF, function ___vsnprintf_chk, file interp.c line 1217
Maybe it occurs in libnuma.so.1.
Did you build OpenMPI on the Westmere system? If not, give that a try and see if there is some issue between the two systems.
I did have a similar issue with OpenMPI due to inconsistent compiler versions. The OpenMPI library I was using was built with PGI 10.9 while the application was built with 11.8. The differing runtime library version caused all applications to segv. Could a similar issue be occurring here? Run the “ldd” command on OpenMPI’s mpiexec and your test program. Are the runtime libraries consistent?
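A quick consistency check along these lines (the binary names are examples; adjust them to your build) would be:

```shell
# Compare the PGI runtime libraries resolved for OpenMPI's launcher
# and for the test program; mismatched versions are a red flag.
ldd $(which mpiexec) | grep -i pgi
ldd ./a.out | grep -i pgi
```

If the two commands resolve different library paths or versions, the mixed-runtime scenario described above is the likely culprit.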
Unfortunately, even when compiling on the same platform, the segfault is still there. Everything (or at least the hello-world example) seems fine using PGI 11.3, and even when compiling with PGI 11.8 and then artificially using the dynamic libraries from the 11.3 release.
Do you mind digging into this a bit more to see if you can isolate where the segv occurs, as well as print out the call stack? That might give us a few more clues.
I discovered that I get the segfault even when running the simple ompi_info. I am attaching an image of the result of this run under a debugger.
Thanks for any help.
It looks like OpenMPI is seg faulting when trying to dynamically open a library. Since it’s being called from “opal_maffinity_base_open”, it’s most likely trying to open libnuma.so.
In looking through OpenMPI’s configure options, it looks like adding “--enable-mca-no-build=maffinity,btl-portals” will disable affinity and might work around the error.
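For instance, a rebuild with that option might look like this (the prefix and compiler driver names are placeholders; adjust them to your site):

```shell
# Hypothetical OpenMPI 1.4.3 build with the affinity components disabled.
./configure --prefix=$HOME/openmpi-1.4.3-pgi \
    CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 \
    --enable-mca-no-build=maffinity,btl-portals
make all install
```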
There is something I do not understand. Even compiling with the option you suggested, the segfault occurs, and from ompi_info | grep numa it seems that libnuma is still used. I do not know why. I posted on the OpenMPI forum to get help from there, too.
Thanks for your help.
It seems PGI 11.8 works with OpenMPI 1.4.4 on our cluster only when disabling both the paffinity and maffinity components. This is not ideal because of our heavy use of affinity options (npersocket…) in hybrid programming. I will try PGI 11.10 as soon as possible to see if anything has changed.
I’m wondering which libnuma.so library the OpenMPI runtime is loading. Can you look in /usr/lib to see if libnuma is there, and if so, which version? If it’s not there, OpenMPI may be picking up the dummy libnuma that we install on non-NUMA systems.
There is a libnuma.so pointing to libnuma.so.1 in /usr/lib and /usr/lib64. When configuring OpenMPI I specify --with-libnuma=/usr/. If I change this directory, the configure script fails.
rpm -qf /usr/lib64/libnuma.so.1
yum info numactl.x86_64
Name : numactl
Arch : x86_64
Version : 0.9.8
Release : 12.el5_6
Size : 96 k
Repo : installed
Summary : Library to improve performance on Non Uniform Memory Access machines.
URL : ftp://ftp.suse.com/pub/people/ak/numa/
License : LGPL/GPL
Description: Simple NUMA policy support. Consists of a numactl program that runs
           : other programs with a specific NUMA policy. Contains libnuma, used for
           : allocations within applications through a NUMA policy.
We are very thankful for all your help.
To use maffinity while avoiding the segfault, I found it is possible to select a policy other than libnuma, namely first_use. I do not know about any performance impact, but it seems a good way to avoid the segfault with maffinity.
mpirun --mca maffinity first_use
works fine for us. I added the option to the global OpenMPI MCA configuration file so I do not have to specify it every time.
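For reference, persisting the parameter in the system-wide MCA file might look like this ($OMPI_PREFIX is a placeholder for the OpenMPI installation prefix):

```shell
# Persist the setting so every mpirun picks it up without the --mca flag.
echo "maffinity = first_use" >> "$OMPI_PREFIX/etc/openmpi-mca-params.conf"
```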
We are hitting this exact problem with PGI 11.10, 12.1, and 12.4. We can either build a poorly performing OMPI or one that refuses to work at all.
At this point we can no longer offer PGI to our customer base and will most likely not renew our licenses this year.