Community Edition problem with simple mpi program

Hello,

After compiling a simple MPI Fortran program (calculate pi) with mpif77, I get an error when I run it:

-bash-4.1$ mpirun -np 4 ./pi
request to allocate mask for invalid number; abort
: Success

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

request to allocate mask for invalid number; abort
: Success
request to allocate mask for invalid number; abort
: Success

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[31357,1],0]
Exit code: 1

This error is specific to this particular machine. The same compiled program runs on at least one other machine.

This is running on Scientific Linux 6 (an RHEL 6 clone).

Any help would be greatly appreciated.

If the program was this

program main
use mpi
double precision  PI25DT
parameter        (PI25DT = 3.141592653589793238462643d0)
double precision  mypi, pi, h, sum, x, f, a
integer n, myid, numprocs, i, ierr
!                                function to integrate
f(a) = 4.d0 / (1.d0 + a*a)

call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

do
   if (myid .eq. 0) then
      print *, 'Enter the number of intervals: (0 quits) '
      read(*,*) n
   endif
!                                broadcast n
   call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
!                                check for quit signal
   if (n .le. 0) exit
!                                calculate the interval size
   h = 1.0d0/n
   sum = 0.0d0
   do i = myid+1, n, numprocs
      x = h * (dble(i) - 0.5d0)
      sum = sum + f(x)
   enddo
   mypi = h * sum
!                                collect all the partial sums
   call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, &
                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)
!                                node 0 prints the answer.
   if (myid .eq. 0) then
      print *, 'pi is ', pi, ' Error is', abs(pi - PI25DT)
   endif
enddo
call MPI_FINALIZE(ierr)
end

I was able to compile with mpif90 (only use pgf77/mpif77 when
it is necessary to use F77 features that are not in F90):

mpif90 -o my_pi my_pi.f -Mfree

I then ran it with

% mpirun -np 4 my_pi

Enter the number of intervals: (0 quits)
10
pi is 3.142425985001098 Error is 8.3333141130470523E-004
Enter the number of intervals: (0 quits)
100
pi is 3.141600986923125 Error is 8.3333333318336145E-006
Enter the number of intervals: (0 quits)
1000
pi is 3.141592736923126 Error is 8.3333333122936892E-008
Enter the number of intervals: (0 quits)
10000
pi is 3.141592654423124 Error is 8.3333073774838340E-010
Enter the number of intervals: (0 quits)
1000000
pi is 3.141592653589903 Error is 1.1013412404281553E-013
Enter the number of intervals: (0 quits)
2000000
pi is 3.141592653589759 Error is 3.4194869158454821E-014
Enter the number of intervals: (0 quits)
0

The program is similar; I made some minor changes a few years ago. As I mentioned, this fails on one particular machine. It runs OK on two others that I tried.

The error -

request to allocate mask for invalid number; abort

comes from libnuma.so.

Not sure what the problem is.

If you

ldd ./pi

on each of the platforms in your machines list, you may find
that libnuma.so is different on one platform than another.
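A quick way to run that comparison, assuming the binary is named ./pi as in the post (the fallback message is just for readability when no libnuma line appears):

```shell
# Run the same check on the working machine and on the failing machine,
# then compare the two outputs. "./pi" is the compiled binary from the post.
ldd ./pi 2>/dev/null | grep -i numa || echo "no libnuma entry found"
```

If the two machines print different libnuma paths (or one prints nothing at all), that difference is the place to start.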

If you compile with

-mp

for OpenMP (not MPI), try compiling again with

-mp=nonuma

so that libnuma is not an issue.

dave

Thanks for this.

I recompiled with -mp and with -mp=nonuma. As expected, -mp failed while -mp=nonuma succeeded. The issue is definitely NUMA, and I have some work to do.

The question on the failing platform is:
“is libnuma.so the same on the failing platform as on
the compiling platform?”

So look at

ldd ./pi

and see whether
libnuma.so is a symlink to libnuma.so.1.

If not, it could be that libnuma.so does not exist but libnuma.so.1 does.
In that case, the best fix is to soft link libnuma.so to libnuma.so.1.

If libnuma.so.1 does not exist, PGI provides a dummy version.
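A minimal sketch of the soft-link fix, demonstrated in a scratch directory. On the real machine you would create the link where libnuma.so.1 actually lives (typically /usr/lib64 on 64-bit RHEL 6), most likely as root:

```shell
# Scratch-directory demonstration; libnuma.so.1 here is a stand-in file,
# not the real library.
libdir=$(mktemp -d)
touch "$libdir/libnuma.so.1"             # stands in for the installed library
ln -s libnuma.so.1 "$libdir/libnuma.so"  # the soft link the loader will follow
ls -l "$libdir/libnuma.so"
```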

When compiling with -mp on one platform to run on another platform,
the libnuma.so situation needs to be the same on both.

You could install the PGI compilers on one platform as a “network
install”, and then add the failing platform as a new PGI compiler
host (run add_network_host and the machine is added).

The network installs handle the libnuma.so differences by creating
a local directory of the same name on each platform. For example,
"/local/username/shared_objects" would hold the correct disposition
of libnuma.so (a symlink to libnuma.so.1, or a dummy version of
libnuma.so.1). Every platform then reconciles libnuma.so correctly
at runtime.
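The per-host reconciliation described above can be sketched like this. A scratch directory stands in for the shared_objects directory, and the real path and dummy-library mechanics belong to the PGI installer, so treat this as an illustration only:

```shell
# Sketch: on each host, make libnuma.so resolve to whatever that host has.
so_dir=$(mktemp -d)                 # stands in for the local shared_objects dir
sys_numa=/usr/lib64/libnuma.so.1    # typical 64-bit RHEL 6 location
if [ -e "$sys_numa" ]; then
    ln -s "$sys_numa" "$so_dir/libnuma.so"   # point at the real library
else
    touch "$so_dir/libnuma.so"               # dummy stand-in for hosts without it
fi
ls -l "$so_dir/libnuma.so"
```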

dave