Building MVAPICH2 with PGI 2010

(This might not be the right forum for this, but it seemed best to me. If not, please move.)

I was wondering if anyone at PGI or in the Forum Community had a “best” build strategy for MVAPICH2 1.5 with PGI 2010, a la Open MPI, say:

http://www.pgroup.com/resources/openmpi/openmpi141_pgi2010.htm

I’m just building for one node, so I’m looking mainly at the MPI part of MVAPICH2 without needing all of the InfiniBand/OFED/etc. I know it’s odd to use MVAPICH2 on a single node, but it’s what’s used on the large clusters here, so I thought it’d be nice to troubleshoot building my code in a PGI+MVAPICH2 environment in a more controlled setting (and I assume the OFED setup is done correctly on the larger clusters, which are maintained by specialists).

I’m also looking for optimization flags that have worked. Is -fast viable, or do you need to dial it down?

Hi,

I would look at:

http://www.pgroup.com/resources/mpich/mpich2_121_pgi2010.htm

An example configuration for MVAPICH2 would be:

env CC=pgcc FC=pgfortran F77=pgfortran CXX=pgcpp CFLAGS=-fast FCFLAGS=-fast FFLAGS=-fast \
    CXXFLAGS=-fast ./configure --with-device=ch3:sock --prefix=/usr/local/mpich2 >& configure.log
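
Once make and make install finish, a quick sanity check (the -show option is standard MPICH2 wrapper behavior, so it should apply to MVAPICH2 as well) is to ask the compiler wrapper what it will actually invoke:

% /usr/local/mpich2/bin/mpicc -show

It should print a pgcc command line with the MVAPICH2 include and library paths, rather than gcc.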


Good luck.
Hongyon

Hongyon,

This seems to be a working configure line, and it seems to make correctly as well. Thanks.

However, when I try to run even a hello-world example (hellow.c) using mpirun_rsh, it crashes. It does run if I do mpdboot and mpiexec, which I’m guessing means MPICH2 built correctly, but I can’t get mpirun_rsh to do anything but crash, and I’d like to use mpirun_rsh if possible.

Have you encountered this with MVAPICH2?

Matt

mpirun_rsh requires -hostfile host_file_name

You will need to put the name of the current node in the file host_file_name.

Try this:

%hostname
myhost

From the example above, put myhost in the host_file_name file.
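
For example, you can create the host file in one step (assuming hostname prints the name you want):

% hostname > host_file_name
% cat host_file_name
myhost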

Also test whether ssh/rsh works correctly with the following commands.
% ssh myhost

OR/AND

%rsh myhost

If the rsh/ssh command does not work, you will need an IT person to help you.
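
If ssh prompts for a password, mpirun_rsh will have trouble as well. One common way to set up passwordless ssh to the local host (just a sketch; your site’s policy may differ) is:

% ssh-keygen -t rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
% chmod 600 ~/.ssh/authorized_keys

After that, ssh myhost should log you in without prompting.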

Then you can test a simple script to see if mpirun_rsh works:
% cat > testme.sh
echo "Hello"
^C
% chmod +x testme.sh

% mpirun_rsh -np 2 -hostfile host_file_name ./testme.sh
Hello
Hello

If that works, then check your hello program.

If not: what OS did you build your MPICH on, which version of PGI, and is it 32-bit or 64-bit? What exactly is in your hello program?

Are you trying to run on more than one node, or on an OS other than the one you built on? You should first make sure that it runs on the system you built on.

Hongyon

Following your steps:

> ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./testme.sh
Hello
Hello

So, that’s good. Now then, let’s try the MVAPICH2 hellow.c file:

> cat hellow.c 
#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    int rank;
    int size;
    
    MPI_Init( 0, 0 );
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf( "Hello world from process %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}

Simple enough program. Now I try to run it:

> ~/mvapich2/bin/mpicc -o hellow hellow.c
> ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 1) terminated unexpectedly on hostname
Exit code -5 signaled from hostname

where ‘hostname’ stands in for the actual hostname of the computer, which I’ve munged for safety’s sake.

And yet:

> ~/mvapich2/bin/mpdboot
> ~/mvapich2/bin/mpirun -np 2 ./hellow
Hello world from process 0 of 2
Hello world from process 1 of 2
> ~/mvapich2/bin/mpdallexit

This seems to indicate that MPICH2 built correctly inside MVAPICH2 (and I’ve tested both ch3:sock and ch3:nemesis). But, for some reason, it just won’t run under mpirun_rsh. (Note: there are no build errors that I can detect in the make log.)

Hi,

In hellow.c, MPI_Init should be called as in one of the following examples.

int main( int argc, char *argv[] )
{
    int rank;
    int size;

    MPI_Init( &argc, &argv );

OR

int main( int argc, char **argv )
{
    int rank;
    int size;

    MPI_Init( &argc, &argv );


The failure probably comes from passing 0 instead of the address of argc.
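
For reference, the prototype is:

int MPI_Init( int *argc, char ***argv );

so the call site needs the addresses of main’s argc and argv, which is why MPI_Init(&argc, &argv) works.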

Hongyon

Hadn’t noticed that. Yet, fixing it does not seem to help:

> cat hellow.c 
#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    int rank;
    int size;
    
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf( "Hello world from process %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}
> ~/mvapich2/bin/mpicc -o hellow hellow.c
> ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 1) terminated unexpectedly on hostname
Exit code -5 signaled from hostname

I also tried the other version, with no change.

Matt

NB: I have also checked to make sure the MVAPICH2 lib directory is first in LD_LIBRARY_PATH. I saw that could be an issue, but no joy.

What is your OS? Please try with ch3:sock, as the documentation says nemesis is a work in progress.
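
It may also be worth checking which MPI library the executable resolves to at run time, for example (if MVAPICH2 was built with shared libraries):

% ldd ./hellow | grep mpi

to make sure it is not picking up a different MPI installation elsewhere on the system.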

Hongyon

Can you also try a full path to mpirun_rsh instead of ~/…/mpirun_rsh?
It could be a bug in mpirun_rsh.

Hongyon

First, I’m running on RHEL 5.5. Second, I recompiled MVAPICH2 from clean sources using the ch3:sock device. To wit:

> ls -l /home/username/mvapich2
lrwxrwxrwx 1 username users 13 Jul 28 08:45 /home/username/mvapich2 -> mvapich2-sock/
> /home/username/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name /home/username/MPIExamples/hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_0]: [cli_1]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 0) terminated unexpectedly on hostname
Exit code -5 signaled from hostname

where username and hostname have been munged. Even if I avoid the symlink and use

/home/username/mvapich2-sock/bin/mpirun_rsh

I still get this error.

Although, weirdly, with ch3:sock there seem to be twice as many errors as with ch3:nemesis. That’s different; not good, but different.

I just remembered something. To get mpirun_rsh even this far, I had to set MALLOC_CHECK_=0. If I don’t:

> env | grep MALL
> ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
*** glibc detected *** ./hellow: double free or corruption (fasttop): 0x0000000006f2a090 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3984a7230f]
/lib64/libc.so.6(cfree+0x4b)[0x3984a7276b]
./hellow[0x429ddd]
======= Memory map: ========
00400000-0045d000 r-xp 00000000 08:02 27166098                           /home/username/MPIExamples/hellow
0065c000-0069e000 rwxp 0005c000 08:02 27166098                           /home/username/MPIExamples/hellow
0069e000-006aa000 rwxp 0069e000 00:00 0 
06f27000-06f48000 rwxp 06f27000 00:00 0                                  [heap]
3984600000-398461c000 r-xp 00000000 08:01 1184266                        /lib64/ld-2.5.so
398481b000-398481c000 r-xp 0001b000 08:01 1184266                        /lib64/ld-2.5.so
398481c000-398481d000 rwxp 0001c000 08:01 1184266                        /lib64/ld-2.5.so
3984a00000-3984b4e000 r-xp 00000000 08:01 1184267                        /lib64/libc-2.5.so
3984b4e000-3984d4d000 ---p 0014e000 08:01 1184267                        /lib64/libc-2.5.so
3984d4d000-3984d51000 r-xp 0014d000 08:01 1184267                        /lib64/libc-2.5.so
3984d51000-3984d52000 rwxp 00151000 08:01 1184267                        /lib64/libc-2.5.so
3984d52000-3984d57000 rwxp 3984d52000 00:00 0 
3984e00000-3984e82000 r-xp 00000000 08:01 1184296                        /lib64/libm-2.5.so
3984e82000-3985081000 ---p 00082000 08:01 1184296                        /lib64/libm-2.5.so
3985081000-3985082000 r-xp 00081000 08:01 1184296                        /lib64/libm-2.5.so
3985082000-3985083000 rwxp 00082000 08:01 1184296                        /lib64/libm-2.5.so
3985600000-3985616000 r-xp 00000000 08:01 1184275                        /lib64/libpthread-2.5.so
3985616000-3985815000 ---p 00016000 08:01 1184275                        /lib64/libpthread-2.5.so
3985815000-3985816000 r-xp 00015000 08:01 1184275                        /lib64/libpthread-2.5.so
3985816000-3985817000 rwxp 00016000 08:01 1184275                        /lib64/libpthread-2.5.so
3985817000-398581b000 rwxp 3985817000 00:00 0 
3985e00000-3985e07000 r-xp 00000000 08:01 1184276                        /lib64/librt-2.5.so
3985e07000-3986007000 ---p 00007000 08:01 1184276                        /lib64/librt-2.5.so
3986007000-3986008000 r-xp 00007000 08:01 1184276                        /lib64/librt-2.5.so
3986008000-3986009000 rwxp 00008000 08:01 1184276                        /lib64/librt-2.5.so
398fc00000-398fc11000 r-xp 00000000 08:01 1184309                        /lib64/libresolv-2.5.so
398fc11000-398fe11000 ---p 00011000 08:01 1184309                        /lib64/libresolv-2.5.so
398fe11000-398fe12000 r-xp 00011000 08:01 1184309                        /lib64/libresolv-2.5.so
398fe12000-398fe13000 rwxp 00012000 08:01 1184309                        /lib64/libresolv-2.5.so
398fe13000-398fe15000 rwxp 398fe13000 00:00 0 
3995800000-399580d000 r-xp 00000000 08:01 1184319                        /lib64/libgcc_s-4.1.2-20080825.so.1
399580d000-3995a0d000 ---p 0000d000 08:01 1184319                        /lib64/libgcc_s-4.1.2-20080825.so.1
3995a0d000-3995a0e000 rwxp 0000d000 08:01 1184319                        /lib64/libgcc_s-4.1.2-20080825.so.1
2aacf1030000-2aacf1032000 rwxp 2aacf1030000 00:00 0 
2aacf104e000-2aacf1050000 rwxp 2aacf104e000 00:00 0 
2aacf1050000-2aacf105a000 r-xp 00000000 08:01 1184248                    /lib64/libnss_files-2.5.so
2aacf105a000-2aacf1259000 ---p 0000a000 08:01 1184248                    /lib64/libnss_files-2.5.so
2aacf1259000-2aacf125a000 r-xp 00009000 08:01 1184248                    /lib64/libnss_files-2.5.so
2aacf125a000-2aacf125b000 rwxp 0000a000 08:01 1184248                    /lib64/libnss_files-2.5.so
2aacf125b000-2aacf125f000 r-xp 00000000 08:01 1184246                    /lib64/libnss_dns-2.5.so
2aacf125f000-2aacf145e000 ---p 00004000 08:01 1184246                    /lib64/libnss_dns-2.5.so
2aacf145e000-2aacf145f000 r-xp 00003000 08:01 1184246                    /lib64/libnss_dns-2.5.so
2aacf145f000-2aacf1460000 rwxp 00004000 08:01 1184246                    /lib64/libnss_dns-2.5.so
7fffeb4ce000-7fffeb4e3000 rwxp 7ffffffea000 00:00 0                      [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]
MPI process (rank: 0) terminated unexpectedly on hostname
Exit code -5 signaled from hostname

Now, experimenting with MALLOC_CHECK_. First, equal to zero:

> MALLOC_CHECK_=0 ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_1]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_0]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 1) terminated unexpectedly on hostname
Exit code -5 signaled from hostname

and if MALLOC_CHECK_=1:

> MALLOC_CHECK_=1 ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
*** glibc detected *** ./hellow: free(): invalid pointer: 0x000000001db02180 ***
*** glibc detected *** ./hellow: free(): invalid pointer: 0x000000001db02180 ***
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_0]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
*** glibc detected *** ./hellow: free(): invalid pointer: 0x0000000006103180 ***
MPI process (rank: 0) terminated unexpectedly on hostname
*** glibc detected *** ./hellow: free(): invalid pointer: 0x0000000006103180 ***
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_1]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Exit code -5 signaled from hostname
malloc: using debugging hooks

Finally, for your edification:

> ~/mvapich2/bin/mpirun_rsh -show -np 2 -hostfile host_file_name ./hellow

/bin/bash -c cd /home/username/MPIExamples; /usr/bin/env LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/home/username/mvapich2/lib:/home/username/lib:/opt/pgi/linux86-64/2010/cuda/lib:/opt/pgi/linux86-64/2010/cuda/open64/lib:/opt/pgi/linux86-64/2010/lib:/opt/pgi/linux86-64/2010/libso:/opt/cuda/lib64: MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=hostname MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=52278 MPISPAWN_MPIRUN_PORT=52278 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=6397 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_332_hostname_6397 MPISPAWN_LOCAL_NPROCS=2 MPISPAWN_ARGV_0=./hellow MPISPAWN_GENERIC_ENV_COUNT=1  MPISPAWN_GENERIC_NAME_0=MV2_XRC_FILE MPISPAWN_GENERIC_VALUE_0=mv2_xrc_226_hostname_6397 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/username/MPIExamples MPISPAWN_MPIRUN_RANK_0=0 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 MPISPAWN_MPIRUN_RANK_1=1 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1  /home/username/mvapich2-sock/bin/mpispawn 0

We will get a RHEL 5.5 install here ASAP. We have tested with RHEL 5.3 and had no problems.


Hongyon

Hi,

We tried with RHEL 5.5 and still see no problem with 64-bit 10.6. Which version of PGI do you use: 32-bit or 64-bit, 10.5 or 10.6? I guess we need to start from the beginning.

These are exactly the steps I follow in csh:

  1. Install PGI in my home directory (anywhere should be fine).

  2. setenv PGI /home/my_home_dir/pgi

  3. setenv PATH /home/my_home_dir/pgi/linux86-64/2010/bin:$PATH

  4. cd to_mvapich2_dir

  5. env CC=pgcc FC=pgfortran F77=pgfortran CXX=pgcpp CFLAGS=-fast FCFLAGS=-fast FFLAGS=-fast \
     CXXFLAGS=-fast ./configure --prefix=/home/my_home_dir/mvapich/mympich2 --with-device=ch3:sock >& configure.log

  6. make

  7. make install

  8. cd ~/mytest_dir

  9. Check the hostme file:

rhel55% more hostme
rhel55
rhel55% hostname
rhel55

  10. Compile and run:

rhel55% /home/hongyon/mvapich/mympich2/bin/mpicc hello_mpi.c
rhel55% /home/hongyon/mvapich/mympich2/bin/mpirun_rsh -np 2 -hostfile hostme ./a.out
Hello world from process 0 of 2
Hello world from process 1 of 2

Please try those steps. If there is still a problem, then I am really at my wit’s end.

You might want to ask the MPICH2 folks. Also check out:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html#x1-580009.3.4

Not sure if this is related to the problem.

9.3.4 Creation of CQ or QP failure

A possible reason could be inability to pin the memory required. Make sure the following steps are taken.

  1. In /etc/security/limits.conf, add the following:
     * soft memlock phys_mem_in_KB
  2. After this, add the following to /etc/init.d/sshd:
     ulimit -l phys_mem_in_KB
  3. Restart sshd

With some distros, we’ve found that adding the ulimit -l line to the sshd init script is no longer necessary. For instance, the following steps work for our rhel5 systems.

  1. Add the following lines to /etc/security/limits.conf:
     * soft memlock unlimited
     * hard memlock unlimited
  2. Restart sshd
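
After restarting sshd and logging back in, you can check whether the new limit took effect, for example:

% ulimit -l

in an sh/bash shell; it should report the new memlock value (e.g. unlimited). In csh, running limit with no arguments shows the corresponding memorylocked value.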

Hongyon

Can you please try with -O2 instead of -fast?

Hongyon

I tried it with no flags at all and still no joy, even following your example exactly. Likewise for unlimiting locked memory: still no luck.

Looks like I’ll need to move to the MVAPICH mailing list for help with this. If I ever solve this issue, I’ll add a reply here. Sooner or later, I’ll hopefully be back asking questions on linking to MVAPICH2. That’ll be nice.

Thank you. If we encounter the problem or have any clues, we will let you know.

Hongyon

Hello, just checking to see if you found a solution to this error. We are experiencing the same thing on SLES 11 SP1.

I haven’t had any problems building recent MVAPICH2 releases. I last tried 1.7a2 and had no trouble.

If I could ask for some info: Are you using the latest PGI and MVAPICH? What device are you targeting? How did you configure?