(This might not be the right forum for this, but it seemed best to me. If not, please move.)
I was wondering if anyone at PGI or in the Forum Community had a “best” build strategy for MVAPICH2 1.5 with PGI 2010, a la Open MPI, say:
http://www.pgroup.com/resources/openmpi/openmpi141_pgi2010.htm
I’m just building for one node, so I’m looking mainly at the MPI part of MVAPICH2 without needing all the InfiniBand/OFED/etc. I know it’s odd to use MVAPICH2 on a single node, but it’s what’s used on the large clusters here, so I thought it’d be nice to troubleshoot building my code in a PGI+MVAPICH2 environment in a more controlled setting (and I assume the OFED setup is done correctly on the larger clusters maintained by specialists).
I’m also looking for optimizations that have worked. Is -fast viable, or do you need to dial down?
Hi,
I would look at:
http://www.pgroup.com/resources/mpich/mpich2_121_pgi2010.htm
Example configuration for mvapich2 would be:
env CC=pgcc FC=pgfortran F77=pgfortran CXX=pgcpp CFLAGS=-fast FCFLAGS=-fast FFLAGS=-fast \
CXXFLAGS=-fast ./configure --with-device=ch3:sock --prefix=/usr/local/mpich2 >& configure.log
Good luck.
Hongyon
hongyon,
This seems to be a working configure line, and it seems to make correctly as well. Thanks.
However, when I try to run even a hello world example (hellow.c) using mpirun_rsh, it crashes. It does run if I use mpdboot and mpiexec, which I’m guessing means MPICH2 built correctly, but I can’t get mpirun_rsh to do anything but crash, and I’d like to use mpirun_rsh if possible.
Have you encountered this with MVAPICH2?
Matt
mpirun_rsh requires -hostfile host_file_name.
You will need to put the name of the current node in the file host_file_name.
Try this:
% hostname
myhost
From example above, put myhost in host_file_name file.
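In one shot, that setup can be sketched like this (assuming the hostfile name host_file_name used below):

```shell
# Write the current node's name into the hostfile that mpirun_rsh will read
hostname > host_file_name
cat host_file_name
```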
Also test whether ssh/rsh works correctly with the following commands:
% ssh myhost
OR/AND
% rsh myhost
If that rsh/ssh command does not work, then you will need some IT person to help you.
Then you can test just a simple script to see if mpirun_rsh works.
% cat >testme.sh
echo "Hello"
^C
% chmod +x testme.sh
%mpirun_rsh -np 2 -hostfile host_file_name ./testme.sh
Hello
Hello
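For reference, the same test script can be created non-interactively instead of with cat and Ctrl-C (a minimal sketch):

```shell
# Create the one-line test script in one go, make it executable, and run it
printf 'echo "Hello"\n' > testme.sh
chmod +x testme.sh
./testme.sh   # prints: Hello
```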
If that works, then check your hello program.
If not, what is the OS you built your mpich on, which version of PGI, 32-bit or 64-bit? What exactly is in your hello program?
Are you trying to run on more than 1 node, or trying to run on an OS that you didn’t build on? You should make sure that it runs on the system you built on.
Hongyon
Following your instructions:
> ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./testme.sh
Hello
Hello
So, that’s good. Now then, let’s try the MVAPICH2 hellow.c file:
> cat hellow.c
#include <stdio.h>
#include "mpi.h"
int main( int argc, char *argv[] )
{
int rank;
int size;
MPI_Init( 0, 0 );
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
Simple enough program. Now I try to run it:
> ~/mvapich2/bin/mpicc -o hellow hellow.c
> ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 1) terminated unexpectedly on hostname
Exit code -5 signaled from hostname
where ‘hostname’ is the actual hostname of the computer that I’ve munged for safety’s sake.
And yet:
> ~/mvapich2/bin/mpdboot
> ~/mvapich2/bin/mpirun -np 2 ./hellow
Hello world from process 0 of 2
Hello world from process 1 of 2
> ~/mvapich2/bin/mpdallexit
This seems to indicate that MPICH2 built correctly inside MVAPICH2 (and I’ve tested both ch3:sock and ch3:nemesis). But, for some reason, it just doesn’t run mpirun_rsh. (Note: there are no build errors that I can detect in the make log.)
Hi,
In hellow.c, MPI_Init should be called in one of the following ways.
int main(int argc, char* argv[]) {
int rank;
int size;
MPI_Init(&argc, &argv);
OR
int main( int argc, char **argv )
{
int rank;
int size;
MPI_Init( &argc, &argv );
The failure probably comes from passing 0 instead of the address of argc.
Hongyon
Hadn’t noticed that. Yet, fixing it does not seem to help:
> cat hellow.c
#include <stdio.h>
#include "mpi.h"
int main( int argc, char *argv[] )
{
int rank;
int size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
> ~/mvapich2/bin/mpicc -o hellow hellow.c
> ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 1) terminated unexpectedly on hostname
Exit code -5 signaled from hostname
I did also try the other version with no change.
Matt
NB: I have also checked to make sure the MVAPICH2 lib directory is first in LD_LIBRARY_PATH. I saw that could be an issue, but no joy.
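For anyone checking the same thing, here is one way to force and verify that ordering (the install path is an example; adjust to your tree):

```shell
# Prepend the MVAPICH2 lib directory so its libraries win the run-time search
export LD_LIBRARY_PATH="$HOME/mvapich2/lib:${LD_LIBRARY_PATH:-}"
# Confirm it really is first
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | head -n 1
# Then check which MPI library a binary actually resolves at run time, e.g.:
# ldd ./hellow | grep -i mpi
```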
What is your OS? Please try with ch3:sock as it says nemesis is a work in progress.
Hongyon
Can you also try a full path to mpirun_rsh instead of ~/…/mpirun_rsh?
It could be a bug in mpirun_rsh.
Hongyon
First, I’m running on RHEL 5.5. Second, I recompiled mvapich2 from bare sources using device ch3:sock. To wit:
> ls -l /home/username/mvapich2
lrwxrwxrwx 1 username users 13 Jul 28 08:45 /home/username/mvapich2 -> mvapich2-sock/
> /home/username/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name /home/username/MPIExamples/hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_0]: [cli_1]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 0) terminated unexpectedly on hostname
Exit code -5 signaled from hostname
where username and hostname have been munged. Even if I avoid the symlink and use
/home/username/mvapich2-sock/bin/mpirun_rsh
I still get this error.
Although, weirdly, with ch3:sock there seem to be twice as many errors as with ch3:nemesis. That is different; not good, but different.
I just remembered something. To even get mpirun_rsh to be this successful, I had to set MALLOC_CHECK_=0. If I don’t:
> env | grep MALL
> ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
*** glibc detected *** ./hellow: double free or corruption (fasttop): 0x0000000006f2a090 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3984a7230f]
/lib64/libc.so.6(cfree+0x4b)[0x3984a7276b]
./hellow[0x429ddd]
======= Memory map: ========
00400000-0045d000 r-xp 00000000 08:02 27166098 /home/username/MPIExamples/hellow
0065c000-0069e000 rwxp 0005c000 08:02 27166098 /home/username/MPIExamples/hellow
0069e000-006aa000 rwxp 0069e000 00:00 0
06f27000-06f48000 rwxp 06f27000 00:00 0 [heap]
3984600000-398461c000 r-xp 00000000 08:01 1184266 /lib64/ld-2.5.so
398481b000-398481c000 r-xp 0001b000 08:01 1184266 /lib64/ld-2.5.so
398481c000-398481d000 rwxp 0001c000 08:01 1184266 /lib64/ld-2.5.so
3984a00000-3984b4e000 r-xp 00000000 08:01 1184267 /lib64/libc-2.5.so
3984b4e000-3984d4d000 ---p 0014e000 08:01 1184267 /lib64/libc-2.5.so
3984d4d000-3984d51000 r-xp 0014d000 08:01 1184267 /lib64/libc-2.5.so
3984d51000-3984d52000 rwxp 00151000 08:01 1184267 /lib64/libc-2.5.so
3984d52000-3984d57000 rwxp 3984d52000 00:00 0
3984e00000-3984e82000 r-xp 00000000 08:01 1184296 /lib64/libm-2.5.so
3984e82000-3985081000 ---p 00082000 08:01 1184296 /lib64/libm-2.5.so
3985081000-3985082000 r-xp 00081000 08:01 1184296 /lib64/libm-2.5.so
3985082000-3985083000 rwxp 00082000 08:01 1184296 /lib64/libm-2.5.so
3985600000-3985616000 r-xp 00000000 08:01 1184275 /lib64/libpthread-2.5.so
3985616000-3985815000 ---p 00016000 08:01 1184275 /lib64/libpthread-2.5.so
3985815000-3985816000 r-xp 00015000 08:01 1184275 /lib64/libpthread-2.5.so
3985816000-3985817000 rwxp 00016000 08:01 1184275 /lib64/libpthread-2.5.so
3985817000-398581b000 rwxp 3985817000 00:00 0
3985e00000-3985e07000 r-xp 00000000 08:01 1184276 /lib64/librt-2.5.so
3985e07000-3986007000 ---p 00007000 08:01 1184276 /lib64/librt-2.5.so
3986007000-3986008000 r-xp 00007000 08:01 1184276 /lib64/librt-2.5.so
3986008000-3986009000 rwxp 00008000 08:01 1184276 /lib64/librt-2.5.so
398fc00000-398fc11000 r-xp 00000000 08:01 1184309 /lib64/libresolv-2.5.so
398fc11000-398fe11000 ---p 00011000 08:01 1184309 /lib64/libresolv-2.5.so
398fe11000-398fe12000 r-xp 00011000 08:01 1184309 /lib64/libresolv-2.5.so
398fe12000-398fe13000 rwxp 00012000 08:01 1184309 /lib64/libresolv-2.5.so
398fe13000-398fe15000 rwxp 398fe13000 00:00 0
3995800000-399580d000 r-xp 00000000 08:01 1184319 /lib64/libgcc_s-4.1.2-20080825.so.1
399580d000-3995a0d000 ---p 0000d000 08:01 1184319 /lib64/libgcc_s-4.1.2-20080825.so.1
3995a0d000-3995a0e000 rwxp 0000d000 08:01 1184319 /lib64/libgcc_s-4.1.2-20080825.so.1
2aacf1030000-2aacf1032000 rwxp 2aacf1030000 00:00 0
2aacf104e000-2aacf1050000 rwxp 2aacf104e000 00:00 0
2aacf1050000-2aacf105a000 r-xp 00000000 08:01 1184248 /lib64/libnss_files-2.5.so
2aacf105a000-2aacf1259000 ---p 0000a000 08:01 1184248 /lib64/libnss_files-2.5.so
2aacf1259000-2aacf125a000 r-xp 00009000 08:01 1184248 /lib64/libnss_files-2.5.so
2aacf125a000-2aacf125b000 rwxp 0000a000 08:01 1184248 /lib64/libnss_files-2.5.so
2aacf125b000-2aacf125f000 r-xp 00000000 08:01 1184246 /lib64/libnss_dns-2.5.so
2aacf125f000-2aacf145e000 ---p 00004000 08:01 1184246 /lib64/libnss_dns-2.5.so
2aacf145e000-2aacf145f000 r-xp 00003000 08:01 1184246 /lib64/libnss_dns-2.5.so
2aacf145f000-2aacf1460000 rwxp 00004000 08:01 1184246 /lib64/libnss_dns-2.5.so
7fffeb4ce000-7fffeb4e3000 rwxp 7ffffffea000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]
MPI process (rank: 0) terminated unexpectedly on hostname
Exit code -5 signaled from hostname
Now, experimenting with MALLOC_CHECK_. First, equal to zero:
> MALLOC_CHECK_=0 ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_1]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_0]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
MPI process (rank: 1) terminated unexpectedly on hostname
Exit code -5 signaled from hostname
and if MALLOC_CHECK_=1:
> MALLOC_CHECK_=1 ~/mvapich2/bin/mpirun_rsh -np 2 -hostfile host_file_name ./hellow
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
*** glibc detected *** ./hellow: free(): invalid pointer: 0x000000001db02180 ***
*** glibc detected *** ./hellow: free(): invalid pointer: 0x000000001db02180 ***
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_0]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
*** glibc detected *** ./hellow: free(): invalid pointer: 0x0000000006103180 ***
MPI process (rank: 0) terminated unexpectedly on hostname
*** glibc detected *** ./hellow: free(): invalid pointer: 0x0000000006103180 ***
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
[cli_1]: aborting job:
Fatal error in MPI_Init: Invalid buffer pointer, error stack:
MPIR_Init_thread(411): Initialization failed
(unknown)(): Invalid buffer pointer
Exit code -5 signaled from hostname
malloc: using debugging hooks
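For reference, MALLOC_CHECK_ is a glibc setting, not an MVAPICH2 one: as I understand it, 0 disables glibc’s heap-consistency checks, 1 prints a diagnostic but continues, and 2 aborts on the first error. Putting it in front of a command, as above, applies it to that command only:

```shell
# The prefix form sets MALLOC_CHECK_ for this one command;
# the shell's own environment is left untouched
MALLOC_CHECK_=1 env | grep '^MALLOC_CHECK_'   # prints: MALLOC_CHECK_=1
```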
Finally, for your edification:
> ~/mvapich2/bin/mpirun_rsh -show -np 2 -hostfile host_file_name ./hellow
/bin/bash -c cd /home/username/MPIExamples; /usr/bin/env LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/home/username/mvapich2/lib:/home/username/lib:/opt/pgi/linux86-64/2010/cuda/lib:/opt/pgi/linux86-64/2010/cuda/open64/lib:/opt/pgi/linux86-64/2010/lib:/opt/pgi/linux86-64/2010/libso:/opt/cuda/lib64: MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=hostname MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=52278 MPISPAWN_MPIRUN_PORT=52278 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=6397 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_332_hostname_6397 MPISPAWN_LOCAL_NPROCS=2 MPISPAWN_ARGV_0=./hellow MPISPAWN_GENERIC_ENV_COUNT=1 MPISPAWN_GENERIC_NAME_0=MV2_XRC_FILE MPISPAWN_GENERIC_VALUE_0=mv2_xrc_226_hostname_6397 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/username/MPIExamples MPISPAWN_MPIRUN_RANK_0=0 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 MPISPAWN_MPIRUN_RANK_1=1 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 /home/username/mvapich2-sock/bin/mpispawn 0
We will get an RHEL 5.5 install here ASAP. We have tested with RHEL 5.3 and had no problem.
Hongyon
Hi,
We tried with RHEL 5.5 and still see no problem with 64-bit 10.6. Which version of PGI do you use? 32-bit or 64-bit, 10.5 or 10.6? I guess we need to start from the beginning.
These are exactly the steps I take in csh:
Install PGI in my home directory (anywhere should be fine).
setenv PGI /home/my_home_dir/pgi
setenv PATH /home/my_home_dir/pgi/linux86-64/2010/bin:$PATH
cd to_mvapich2_dir
env CC=pgcc FC=pgfortran F77=pgfortran CXX=pgcpp CFLAGS=-fast FCFLAGS=-fast FFLAGS=-fast \
CXXFLAGS=-fast ./configure --prefix=/home/my_home_dir/mvapich/mympich2 --with-device=ch3:sock >& configure.log
make
make install
cd ~/mytest_dir
Check the hostme file:
rhel55% more hostme
rhel55
rhel55% hostname
rhel55
Compile and run:
rhel55% /home/hongyon/mvapich/mympich2/bin/mpicc hello_mpi.c
rhel55% /home/hongyon/mvapich/mympich2/bin/mpirun_rsh -np 2 -hostfile hostme ./a.out
Hello world from process 0 of 2
Hello world from process 1 of 2
Please try those steps. If there is still a problem, then I am really at my wit’s end.
You might want to ask the MPICH2 folks. Also check out:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html#x1-580009.3.4
Not sure if this is related to the problem.
9.3.4 Creation of CQ or QP failure
A possible reason could be inability to pin the memory required. Make sure the following steps are taken.
In /etc/security/limits.conf add the following
soft memlock phys_mem_in_KB
After this, add the following to /etc/init.d/sshd
ulimit -l phys_mem_in_KB
Restart sshd
With some distros, we’ve found that adding the ulimit -l line to the sshd init script is no longer necessary. For instance, the following steps work for our rhel5 systems.
Add the following lines to /etc/security/limits.conf
soft memlock unlimited
hard memlock unlimited
Restart sshd
Hongyon
Can you please try with -O2 instead of -fast?
Hongyon
I tried it with no flags at all, still no joy even following your example. Likewise for unlimiting locked memory, still no luck.
Looks like I’ll need to move to the MVAPICH mailing list for help with this. If I ever solve this issue, I’ll add a reply here. Sooner or later, I’ll hopefully be back asking questions on linking to MVAPICH2. That’ll be nice.
Thank you. If we encounter the problem or have any clue we will let you know.
Hongyon
TheMatt:
I tried it with no flags at all, still no joy even following your example. Likewise for unlimiting locked memory, still no luck.
Looks like I’ll need to move to the MVAPICH mailing list for help with this. If I ever solve this issue, I’ll add a reply here. Sooner or later, I’ll hopefully be back asking questions on linking to MVAPICH2. That’ll be nice.
Hello, just checking to see if you found a solution to this error? We are experiencing the same on SLES11 SP1
I haven’t had any problems building the latest MVAPICH2 releases recently. I last tried with 1.7a2 and had no problems.
If I could ask for some info: Are you using the latest PGI and MVAPICH? What device are you targeting? How did you configure?