MPI_WAIT error

I am trying to run a meteorological model (RAMSV.6) in parallel. It was compiled with PGI on a Dell cluster (9 processors) in a Linux environment (SUSE Linux Enterprise Server 10).
I have run it successfully on other occasions. Now I am running a meteorological simulation with a larger dataset. It runs OK for a while, but then I get the following MPI-related error messages:

radiation tendencies updated time = 10800.0 UTC TIME (HRS) = 3.0
rank 8 in job 13 n0_55134 caused collective abort of all ranks
exit status of rank 8: killed by signal 11
[cli_4]: aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(140)…: MPI_Wait(request=0x7fffe3a1132c, status0x7fffe3a11330) failed
MPIDI_CH3_Progress_wait(212)…: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDU_Socki_handle_read(633)…: connection failure (set=0,sock=5,errno=104:Connection reset by peer)

Any ideas?
Thanks in advance,

Hi eee,

It sounds like a stack overflow, but you’ll need to run the program with the debugger to be sure. Try setting the stack size to unlimited in your shell’s start-up file to see if this works around the problem.

  • Mat

I don't know how to run the program with the debugger; I'm not very experienced.
The program was compiled with the following options:

F_OPTS=-Mvect=cachesize:524288 -Munroll -Mnoframe -O2 -pc 64
LOADER_OPTS=-v -lgcc_eh -lpthread
LIBS=-L/opt/pgi/linux86-64/6.2/lib -L/opt/pgi/linux86-64/6.2/libso

I can post additional information if that will help track down the problem.
Any suggestion is greatly appreciated…

Hi Estibaliz,

First try increasing the available stack size to see if it corrects the problem. To do this add “ulimit -s unlimited” to your home directory’s “.bashrc” file if you’re using the bash shell, or “limit stacksize unlimited” in your “.cshrc” file if you’re using TCSH/CSH.
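As a quick check before editing the start-up file, the same change can be tried in the current bash/sh session. This is only a sketch: the fallback to the hard limit and the commented-out `mpiexec` line are assumptions about your setup, not something taken from the RAMS or PGI documentation.

```shell
#!/bin/sh
# Show the current soft stack limit (a size in kB, or "unlimited").
echo "stack limit before: $(ulimit -s)"

# Raise the limit for this shell and everything launched from it.
# If "unlimited" is refused, fall back to the hard limit, which is the
# highest value an ordinary user is allowed to set.
ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -H -s)"

# Confirm the value that the model will actually inherit.
echo "stack limit after: $(ulimit -s)"

# Assumption: the job is started with mpiexec on 9 processes.
# Uncomment to see which limit each MPI rank really gets, since remote
# ranks may not read the same start-up file as your login shell:
# mpiexec -n 9 sh -c 'ulimit -s'
```

If the second value is still small, the hard limit on that node is the constraint and would need to be raised by the system administrator.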

As for using the PGI debugger, please refer to the PGI Tools Guide for detailed information. Note that you must have the PGI CDK product to use the MPI debugging feature.

  • Mat