I am trying to run a meteorological model (RAMS v.6) in parallel, compiled with PGI, on a Dell cluster (9 processors) running SUSE Linux Enterprise Server 10.
I have run it successfully on other occasions. Now I am executing a meteorological simulation with a larger dataset and it runs OK for a while, but then I get the following error messages related to MPI:
radiation tendencies updated time = 10800.0 UTC TIME (HRS) = 3.0
rank 8 in job 13 n0_55134 caused collective abort of all ranks
exit status of rank 8: killed by signal 11
[cli_4]: aborting job:
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(140)…: MPI_Wait(request=0x7fffe3a1132c, status=0x7fffe3a11330) failed
MPIDI_CH3_Progress_wait(212)…: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)…: connection failure (set=0,sock=5,errno=104:Connection reset by peer)
A crash with signal 11 (SIGSEGV) partway through a larger run often indicates a stack overflow, but you would need to run the program under a debugger to be sure. As a first step, try setting the stack size to unlimited to see if that works around the problem: add "ulimit -s unlimited" to the ".bashrc" file in your home directory if you are using the bash shell, or "limit stacksize unlimited" to your ".cshrc" file if you are using tcsh/csh.
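For illustration, these are the lines to add (the start-up file names are the usual defaults; adjust them if your cluster is configured differently):

```shell
# bash: append to ~/.bashrc
ulimit -s unlimited

# tcsh/csh: append to ~/.cshrc instead
# limit stacksize unlimited
```

After editing the file, log out and back in (or source the file) and verify with "ulimit -s" (bash) or "limit stacksize" (tcsh/csh) that it reports "unlimited". Note that the limit must take effect on every node where MPI ranks run, not just the node you launch from, so the start-up file needs to be updated on (or shared with) all nodes.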
As for using the PGI debugger, please refer to the PGI Tools Guide for detailed information. Note that you must have the PGI CDK product to use the MPI debugging feature.