PGI distribution: mpiexec collective abort on Linux cluster

Hi,
I am trying to run a multiple-executable [5] climate model on our 64-bit Intel Xeon cluster [6 nodes, each with 3 GHz] with mpi2 [the version distributed with PGI 7.2-5]. The program compiles without errors, but during the run using mpiexec it shows:

“rank 0 in job 1 xxxx_42400 caused collective abort of all ranks
exit status of rank 0: killed by signal 9”
What could be the possible issue? Any help would be appreciated…
Thanks,


PS. I’m using Ethernet for communication between the nodes.

Hi Mpinewbie597,

exit status of rank 0: killed by signal 9

A signal 9 is the ‘kill’ signal, meaning that someone, or more likely the OS, issued a ‘kill -9’ on your process. If a system resource limit has been reached (such as memory), the OS will try to recover by killing the processes that are using the most of that resource.

So, my best guess as to the problem is that your application used more memory or cputime than was allowed or available.
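
As a rough illustration of what "killed by signal 9" means, here is a small Python sketch (not part of the model or of mpiexec; the function name is made up for this example). It starts a child process, sends it SIGKILL, and shows that the parent only sees an abnormal exit status, which is the same situation mpiexec reports for rank 0. It assumes a POSIX system such as Linux.

```python
import signal
import subprocess
import sys


def run_and_kill():
    """Start a sleeping child, send it SIGKILL, and return its exit status."""
    # Hypothetical stand-in for a long-running rank: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # SIGKILL (signal 9) cannot be caught or ignored by the child.
    child.send_signal(signal.SIGKILL)
    child.wait()
    # On POSIX, subprocess reports death by signal N as a return code of -N.
    return child.returncode


rc = run_and_kill()
print("return code:", rc)
print("signal name:", signal.Signals(-rc).name)
```

The child gets no chance to print a message of its own, which is why the only clue in the job output is the exit status.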

Hope this helps,
Mat

Hi Mat,
Thank you for the reply. I’ll check whether it’s a memory problem then… Right now I have set the stacksize limit to unlimited on all nodes, but I’m not sure that works, as I still get the same error afterwards. I also don’t have a clue how to set the memory limits on the mpiexec command line; if anyone here knows, please leave a reply.

Hi Mpinewbie597,

A stack overflow would cause a different error. The OS would kill your process if your system ran out of memory or you hit the ‘memoryuse’ limit.
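
One way to see which limits a process actually inherits is to print them from inside the process. Below is a minimal Python sketch using the standard `resource` module. The mapping from csh-style limit names (‘stacksize’, ‘memoryuse’, and so on) to `RLIMIT_*` constants is my assumption about how they correspond, and the helper names are made up for this example. Whether a given limit is actually enforced depends on the kernel.

```python
import resource

# Assumed mapping from csh-style limit names to POSIX resource constants.
LIMITS = {
    "cputime": resource.RLIMIT_CPU,      # seconds
    "stacksize": resource.RLIMIT_STACK,  # bytes
    "datasize": resource.RLIMIT_DATA,    # bytes
    "memoryuse": resource.RLIMIT_RSS,    # bytes
    "vmemoryuse": resource.RLIMIT_AS,    # bytes
}


def fmt(value):
    """Render a raw limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits of this process."""
    return {
        name: tuple(fmt(v) for v in resource.getrlimit(res))
        for name, res in LIMITS.items()
    }


for name, (soft, hard) in current_limits().items():
    print(f"{name:12s} soft={soft:>20s} hard={hard:>20s}")
```

Running this on each node, launched the same way as the model, should show whether the ‘unlimited’ stacksize setting really reaches the processes and whether any memory limit is set lower than expected.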

- Mat