MPI error

I use PGI.Acc.Fortran 11.7, and when I combined MPI with CUDA Fortran I encountered a problem. The Slurm file, which contains the execution information, said: /home/bin/pgi/linux86-64/2011/mpi/mpich/bin/mpirun.ch_p4:line 243 3053 killed.

I also found that this problem does not always appear. Some jobs finish successfully, and some do not.

Also, I used one CPU and one GPU to run the MPI job, as a test.

Thanks very much!

Hi zsh,

This means that one of your MPI processes was killed or crashed unexpectedly. This could be caused by resource limits on your cluster, MPI configuration, program errors, etc. Basically, it could be any number of problems.

I would start by running a single process (which you've done), then run 2, 4, etc. until the crash occurs. Try limiting your program to a single node, then run again on multiple nodes. Since you're using CUDA Fortran, the problem may be with a particular GPU or with oversubscribed GPUs (until the K20 is out, each MPI process should have its own GPU). If you think it may be a program error, you can compile in emulation mode (-g -Mcuda=emu) and then run your program in the PGI debugger, pgdbg. PGDBG is able to run MPI processes. If you have a CDK license, you can even run pgdbg across multiple nodes.
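As a rough sketch of the "run 1, 2, 4, etc. until the crash occurs" step, a small shell helper along these lines could automate the doubling. The function name find_failing_np and the program name ./my_app are made up for illustration, and it assumes your launcher accepts -np N and returns a non-zero exit status when a process is killed. Since your failure is intermittent, a passing run at a given count does not prove that count is safe, so you may want to repeat each count a few times.

```shell
# find_failing_np MAX LAUNCHER [ARGS...]
# Hypothetical helper: runs "LAUNCHER -np N ARGS..." for N = 1, 2, 4, ...
# up to MAX. Prints the first N whose run exits non-zero and returns 1,
# or prints "none" and returns 0 if every run succeeds.
find_failing_np() {
    max=$1
    launcher=$2
    shift 2
    n=1
    while [ "$n" -le "$max" ]; do
        # Send the program's own output to stderr so that stdout
        # carries only the result printed by this function.
        if ! "$launcher" -np "$n" "$@" >&2; then
            echo "$n"
            return 1
        fi
        n=$((n * 2))
    done
    echo "none"
    return 0
}

# Example call (./my_app is a placeholder for your executable):
#   find_failing_np 8 mpirun ./my_app
```

Running it once inside a single-node job and once inside a multi-node job would cover the second suggestion as well.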

Hope this helps,
Mat

Thanks Mat, I will figure out which part caused this error!

Thanks a lot for your suggestion!