mpirun issue

Hi all, I have installed PGI 7.0-7 and MM5v3, and I am using the MPI (MPICH) that comes with PGI. After finishing the install I ran the cpi example with the command mpirun -v -nolocal -np 40 cpi (40 because I have 10 servers/nodes, each with an Intel Xeon quad core, so basically 4 CPUs per node).

When I run this example everything works fine, but when I move on to MM5 and try to run mpirun -v -nolocal -np 10 mm5.mpp (10 because I only want to test one process per node), I get an error like:

net_send: could not write to fd=5, errno = 32

This only happens when I use -np > 9; with -np <= 9 everything is fine.

Does anybody know how I can fix this? I'm getting desperate.

[mm5v3k@cl2master test]$ ./kenia.deck
(cd Run; make -i -r mmlif);
make[1]: Entering directory `/data/mm5v3k/mm5v3-kenia/MM5/Run'
cat < oparam > mmlif
cat < lparam >> mmlif
echo " IFRAD = `echo "2,0,0,0,0"|cut -d, -f1`,">>mmlif
echo " ICUPA = "6,6,1,1,1,1,1,1,1,1",">>mmlif
echo " IMPHYS = "4,4,1,1,1,1,1,1,1,1" ,">>mmlif
echo " IBLTYP = "5,5,0,0,0,0,0,0,0,0",">>mmlif
echo " ISHALLO = "0,0,0,0,0,0,0,0,0,0",">>mmlif
echo " IPOLAR = 0,">>mmlif
echo " ISOIL = 1,">>mmlif
if [ ""linux"" = "IBM" ]; then \
	echo " / ">>mmlif; \
elif [ ""linux"" = "sp2" ]; then \
	echo " / ">>mmlif; \
elif [ ""linux"" = "HP" ]; then \
	echo ' $END '>>mmlif; \
else \
	echo " &END ">>mmlif; \
fi;
cat < nparam >> mmlif
cat < pparam >> mmlif
cat < fparam >> mmlif
make[1]: Leaving directory `/data/mm5v3k/mm5v3-kenia/MM5/Run'
This version of mm5.deck stops after creating namelist file mmlif.
Please run code manually.
Mon Oct 27 16:54:06 EST 2008
running /data/mm5v3k/mm5v3-kenia/MM5/Run/mm5.mpp on 8 LINUX ch_p4 processors
Created /data/mm5v3k/mm5v3-kenia/MM5/Run/PI1830
node1 -- rsl_nproc_all 8, rsl_myproc 0
node2 -- rsl_nproc_all 8, rsl_myproc 1
node3 -- rsl_nproc_all 8, rsl_myproc 2
node5 -- rsl_nproc_all 8, rsl_myproc 4
node6 -- rsl_nproc_all 8, rsl_myproc 5
node8 -- rsl_nproc_all 8, rsl_myproc 7
node4 -- rsl_nproc_all 8, rsl_myproc 3
node7 -- rsl_nproc_all 8, rsl_myproc 6
rm_l_3_23458: (1969.804688) net_send: could not write to fd=5, errno = 32
rm_l_7_2332: (1968.929688) net_send: could not write to fd=5, errno = 32
rm_l_1_4222: (1970.253906) net_send: could not write to fd=5, errno = 32
rm_l_2_3760: (1970.035156) net_send: could not write to fd=7, errno = 32
rm_l_4_2499: (1969.601562) net_send: could not write to fd=5, errno = 32
rm_l_5_2746: (1969.371094) net_send: could not write to fd=5, errno = 32
P4 procgroup file is /data/mm5v3k/mm5v3-kenia/MM5/Run/PI1830.

Hi garlay,

A “net_send: could not write to fd=5, errno = 32” error typically means that one of your processes has died and the remaining processes are getting errors when sending it messages. Why the process is dying, I don’t know. However, given that it’s MM5, I would first check that you have enough stack space by setting your environment’s stack size to unlimited.
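Something like the following, as a rough sketch. The tcsh/csh form matches your .tcshrc; the ulimit form is the sh/bash equivalent. The rsh check is an assumption that ch_p4 is starting your remote processes over rsh, and the node name is just a placeholder:

# tcsh / csh: put this in ~/.tcshrc on every node
limit stacksize unlimited

# sh / bash equivalent, e.g. in ~/.bashrc
ulimit -s unlimited

# verify that a remote (rsh-started) shell on a compute node
# actually picks the limit up -- ch_p4 launches processes via rsh
rsh node2 'limit stacksize'     # csh-family remote shell
rsh node2 'ulimit -s'           # sh-family remote shell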

  • Mat

I have this in my .tcshrc:

limit stacksize unlimited

And here is what I get when I run limit:

[mm5v3k@cl2master /etc]# limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 32 kbytes
maxproc 100
[mm5v3k@cl2master /etc]#


What do you think, mkcolg?

Hi Garlay,

I’ve looked at several MM5 failures and almost all of them were due to stack overflows. I have also seen several cases where the problem size was simply too big for the available memory. Even ‘unlimited’ stack space has a limit.

Granted, your issue is a bit different in that it works with 9 processes but fails with 10. If you have the PGI CDK product, I would recompile MM5 with “-g”, link with the PGI MPICH library, and run the code in the PGI debugger, PGDBG. That would give you a starting point.
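Roughly, that would look something like the sketch below. The exact configure.user variable names depend on which Linux/PGI rule set you enabled, and the -dbg=pgdbg option assumes you are using the mpirun from the PGI-supplied MPICH, so treat this as an illustration rather than an exact recipe:

# in MM5/configure.user (the Linux/PGI section you compile with),
# add -g and turn off optimization, e.g.:
#   FCFLAGS   = -g -O0 ...   (keep the options already there)
#   LDOPTIONS = -g ...
cd MM5
make clean          # clean target name is an assumption; use whatever clean rule your build has
make mpp            # rebuild Run/mm5.mpp with the new flags

# then start the job under PGDBG through the CDK's MPICH mpirun
# (assumes that mpirun supports the -dbg=<debugger> option)
mpirun -nolocal -np 10 -dbg=pgdbg mm5.mpp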

You could also be having an issue with one of your cluster’s nodes. Is anything different about the 10th node? What happens if you change your machines file so that the 10th node is listed first?
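For example, with the ch_p4 machines file (conventionally machines.LINUX under the MPICH installation’s util/machines or share directory; the path, node names, and CPU counts below are assumptions based on your description), the reordering would look roughly like:

# machines.LINUX -- one host[:ncpus] entry per line for ch_p4;
# the first host listed is the first one handed a process,
# so put the suspect 10th node at the top
node10:4
node1:4
node2:4
node3:4
node4:4
node5:4
node6:4
node7:4
node8:4
node9:4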

  • Mat

I searched the /var/log/messages file on all of my nodes and found some kind of error on node1, which is the node that starts all the processes. The error is this:


Oct 20 11:33:35 node1 kernel: BUG: soft lockup detected on CPU#1!
Oct 20 11:33:35 node1 kernel:
Oct 20 11:33:35 node1 kernel: Call Trace:
Oct 20 11:33:35 node1 kernel: [] show_trace+0x34/0x47
Oct 20 11:33:35 node1 kernel: [] dump_stack+0x12/0x17
Oct 20 11:33:35 node1 kernel: [] softlockup_tick+0xdb/0xf6
Oct 20 11:33:35 node1 kernel: [] update_process_times+0x42/0x68
Oct 20 11:33:35 node1 kernel: [] smp_local_timer_interrupt+0x23/0x47
Oct 20 11:33:35 node1 kernel: [] smp_apic_timer_interrupt+0x41/0x47
Oct 20 11:33:35 node1 kernel: [] apic_timer_interrupt+0x66/0x6c
Oct 20 11:33:35 node1 kernel: DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c
Oct 20 11:33:35 node1 kernel: Leftover inexact backtrace:
Oct 20 11:33:35 node1 kernel: [] security_port_sid+0x32/0x98
Oct 20 11:33:35 node1 kernel: [] security_port_sid+0x1e/0x98
Oct 20 11:33:35 node1 kernel: [] selinux_ip_postroute_last+0x186/0x1d0
Oct 20 11:33:35 node1 kernel: [] :bnx2:bnx2_start_xmit+0x28c/0x4cc
Oct 20 11:33:35 node1 kernel: [] nf_iterate+0x41/0x7d
Oct 20 11:33:35 node1 kernel: [] ip_finish_output+0x0/0x1a8
Oct 20 11:33:35 node1 kernel: [] nf_hook_slow+0x5d/0xbf
Oct 20 11:33:35 node1 kernel: [] ip_finish_output+0x0/0x1a8
Oct 20 11:33:35 node1 kernel: [] ip_output+0xa4/0x249
Oct 20 11:33:35 node1 kernel: [] ip_queue_xmit+0x400/0x455
Oct 20 11:33:35 node1 kernel: [] cache_alloc_refill+0x125/0x192
Oct 20 11:33:35 node1 kernel: [] tcp_transmit_skb+0x653/0x68b
Oct 20 11:33:35 node1 kernel: [] __alloc_skb+0x77/0x123
Oct 20 11:33:35 node1 kernel: [] tcp_rcv_established+0x727/0x917
“messages.1” 1691L, 131381C

Hi Garlay,

My guess is that when one process dies, the kernel error is caused by reading from the now-dead socket connection. In other words, this is just a symptom and not the main problem. I’m not a kernel expert, though, so my thoughts are strictly based on 15 minutes of googling the string “soft lockup detected”. My impression is that this is a generic error message that can be caused by any number of problems. While it could be a kernel bug, a problem with your network card, or your system’s memory, I would think you’d have seen problems while running other codes, not just MM5. If you can, you should definitely use a debugger.
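One more hedged suggestion: the limit output you posted shows coredumpsize 0 kbytes, so a dying rank can never leave a core file behind. Enabling core dumps on every node would at least tell you which process died and where; the core file name below is just an illustration, and gdb is shown only as an example of a debugger that can read it:

# in ~/.tcshrc on every node, alongside the stacksize line
limit coredumpsize unlimited

# after the next failing run, look for core files in the Run directory
ls -l core*

# open one in a debugger to see where that rank died
# (the actual core file name will vary)
gdb ./mm5.mpp core.12345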

  • Mat