mpirun issue

Hi all, I have installed PGI 7.0-7 and MM5v3, and I am using the MPI (MPICH) that comes with PGI. After finishing the install I ran the cpi example with the command mpirun -v -nolocal -np 40 cpi (40 because I have 10 servers/nodes, each with an Intel Xeon quad core, so basically 4 CPUs per node).

When I run this example everything works fine, but when I move on to MM5 and try to run mpirun -v -nolocal -np 10 mm5.mpp (10 because I only want to test one process per node), I get an error like:

net_send: could not write to fd=5, errno = 32

This only happens when I use -np > 9; with -np <= 9 everything is fine.

Does anybody know how I can fix this? I'm getting desperate.

[mm5v3k@cl2master test]$ ./kenia.deck
(cd Run; make -i -r mmlif);
make[1]: Entering directory `/data/mm5v3k/mm5v3-kenia/MM5/Run'
cat < oparam > mmlif
cat < lparam >> mmlif
echo " IFRAD = `echo "2,0,0,0,0"|cut -d, -f1`,">>mmlif
echo " ICUPA = "6,6,1,1,1,1,1,1,1,1",">>mmlif
echo " IMPHYS = "4,4,1,1,1,1,1,1,1,1" ,">>mmlif
echo " IBLTYP = "5,5,0,0,0,0,0,0,0,0",">>mmlif
echo " ISHALLO = "0,0,0,0,0,0,0,0,0,0",">>mmlif
echo " IPOLAR = 0,">>mmlif
echo " ISOIL = 1,">>mmlif
if [ ""linux"" = "IBM" ]; then \
	echo " / ">>mmlif; \
elif [ ""linux"" = "sp2" ]; then \
	echo " / ">>mmlif; \
elif [ ""linux"" = "HP" ]; then \
	echo ' $END '>>mmlif; \
else \
	echo " &END ">>mmlif; \
fi;
cat < nparam >> mmlif
cat < pparam >> mmlif
cat < fparam >> mmlif
make[1]: Leaving directory `/data/mm5v3k/mm5v3-kenia/MM5/Run'
This version of mm5.deck stops after creating namelist file mmlif.
Please run code manually.
Mon Oct 27 16:54:06 EST 2008
running /data/mm5v3k/mm5v3-kenia/MM5/Run/mm5.mpp on 8 LINUX ch_p4 processors
Created /data/mm5v3k/mm5v3-kenia/MM5/Run/PI1830
node1 -- rsl_nproc_all 8, rsl_myproc 0
node2 -- rsl_nproc_all 8, rsl_myproc 1
node3 -- rsl_nproc_all 8, rsl_myproc 2
node5 -- rsl_nproc_all 8, rsl_myproc 4
node6 -- rsl_nproc_all 8, rsl_myproc 5
node8 -- rsl_nproc_all 8, rsl_myproc 7
node4 -- rsl_nproc_all 8, rsl_myproc 3
node7 -- rsl_nproc_all 8, rsl_myproc 6
rm_l_3_23458: (1969.804688) net_send: could not write to fd=5, errno = 32
rm_l_7_2332: (1968.929688) net_send: could not write to fd=5, errno = 32
rm_l_1_4222: (1970.253906) net_send: could not write to fd=5, errno = 32
rm_l_2_3760: (1970.035156) net_send: could not write to fd=7, errno = 32
rm_l_4_2499: (1969.601562) net_send: could not write to fd=5, errno = 32
rm_l_5_2746: (1969.371094) net_send: could not write to fd=5, errno = 32
P4 procgroup file is /data/mm5v3k/mm5v3-kenia/MM5/Run/PI1830.

Hi garlay,

A “net_send: could not write to fd=5, errno = 32” error typically means that one of your processes has died and the remaining processes are getting errors when sending it messages. Why the process is dying, I don’t know. However, given that it’s MM5, I would first check that you have enough stack space by setting your environment’s stack size to unlimited.
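Something like the following, as a rough sketch. The tcsh/csh form matches your .tcshrc; the ulimit form is the sh/bash equivalent. The rsh check is an assumption that ch_p4 is starting your remote processes over rsh, and the node name is just a placeholder:

# tcsh / csh: put this in ~/.tcshrc on every node
limit stacksize unlimited

# sh / bash equivalent, e.g. in ~/.bashrc
ulimit -s unlimited

# verify that a remote (rsh-started) shell on a compute node
# actually picks the limit up -- ch_p4 launches processes via rsh
rsh node2 'limit stacksize'     # csh-family remote shell
rsh node2 'ulimit -s'           # sh-family remote shell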

  • Mat

I have this in my .tcshrc:

limit stacksize unlimited

And here is what I get when I run limit:

[mm5v3k@cl2master /etc]# limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 0 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 32 kbytes
maxproc 100
[mm5v3k@cl2master /etc]#


What do you think, mkcolg?

Hi Garlay,

I’ve looked at several MM5 failures and almost all of them were due to stack overflows. I have also seen several cases where the problem size was simply too big for the available memory. Even ‘unlimited’ stack space has a limit.

Granted, your issue is a bit different in that it works with 9 processes but fails with 10. If you have the PGI CDK product, I would recompile MM5 with “-g”, link with the PGI MPICH library, and run the code in the PGI debugger, PGDBG. That would give you a starting point.
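Roughly, that would look something like the sketch below. The exact configure.user variable names depend on which Linux/PGI rule set you enabled, and the -dbg=pgdbg option assumes you are using the mpirun from the PGI-supplied MPICH, so treat this as an illustration rather than an exact recipe:

# in MM5/configure.user (the Linux/PGI section you compile with),
# add -g and turn off optimization, e.g.:
#   FCFLAGS   = -g -O0 ...   (keep the options already there)
#   LDOPTIONS = -g ...
cd MM5
make clean          # clean target name is an assumption; use whatever clean rule your build has
make mpp            # rebuild Run/mm5.mpp with the new flags

# then start the job under PGDBG through the CDK's MPICH mpirun
# (assumes that mpirun supports the -dbg=<debugger> option)
mpirun -nolocal -np 10 -dbg=pgdbg mm5.mpp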

You could also be having an issue with one of your cluster’s nodes. Is anything different about the 10th node? What happens if you change your machines file so that the 10th node is listed first?
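For example, with the ch_p4 machines file (conventionally machines.LINUX under the MPICH installation’s util/machines or share directory; the path, node names, and CPU counts below are assumptions based on your description), the reordering would look roughly like:

# machines.LINUX -- one host[:ncpus] entry per line for ch_p4;
# the first host listed is the first one handed a process,
# so put the suspect 10th node at the top
node10:4
node1:4
node2:4
node3:4
node4:4
node5:4
node6:4
node7:4
node8:4
node9:4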

  • Mat

I searched the /var/log/messages file on all of my nodes and found some kind of error on node1, which is the node that starts all the processes. The error is this:


Oct 20 11:33:35 node1 kernel: BUG: soft lockup detected on CPU#1!
Oct 20 11:33:35 node1 kernel:
Oct 20 11:33:35 node1 kernel: Call Trace:
Oct 20 11:33:35 node1 kernel: [] show_trace+0x34/0x47
Oct 20 11:33:35 node1 kernel: [] dump_stack+0x12/0x17
Oct 20 11:33:35 node1 kernel: [] softlockup_tick+0xdb/0xf6
Oct 20 11:33:35 node1 kernel: [] update_process_times+0x42/0x68
Oct 20 11:33:35 node1 kernel: [] smp_local_timer_interrupt+0x23/0x47
Oct 20 11:33:35 node1 kernel: [] smp_apic_timer_interrupt+0x41/0x47
Oct 20 11:33:35 node1 kernel: [] apic_timer_interrupt+0x66/0x6c
Oct 20 11:33:35 node1 kernel: DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c
Oct 20 11:33:35 node1 kernel: Leftover inexact backtrace:
Oct 20 11:33:35 node1 kernel: [] security_port_sid+0x32/0x98
Oct 20 11:33:35 node1 kernel: [] security_port_sid+0x1e/0x98
Oct 20 11:33:35 node1 kernel: [] selinux_ip_postroute_last+0x186/0x1d0
Oct 20 11:33:35 node1 kernel: [] :bnx2:bnx2_start_xmit+0x28c/0x4cc
Oct 20 11:33:35 node1 kernel: [] nf_iterate+0x41/0x7d
Oct 20 11:33:35 node1 kernel: [] ip_finish_output+0x0/0x1a8
Oct 20 11:33:35 node1 kernel: [] nf_hook_slow+0x5d/0xbf
Oct 20 11:33:35 node1 kernel: [] ip_finish_output+0x0/0x1a8
Oct 20 11:33:35 node1 kernel: [] ip_output+0xa4/0x249
Oct 20 11:33:35 node1 kernel: [] ip_queue_xmit+0x400/0x455
Oct 20 11:33:35 node1 kernel: [] cache_alloc_refill+0x125/0x192
Oct 20 11:33:35 node1 kernel: [] tcp_transmit_skb+0x653/0x68b
Oct 20 11:33:35 node1 kernel: [] __alloc_skb+0x77/0x123
Oct 20 11:33:35 node1 kernel: [] tcp_rcv_established+0x727/0x917
“messages.1” 1691L, 131381C

Hi Garlay,

My guess is that when one process dies, the kernel error is caused by reading from the now-dead socket connection. In other words, this is just a symptom and not the main problem. I’m not a kernel expert, though, so my thoughts are strictly based on 15 minutes of googling the string “soft lockup detected”. My impression is that this is a generic error message that can be caused by any number of problems. While it could be a kernel bug, a problem with your network card, or your system’s memory, I would think you’d have seen problems while running other codes, not just MM5. If you can, you should definitely use a debugger.
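One more hedged suggestion: the limit output you posted shows coredumpsize 0 kbytes, so a dying rank can never leave a core file behind. Enabling core dumps on every node would at least tell you which process died and where; the core file name below is just an illustration, and gdb is shown only as an example of a debugger that can read it:

# in ~/.tcshrc on every node, alongside the stacksize line
limit coredumpsize unlimited

# after the next failing run, look for core files in the Run directory
ls -l core*

# open one in a debugger to see where that rank died
# (the actual core file name will vary)
gdb ./mm5.mpp core.12345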

  • Mat