Hello there! I am trying to debug a large program that uses ACML LAPACK, MPICH 1.2.7, BLACS and ScaLAPACK from netlib, compiled with PGI 6.0-8 (all tests OK), and I am getting signal 13 and SIGSEGV 11 errors. I have tracked the problem down to this point in the stack.
On the process that raised the signal, I got:
+++++++++++++++
pgdbg [all] 2> where
STACK TRACE:
#5 __libc_start_main address: 0x4015a974
#4 main address: 0x804b07b
#3 parmain311 line: “./main/main_sca.f90”@1146 address: 0x805aaef
#2 diago line: “./hamcubic/diago_zb_sca.f90”@783 address: 0x80ca6ab
ponto = 0x827f720, matrix_type = 8, ptype = .TRUE., pc = 0x82698e0, gx = 2, gy = 2, gz = 2, attrib = 0x8269910, vfix = 0x8299290, vport = 0x829ee20, vxc = 0x82c44b0, vhet = 0x829c050, vuni = 0x82a7750, vhid = 0x82bbb50, vmag = 0x82be920, vpol = 0x82c16e0, basis_matrix = 0x8269970, rot_matrix = 0x82699a0, het = 0x828fe20, xc_type = 0, strain_type = 0, ierr = 0, numhole = 48, numele = 48, dimax = 1000, holevlr = 0x82dfcd0, holeestr = 0x403a5010, holeesti = 0x403d4010, holepart = 0x82e0470, eletvlr = 0x82eae00, eletestr = 0x404ee010, eletesti = 0x4051d010, eletpart = 0x82eb5a0, context = 0
#1 pgf90_dealloc address: 0x8222cc8
=> #0 __hpf_dealloc address: 0x8222b27
+++++++++++++++
Or, when I run one more step, the same process gives me:
++++++++++++++++
pgdbg [all] 2> where
STACK TRACE:
#3 sig_err_handler file: p4_error.c address: 0x81ef34d
#2 p4_error address: 0x81ef16b
#1 zap_remote_p4_processes address: 0x81dadfb
=> #0 __connect address: 0x4021db1c
++++++++++++
On the other processes, I got the message:
+++++++++++++
pgdbg [all] 0> where
STACK TRACE:
#15 _libc_start_main address: 0x4015a974
#14 main address: 0x804b07b
#13 parmain311 line: “./main/main_sca.f90”@1146 address: 0x805aaef
#12 diago line: “./hamcubic/diago_zb_sca.f90”@654 address: 0x80c9a5a
ponto = 0x827f2fc, matrix_type = 8, ptype = .TRUE., pc = 0x82698e0, gx = 2, gy = 2, gz = 2, attrib = 0x8269910, vfix = 0x8299c10, vport = 0x829f7a0, vxc = 0x82b4370, vhet = 0x829c9d0, vuni = 0x82a8c50, vhid = 0x82aba20, vmag = 0x82ae7e0, vpol = 0x82b15b0, basis_matrix = 0x8269970, rot_matrix = 0x82699a0, het = 0x8290530, xc_type = 0, strain_type = 0, ierr = 0, numhole = 48, numele = 48, dimax = 1000, holevlr = 0x82cdd40, holeestr = 0x4047b010, holeesti = 0x404aa010, holepart = 0x82cea20, eletvlr = 0x82d8e70, eletestr = 0x40622010, eletesti = 0x40651010, eletpart = 0x82d9b50, context = 0
#11 pcheevx line: “pcheevx.f”@539 address: 0x8107a12
jobz = 0x824a64e, range = 0x824a64d, uplo = 0x824a64c, n = 1000, a = 0x4081f010, ia = 1, ja = 1, desca = 0x831e370, vl = 0, vu = 0, il = 203, iu = 298, abstol = 2.3509887e-38, m = 0, nz = 96, w = 0x832a300, orfac = -1, z = 0x40a10010, iz = 1, jz = 1, descz = 0x831e370, work = 0x826ddb0, lwork = -1, rwork = 0x826ddc8, lrwork = -1, iwork = 0x826ddcc, liwork = -1, ifail = 0x830abe0, iclustr = 0x832c6d0, gap = 0x83343c0, info = 0
#10 pslamch line: “pslamch.f”@1 address: 0x81077a4
ictxt = 0, cmach = 0x824ce44
#9 sgamx2d address: 0x8182e81
#8 PMPI_Allreduce address: 0x81ccf20
#7 intra_Allreduce file: intra_fns_new.c address: 0x81d6a1d
#6 PMPI_Sendrecv address: 0x81c2e11
#5 MPI_Waitall address: 0x81c23ed
#4 MPID_RecvComplete address: 0x81e4b8b
#3 MPID_CH_Check_incoming address: 0x81f7c2f
#2 p4_recv address: 0x81f1ff7
#1 recv_message address: 0x81f219b
=> #0 __select address: 0x402169f8
+++++++++++++
It seems that an MPI_Send or an MPI_Recv is either waiting indefinitely or trying to access an illegal memory address, and then the process crashes. The error that the program gives me is:
+++++++++
p2_26928: p4_error: interrupt SIGSEGV: 11
[0] Stopped at 0x402169f8, function __select
402169f8: 5b popl %ebx
rm_l_2_26929: (14.261221) net_send: could not write to fd=5, errno = 32
+++++++++
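As a side note, the numeric codes in these messages can be decoded with a short lookup. The sketch below is in Python, which is not part of the program being debugged; it is used here only as a convenient lookup tool, and the helper names `describe_signal` and `describe_errno` are made up for illustration. On Linux, signal 11 is SIGSEGV, signal 13 is SIGPIPE, and errno 32 is EPIPE ("Broken pipe"). That suggests, without proving it, that the `net_send` failure on fd=5 is a consequence of a peer process having already died rather than a separate fault.

```python
import errno
import os
import signal


def describe_signal(number):
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(number).name


def describe_errno(number):
    """Return (symbolic name, human-readable message) for an errno value."""
    return errno.errorcode[number], os.strerror(number)


if __name__ == "__main__":
    # Signal numbers taken from the error messages quoted in this post.
    for sig in (11, 13):
        print(f"signal {sig}: {describe_signal(sig)}")

    # errno reported by net_send in the last line of the program output.
    name, message = describe_errno(32)
    print(f"errno 32: {name} ({message})")
```

The numbering is platform dependent, so the script should be run on the same kind of machine that produced the errors.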
I would appreciate some hints on how to narrow down this error and, if I am lucky, how to solve it…
Thank you very much!