Debug of a MPICH call inside ScaLAPACK pcheevx

Hello there! I am trying to debug a large program that uses ACML LAPACK, MPICH 1.2.7, BLACS and ScaLAPACK from netlib compiled with PGI 6.0-8 (all tests pass), and I am getting signal 13 and SIGSEGV (signal 11) errors. I have tracked the problem down to this point in the stack.
On the process that raised the signal, I got:
+++++++++++++++
pgdbg [all] 2> where

STACK TRACE:

#5 __libc_start_main address: 0x4015a974
#4 main address: 0x804b07b
#3 parmain311 line: "./main/main_sca.f90"@1146 address: 0x805aaef
#2 diago line: "./hamcubic/diago_zb_sca.f90"@783 address: 0x80ca6ab
ponto = 0x827f720, matrix_type = 8, ptype = .TRUE., pc = 0x82698e0, gx = 2, gy = 2, gz = 2, attrib = 0x8269910, vfix = 0x8299290, vport = 0x829ee20, vxc = 0x82c44b0, vhet = 0x829c050, vuni = 0x82a7750, vhid = 0x82bbb50, vmag = 0x82be920, vpol = 0x82c16e0, basis_matrix = 0x8269970, rot_matrix = 0x82699a0, het = 0x828fe20, xc_type = 0, strain_type = 0, ierr = 0, numhole = 48, numele = 48, dimax = 1000, holevlr = 0x82dfcd0, holeestr = 0x403a5010, holeesti = 0x403d4010, holepart = 0x82e0470, eletvlr = 0x82eae00, eletestr = 0x404ee010, eletesti = 0x4051d010, eletpart = 0x82eb5a0, context = 0
#1 pgf90_dealloc address: 0x8222cc8
=> #0 __hpf_dealloc address: 0x8222b27
+++++++++++++++

Or, when I run one more step, the same process gives me:
++++++++++++++++
pgdbg [all] 2> where

STACK TRACE:

#3 sig_err_handler file: p4_error.c address: 0x81ef34d
#2 p4_error address: 0x81ef16b
#1 zap_remote_p4_processes address: 0x81dadfb
=> #0 __connect address: 0x4021db1c

++++++++++++

On the other processes, I got the message:
+++++++++++++
pgdbg [all] 0> where

STACK TRACE:

#15 __libc_start_main address: 0x4015a974
#14 main address: 0x804b07b
#13 parmain311 line: "./main/main_sca.f90"@1146 address: 0x805aaef
#12 diago line: "./hamcubic/diago_zb_sca.f90"@654 address: 0x80c9a5a
ponto = 0x827f2fc, matrix_type = 8, ptype = .TRUE., pc = 0x82698e0, gx = 2, gy = 2, gz = 2, attrib = 0x8269910, vfix = 0x8299c10, vport = 0x829f7a0, vxc = 0x82b4370, vhet = 0x829c9d0, vuni = 0x82a8c50, vhid = 0x82aba20, vmag = 0x82ae7e0, vpol = 0x82b15b0, basis_matrix = 0x8269970, rot_matrix = 0x82699a0, het = 0x8290530, xc_type = 0, strain_type = 0, ierr = 0, numhole = 48, numele = 48, dimax = 1000, holevlr = 0x82cdd40, holeestr = 0x4047b010, holeesti = 0x404aa010, holepart = 0x82cea20, eletvlr = 0x82d8e70, eletestr = 0x40622010, eletesti = 0x40651010, eletpart = 0x82d9b50, context = 0
#11 pcheevx line: "pcheevx.f"@539 address: 0x8107a12
jobz = 0x824a64e, range = 0x824a64d, uplo = 0x824a64c, n = 1000, a = 0x4081f010, ia = 1, ja = 1, desca = 0x831e370, vl = 0, vu = 0, il = 203, iu = 298, abstol = 2.3509887e-38, m = 0, nz = 96, w = 0x832a300, orfac = -1, z = 0x40a10010, iz = 1, jz = 1, descz = 0x831e370, work = 0x826ddb0, lwork = -1, rwork = 0x826ddc8, lrwork = -1, iwork = 0x826ddcc, liwork = -1, ifail = 0x830abe0, iclustr = 0x832c6d0, gap = 0x83343c0, info = 0
#10 pslamch line: "pslamch.f"@1 address: 0x81077a4
ictxt = 0, cmach = 0x824ce44
#9 sgamx2d address: 0x8182e81
#8 PMPI_Allreduce address: 0x81ccf20
#7 intra_Allreduce file: intra_fns_new.c address: 0x81d6a1d
#6 PMPI_Sendrecv address: 0x81c2e11
#5 MPI_Waitall address: 0x81c23ed
#4 MPID_RecvComplete address: 0x81e4b8b
#3 MPID_CH_Check_incoming address: 0x81f7c2f
#2 p4_recv address: 0x81f1ff7
#1 recv_message address: 0x81f219b
=> #0 __select address: 0x402169f8
+++++++++++++

It seems that an MPI_Send or an MPI_Recv is either waiting or trying to access an illegal memory address, and then the process crashes. The error that the program gives me is:
+++++++++
p2_26928: p4_error: interrupt SIGSEGV: 11
[0] Stopped at 0x402169f8, function __select
402169f8: 5b popl %ebx
rm_l_2_26929: (14.261221) net_send: could not write to fd=5, errno = 32
+++++++++

I would appreciate some hints on how to pin this error down and, if I am lucky, how to solve it…

Thank you very much!

Hi ispmarin,

Can you post the significant code around line 783 of “diago_zb_sca.f90”? I’d like to know what is being deallocated, i.e. is it a compiler temporary variable or a user variable?

Also, what OS are you using? Are you compiling in 64 or 32-bit mode?

Thanks,
Mat

Hello Mat:
These are the lines where the crash is reported (779-788); the routine (called diago) ends just after the deallocation. Line 783 is the deallocation of a user array, AUXVETOR:
++++++
COMPLEX, ALLOCATABLE, DIMENSION(:,:) :: AUXVETOR
++++++

and it is allocated with:
+++++++++
ALLOCATE(AUXVETOR(LDROW,LDCOL))
+++++++++

where LDROW and LDCOL are the local sizes given by the ScaLAPACK routine NUMROC.
And sorry for leaving that out: I am compiling on an amd64 machine running Ubuntu Dapper Drake, in 64-bit mode.

++++++++++++++++++++
DEALLOCATE(VETOR,VALOR)
DEALLOCATE(HT)
! Deallocate the PCHEEVX auxiliary arrays
DEALLOCATE(DESCHT,DESCVT)
DEALLOCATE(AUXVETOR)
DEALLOCATE(WORK,RWORK,IWORK)
DEALLOCATE(IFAIL,ICLUSTR,GAP)
!
write(STFILE,*) 'Fim de DIAGO!'   ! "End of DIAGO!"
END SUBROUTINE DIAGO
++++++++++++++++++++

AUXVETOR is passed to the ScaLAPACK routine PCHEEVX and is used only inside it. Another thing I found is that it is always node 2 (of four processes, 0-3) that dies.
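
Since AUXVETOR is sized from NUMROC, one quick sanity check is to reproduce the NUMROC arithmetic by hand and compare it with the LDROW and LDCOL each process actually uses. The sketch below is only an illustration: the function name local_count is made up, n = 1000 is taken from the stack trace, and the block size of 64 and the 2 x 2 process grid are assumptions, not values from the real program.

```fortran
program local_size_check
  implicit none
  ! n comes from the trace (n = 1000); nb, nprow and npcol are ASSUMED values.
  integer, parameter :: n = 1000, nb = 64, nprow = 2, npcol = 2
  integer :: myrow, mycol, ldrow, ldcol

  do myrow = 0, nprow - 1
     do mycol = 0, npcol - 1
        ! Source process 0 in both directions, as in a default descriptor.
        ldrow = local_count(n, nb, myrow, 0, nprow)
        ldcol = local_count(n, nb, mycol, 0, npcol)
        print '(a,i0,a,i0,a,i0,a,i0)', 'process (', myrow, ',', mycol, &
              '): LDROW = ', ldrow, ', LDCOL = ', ldcol
     end do
  end do

contains

  ! Hypothetical helper that follows the block-cyclic counting done by
  ! ScaLAPACK's NUMROC: how many of nglobal rows (or columns), dealt out
  ! in blocks of size blk, land on process iproc out of nprocs.
  integer function local_count(nglobal, blk, iproc, isrcproc, nprocs)
    integer, intent(in) :: nglobal, blk, iproc, isrcproc, nprocs
    integer :: mydist, nblocks, extrablks

    ! Distance of this process from the one holding the first block.
    mydist = mod(nprocs + iproc - isrcproc, nprocs)
    ! Number of complete blocks, and the share every process gets.
    nblocks = nglobal / blk
    local_count = (nblocks / nprocs) * blk
    ! Leftover complete blocks go to the first extrablks processes;
    ! the next process receives the final partial block, if any.
    extrablks = mod(nblocks, nprocs)
    if (mydist < extrablks) then
       local_count = local_count + blk
    else if (mydist == extrablks) then
       local_count = local_count + mod(nglobal, blk)
    end if
  end function local_count

end program local_size_check
```

With these assumed values the 1000 rows split as 512 on process row 0 and 488 on process row 1, so the local pieces differ from process to process. A buffer sized for one process's piece is not necessarily large enough on another.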

Thank you!

I made some progress after getting some help. The problem is (I think) the size of the two vectors that are allocated locally and passed to the pcheevx routine. But I have run into deallocation problems in the past. How does automatic deallocation work in pgf90 6.0-8, inside and outside subroutines? Is it better in version 6.1?

Thank you again!

When you say that you think it’s the size of the two vectors, how big are they? Although you are compiling in 64-bit mode, I don’t believe ScaLAPACK has been ported to take advantage of 64-bit pointers, and hence passing in large arrays would cause problems. Does it work if you make the vectors smaller? Does the address of AUXVETOR change after the call to pcheevx (i.e. is AUXVETOR being corrupted)?
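
One way to answer the address question is to record the array's address before and after the call and compare the two. The sketch below is a stand-alone illustration, not the real code: stand_in_for_solver is a hypothetical placeholder for the pcheevx call, the local sizes are made up, and LOC is a vendor extension (not standard Fortran) that pgf90 and several other compilers provide.

```fortran
program check_array_address
  implicit none
  complex, allocatable, dimension(:,:) :: auxvetor
  ! ASSUMED local sizes; the real ones come from NUMROC.
  integer, parameter :: ldrow = 512, ldcol = 488
  integer(kind=8) :: addr_before, addr_after
  integer :: ierr

  allocate(auxvetor(ldrow, ldcol), stat=ierr)
  if (ierr /= 0) stop 'ALLOCATE failed'

  ! LOC (an extension) returns the address of the array data.
  addr_before = loc(auxvetor)
  call stand_in_for_solver(auxvetor, ldrow, ldcol)
  addr_after = loc(auxvetor)

  if (addr_before /= addr_after .or. .not. allocated(auxvetor)) then
     print *, 'array descriptor changed: ', addr_before, addr_after
  else
     print *, 'array descriptor unchanged, shape = ', shape(auxvetor)
  end if

  ! STAT= returns an error code for conditions the runtime can detect
  ! instead of aborting immediately.
  deallocate(auxvetor, stat=ierr)
  if (ierr /= 0) print *, 'DEALLOCATE reported stat = ', ierr

contains

  ! Hypothetical placeholder for the library call; it only writes
  ! inside the bounds it was given.
  subroutine stand_in_for_solver(z, m, n)
    integer, intent(in) :: m, n
    complex, intent(inout) :: z(m, n)
    z = (0.0, 0.0)
  end subroutine stand_in_for_solver

end program check_array_address
```

If the two addresses differ in the real program, something wrote over the array descriptor during the call. Note that STAT= on DEALLOCATE will not necessarily catch a corrupted heap; the runtime may still crash inside the deallocation.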

Also, MPICH has a known problem with 64-bit pointers since it uses a Fortran INTEGER to hold the value of a pointer. They tried a solution in which only the pointer’s offset was stored but I don’t believe it worked properly in the basic MPICH install. We were given a patch from Quadrics which is supposed to have added better 64-bit pointer support, but I don’t know if it will help your specific problem. Please see our MPICH Tips and Techniques for more information. You would need to rebuild your MPICH library with the patched source.
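
To illustrate why a Fortran INTEGER is too small for this, the following sketch compares a 64-bit address with the largest value a default INTEGER can hold. The address is a made-up example, and the program assumes the default INTEGER is 4 bytes, which is the usual case unless the compiler is told otherwise.

```fortran
program pointer_in_integer
  implicit none
  ! Integer kind with at least 18 decimal digits, i.e. 8 bytes.
  integer, parameter :: i8 = selected_int_kind(18)
  integer(kind=i8) :: address
  integer :: default_int   ! default INTEGER, ASSUMED to be 4 bytes

  ! HYPOTHETICAL 64-bit heap address, well above 2**31.
  address = 46912496119824_i8

  print *, 'address          = ', address
  print *, 'largest INTEGER  = ', huge(default_int)
  print *, 'fits in INTEGER? ', address <= huge(default_int)
end program pointer_in_integer
```

With a 4-byte default INTEGER the last comparison is false, so the address cannot be stored without losing its upper bits.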

Automatic deallocation only occurs with compiler-generated temporary arrays, which I don’t believe is occurring here. Somehow the address of AUXVETOR is being corrupted, but it is most likely not a compiler bug. You’re welcome to try 6.1, though.

- Mat