Dear community,
I have been struggling to debug a hanging MPI+CUDA Fortran code for the past 5 months. I am desperate, so I have decided to post here. Process 0 hangs waiting for Process 1 to produce a result (interruptible sleep vs. running in “ps aux” under RHEL 6.2). Here are the backtraces of the two processes.

Process 0 (blocked in MPI_Barrier):
#0 0x00002aebe76485e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x000000000043c673 in socket_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_sock_sr.c:270
#2 0x000000000044ce9c in recv_message ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:181
#3 0x000000000044cd25 in p4_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:115
#4 0x00000000004534ee in MPID_CH_Check_incoming () at ./chchkdev.c:73
#5 0x000000000044efc5 in MPID_RecvComplete () at ./adi2recv.c:185
#6 0x000000000044621b in PMPI_Waitall ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/waitall.c:190
#7 0x00000000004465c3 in PMPI_Sendrecv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/sendrecv.c:95
#8 0x000000000042d154 in intra_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/intra_fns_new.c:248
#9 0x0000000000427fe7 in PMPI_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/barrier.c:66
#10 0x0000000000420e93 in pmpi_barrier_ ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/fortran/src/barrierf.c:83
#11 0x000000000041abe0 in pcgmp () at ./nextg.f90:1452
#12 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#13 0x00000000004089ee in main ()
#14 0x00002aebe7585d1d in __libc_start_main (main=0x4089b0 <main>, argc=5,
ubp_av=0x7fffe0f000d8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffe0f000c8) at libc-start.c:226
#15 0x00000000004088e9 in _start ()

Process 1 (spinning inside cudaDeviceSynchronize):
#0 0x00007fff78dfba11 in clock_gettime ()
#1 0x00002b1932575e46 in clock_gettime (clock_id=4, tp=0x7fff78d55a50)
at ../sysdeps/unix/clock_gettime.c:116
#2 0x00002b19333621ce in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002b1932dca394 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002b1932ce968f in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002b1932cd9950 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002b1932ccd95f in ?? () from /usr/lib64/libcuda.so.1
#7 0x00002b19329a4d73 in ?? ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#8 0x00002b19329c283d in cudaDeviceSynchronize ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#9 0x000000000045d059 in cudadevicesynchronize_ ()
#10 0x000000000041abca in pcgmp () at ./nextg.f90:1452
#11 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#12 0x00000000004089ee in main ()
#13 0x00002b1933bfbd1d in __libc_start_main (main=0x4089b0 <main>, argc=8,
ubp_av=0x7fff78d56cc8, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fff78d56cb8) at libc-start.c:226
#14 0x00000000004088e9 in _start ()
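
Both traces bottom out at the same source line, nextg.f90:1452, inside pcgmp: Process 0 has (presumably) already passed its own device synchronize and is polling the ch_p4 socket inside the barrier (the select() frame), while Process 1 never returns from cudaDeviceSynchronize, as if its kernel never completes. To make that structure concrete, here is a stripped-down sketch of the pattern; the module, kernel, and variable names are placeholders I made up, not the real code. Only pcgmp, nextg.f90:1452, cudaDeviceSynchronize, and mpi_barrier come from the traces:

! Stripped-down sketch of the pattern around nextg.f90:1452.
! Kernel, module, and variable names are placeholders.
module sketch_kernels
  use cudafor
contains
  attributes(global) subroutine solver_step(x, n)
    real :: x(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) x(i) = x(i) + 1.0
  end subroutine solver_step
end module sketch_kernels

program pcgmp_sketch
  use cudafor
  use sketch_kernels
  implicit none
  include 'mpif.h'          ! MPICH 1.2.7 predates the mpi module
  integer, parameter :: n = 1024
  real, device :: d_x(n)
  integer :: ierr, istat

  call mpi_init(ierr)
  d_x = 0.0

  ! each rank runs its share of the iteration on its own GPU
  call solver_step<<<(n + 255) / 256, 256>>>(d_x, n)

  ! Process 1's trace is stuck here: the host polls (the
  ! clock_gettime frames) until the kernel finishes.
  istat = cudaDeviceSynchronize()

  ! Process 0's trace is stuck here: ch_p4 polls its socket
  ! (the select() frame) waiting for rank 1 to reach the barrier.
  call mpi_barrier(MPI_COMM_WORLD, ierr)

  call mpi_finalize(ierr)
end program pcgmp_sketch

In other words, rank 0 looks healthy; the hang appears to start with whatever keeps rank 1's device work from completing.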
I cannot use Nsight for profiling since this is an MPI+CUDA code, so I am just using gdb to attach to the processes. I still have access to the hanging processes right now, so any other suggested tools for gathering more information would be appreciated!
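
For completeness, this is how I have been capturing the traces above with gdb (the PID here is a placeholder; I take the real ones from “ps aux”):

# find the PIDs of the two ranks
ps aux | grep nextg
# attach to each rank in turn and dump its stack
gdb -p 12345
(gdb) bt
(gdb) detach
(gdb) quit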