CUDA Fortran + MPI Debug help!

Dear community,
I have been struggling to debug a hanging MPI + CUDA Fortran code for the past 5 months. I am desperate, so I have decided to post here. Process 0 hangs waiting for Process 1 to produce a result (interruptible sleep vs. running in “ps aux” under RHEL 6.2). Here are the backtraces of the two processes (the sleeping one first, then the running one):

#0 0x00002aebe76485e3 in select () at …/sysdeps/unix/syscall-template.S:82
#1 0x000000000043c673 in socket_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_sock_sr.c:270
#2 0x000000000044ce9c in recv_message ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:181
#3 0x000000000044cd25 in p4_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:115
#4 0x00000000004534ee in MPID_CH_Check_incoming () at ./chchkdev.c:73
#5 0x000000000044efc5 in MPID_RecvComplete () at ./adi2recv.c:185
#6 0x000000000044621b in PMPI_Waitall ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/waitall.c:190
#7 0x00000000004465c3 in PMPI_Sendrecv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/sendrecv.c:95
#8 0x000000000042d154 in intra_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/intra_fns_new.c:248
#9 0x0000000000427fe7 in PMPI_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/barrier.c:66
#10 0x0000000000420e93 in pmpi_barrier_ ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/fortran/src/barrierf.c:83
#11 0x000000000041abe0 in pcgmp () at ./nextg.f90:1452
#12 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#13 0x00000000004089ee in main ()
#14 0x00002aebe7585d1d in __libc_start_main (main=0x4089b0 , argc=5,
ubp_av=0x7fffe0f000d8, init=, fini=,
rtld_fini=, stack_end=0x7fffe0f000c8) at libc-start.c:226
#15 0x00000000004088e9 in _start ()

#0 0x00007fff78dfba11 in clock_gettime ()
#1 0x00002b1932575e46 in clock_gettime (clock_id=4, tp=0x7fff78d55a50)
at …/sysdeps/unix/clock_gettime.c:116
#2 0x00002b19333621ce in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002b1932dca394 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002b1932ce968f in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002b1932cd9950 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002b1932ccd95f in ?? () from /usr/lib64/libcuda.so.1
#7 0x00002b19329a4d73 in ?? ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#8 0x00002b19329c283d in cudaDeviceSynchronize ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#9 0x000000000045d059 in cudadevicesynchronize_ ()
#10 0x000000000041abca in pcgmp () at ./nextg.f90:1452
#11 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#12 0x00000000004089ee in main ()
#13 0x00002b1933bfbd1d in __libc_start_main (main=0x4089b0 , argc=8,
ubp_av=0x7fff78d56cc8, init=, fini=,
rtld_fini=, stack_end=0x7fff78d56cb8) at libc-start.c:226
#14 0x00000000004088e9 in _start ()
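
Both traces end up at the same source line, nextg.f90:1452, inside pcgmp. To show the shape of what is going on there, here is a minimal sketch of that pattern (NOT my actual code; the module, kernel name, sizes, and variables are placeholders): a kernel launch, then cudaDeviceSynchronize, then MPI_Barrier.

! Sketch only, not the real code: just the pattern the traces point at
! (kernel launch -> cudaDeviceSynchronize -> MPI_Barrier).
module sketch_kernels
  use cudafor
contains
  attributes(global) subroutine add_one(a, n)
    real :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) + 1.0
  end subroutine add_one
end module sketch_kernels

program pcgmp_sketch
  use cudafor
  use sketch_kernels
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1024
  real, device :: a_d(n)
  integer :: ierr, rank, istat

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  a_d = 0.0
  call add_one<<<(n + 255)/256, 256>>>(a_d, n)

  ! the running process sits here, polling inside the driver (second trace) ...
  istat = cudaDeviceSynchronize()

  ! ... while the other rank has reached the barrier and sleeps in select() (first trace)
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program pcgmp_sketch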

I cannot use Nsight for profiling since it is an MPI + CUDA run, so I am just attaching gdb to the processes. I have access to the hanging process right now, so any more suggested tools for gathering information would be appreciated!
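
For anyone who wants to reproduce this, attaching gdb to a rank and dumping the stacks looks roughly like this (the PIDs are whatever “ps aux” shows for the two ranks):

ps aux | grep nextg        # find the PIDs of the two ranks
gdb -p <pid>               # attach to one rank
(gdb) thread apply all bt  # dump all thread stacks
(gdb) detach
(gdb) quit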

I have been using PGI Fortran 13.3 and 14.3 with MPICH 1.2.7, on both Windows 7 64-bit and RHEL 6.2 64-bit. I get the same result on both configurations, and I have tried both 2x 470 and 2x 570 GPUs with different CPUs (i5 and Xeon), so I think it is NOT hardware/OS/driver related.