GDB CUDA Fortran hang?

Hello, I am pasting the backtraces of two processes running CUDA Fortran + MPI on 2 GPUs with 2 CPU cores. Under CentOS 6.2, ps aux shows one process in the sleeping state (Process 0) and one running (Process 1). The code hangs after millions of iterations, always at the same part of the code but at a random iteration (it could be the 10th or the 1,000,000th), so I suspect some kind of deadlock. Could someone please tell me if there is something suspicious in the backtraces of the two processes? I think Process 0 is waiting on the outcome of Process 1 before it can proceed, and that's why it hangs!

Process 0:

#0 0x00002aebe76485e3 in select () at …/sysdeps/unix/syscall-template.S:82
#1 0x000000000043c673 in socket_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_sock_sr.c:270
#2 0x000000000044ce9c in recv_message ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:181
#3 0x000000000044cd25 in p4_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:115
#4 0x00000000004534ee in MPID_CH_Check_incoming () at ./chchkdev.c:73
#5 0x000000000044efc5 in MPID_RecvComplete () at ./adi2recv.c:185
#6 0x000000000044621b in PMPI_Waitall ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/waitall.c:190
#7 0x00000000004465c3 in PMPI_Sendrecv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/sendrecv.c:95
#8 0x000000000042d154 in intra_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/intra_fns_new.c:248
#9 0x0000000000427fe7 in PMPI_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/barrier.c:66
#10 0x0000000000420e93 in pmpi_barrier_ ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/fortran/src/barrierf.c:83
#11 0x000000000041abe0 in pcgmp () at ./nextg.f90:1452
#12 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#13 0x00000000004089ee in main ()
#14 0x00002aebe7585d1d in __libc_start_main (main=0x4089b0 , argc=5,
ubp_av=0x7fffe0f000d8, init=, fini=,
rtld_fini=, stack_end=0x7fffe0f000c8) at libc-start.c:226
#15 0x00000000004088e9 in _start ()

Process 1:

#0 0x00007fff78dfba11 in clock_gettime ()
#1 0x00002b1932575e46 in clock_gettime (clock_id=4, tp=0x7fff78d55a50)
at …/sysdeps/unix/clock_gettime.c:116
#2 0x00002b19333621ce in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002b1932dca394 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002b1932ce968f in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002b1932cd9950 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002b1932ccd95f in ?? () from /usr/lib64/libcuda.so.1
#7 0x00002b19329a4d73 in ?? ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#8 0x00002b19329c283d in cudaDeviceSynchronize ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#9 0x000000000045d059 in cudadevicesynchronize_ ()
#10 0x000000000041abca in pcgmp () at ./nextg.f90:1452
#11 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#12 0x00000000004089ee in main ()
#13 0x00002b1933bfbd1d in __libc_start_main (main=0x4089b0 , argc=8,
ubp_av=0x7fff78d56cc8, init=, fini=,
rtld_fini=, stack_end=0x7fff78d56cb8) at libc-start.c:226
#14 0x00000000004088e9 in _start ()
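
For context, the part of pcgmp that both backtraces point at (nextg.f90:1452) is essentially the pattern below, heavily simplified and with made-up names, just to show the structure:

! Heavily simplified sketch of the structure around nextg.f90:1452;
! the kernel, subroutine, and variable names here are made up.
module iterate_mod
  use cudafor
  implicit none
contains
  attributes(global) subroutine iterate_kernel(a, n)
    real, device :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) + 1.0
  end subroutine iterate_kernel
end module iterate_mod

subroutine pcgmp_step(a_d, n)
  use cudafor
  use iterate_mod
  implicit none
  include 'mpif.h'
  real, device :: a_d(*)
  integer :: n, istat, ierr

  ! launch this iteration's kernel on the rank's GPU
  call iterate_kernel<<<(n + 255)/256, 256>>>(a_d, n)

  ! wait for the GPU to finish this step ...
  istat = cudaDeviceSynchronize()         ! Process 1 is stuck here

  ! ... then synchronize the two ranks before the next iteration
  call MPI_Barrier(MPI_COMM_WORLD, ierr)  ! Process 0 is stuck here
end subroutine pcgmp_step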

PS: I have been using PGI Fortran 13.3 and 14.3 with MPICH 1.2.7 on both Windows 7 64-bit and RHEL 6.2 64-bit. I get the same results on both configurations, and I have tried both 2x 470 and 2x 570 GPUs with different CPUs (i5 and Xeon), so I think it's NOT hardware/OS/driver related!

Hi epewee,

Sorry, but there’s not enough information here to make a diagnosis. Given your description, you’re probably correct that it is a problem with your program rather than any hardware/OS issue.

Both processes are stuck somewhere in “pcgmp”. Process 0 is in a barrier, which explains why it’s stuck, but Process 1 is in cudaDeviceSynchronize. I’m not sure if it’s really hung in cudaDeviceSynchronize or if this just happened to be where you stopped it in the debugger.

What’s happening in “pcgmp”? Is process 1 in some type of loop and never hitting the barrier? Did the kernel crash and put the device in a bad state? Does the program work if you use 1 process?
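
On the “bad state” question: one way to check is to test the CUDA status codes after the launch and again after the synchronize; a non-zero status from cudaDeviceSynchronize that persists usually means a kernel died and left the device unusable. Here is a minimal CUDA Fortran sketch (the kernel launch is only a placeholder, not your code):

! Minimal status-checking sketch; the kernel launch below is a placeholder.
program check_kernel_status
  use cudafor
  implicit none
  integer :: istat

  ! ... launch the suspect kernel here, e.g.
  ! call my_kernel<<<grid, block>>>(args)

  ! catch launch errors (bad configuration, out of resources, ...)
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'launch error: ', cudaGetErrorString(istat)

  ! catch errors raised while the kernel runs (illegal address, ...);
  ! if the device is in a bad state, this stays non-zero on later calls too
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, 'kernel error: ', cudaGetErrorString(istat)
end program check_kernel_status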

  • Mat

Hello and thanks for the reply! I was away and had not seen your post.

We just made a 1 GPU / 1 CPU version of the code in order to investigate the issue further. The code DOES hang at the same point even with no MPI involved, i.e. a single one of the aforementioned processes on 1 GPU / 1 CPU.

As far as your questions are concerned, the code had been hanging like that for hours, so the debugger was not randomly at cudaDeviceSynchronize; that’s where it actually hung. How can I see if the kernel crashed and put the GPU in a bad state? pcgmp is iterative, yes, but it has completed millions of iterations before hanging. Running the code for 10000 timesteps takes around 30 hours of real time; sometimes the run hangs at 30 minutes, other times at 29.9 hours, and sometimes it doesn’t hang at all, completing and exiting normally.

PS: Since you seem to be here to actually help with deep insight, do you have any recommendations for an adequate debugging tool? Is Nsight usable with CUDA Fortran at the moment, or does it require -Mcuda=emu?

Update: At the moment I am running cuda-memcheck to check for memory errors, race conditions, and leaks…

Hi epewee,

While I don’t know for sure, given your description it sounds like a problem where not all the threads/kernels are reaching a barrier, either in a kernel (syncthreads) or from a cudaDeviceSynchronize call.

Do you call “syncthreads” in your kernels? If so, is it under a conditional branch?
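
To make the hazard concrete, here is a contrived CUDA Fortran kernel (not taken from your code; the names are made up) where syncthreads sits inside a branch that not every thread of the block takes. The threads that skip the branch never arrive at the barrier, the rest wait forever, and on the host it looks exactly like a cudaDeviceSynchronize that never returns:

! Contrived illustration of syncthreads under a divergent branch; names are made up.
module sync_hazard
  use cudafor
  implicit none
contains
  attributes(global) subroutine bad_sync(a, n)
    real, device :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
       a(i) = 2.0 * a(i)
       ! HAZARD: only threads with i <= n reach this barrier. If n is not a
       ! multiple of blockDim%x, the last block has threads that skip the
       ! branch, the barrier never completes, and the host hangs waiting.
       call syncthreads()
    end if
    ! Safe pattern: do the guarded work inside the branch, but call
    ! syncthreads() outside it so every thread of the block executes it.
  end subroutine bad_sync
end module sync_hazard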

For on-device debugging you would need to use Allinea DDT (Arm Forge | Cross Platform Parallel Debugger for C++ and CUDA – Arm®). Otherwise, compile in emulation mode (-Mcuda=emu) and run under PGI’s debugger, PGDBG, which is included with your compilers.

  • Mat