Hello, I am pasting the backtraces of two processes running a CUDA Fortran + MPI code on 2 GPUs with 2 CPU cores. Under CentOS 6.2, ps aux shows one process in the sleeping state (Process 0) and one running (Process 1). The code hangs after millions of iterations, always at the same part of the code but at a random iteration (it could be the 10th or the 1000000th), so I suspect some kind of deadlock. Could someone please tell me if there is anything suspicious in the backtraces of the two processes? I think Process 0 is waiting for the outcome of Process 1 in order to proceed, and that is why it hangs!
Process 0:
#0 0x00002aebe76485e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x000000000043c673 in socket_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_sock_sr.c:270
#2 0x000000000044ce9c in recv_message ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:181
#3 0x000000000044cd25 in p4_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:115
#4 0x00000000004534ee in MPID_CH_Check_incoming () at ./chchkdev.c:73
#5 0x000000000044efc5 in MPID_RecvComplete () at ./adi2recv.c:185
#6 0x000000000044621b in PMPI_Waitall ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/waitall.c:190
#7 0x00000000004465c3 in PMPI_Sendrecv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/sendrecv.c:95
#8 0x000000000042d154 in intra_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/intra_fns_new.c:248
#9 0x0000000000427fe7 in PMPI_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/barrier.c:66
#10 0x0000000000420e93 in pmpi_barrier_ ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/fortran/src/barrierf.c:83
#11 0x000000000041abe0 in pcgmp () at ./nextg.f90:1452
#12 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#13 0x00000000004089ee in main ()
#14 0x00002aebe7585d1d in __libc_start_main (main=0x4089b0 <main>, argc=5,
    ubp_av=0x7fffe0f000d8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffe0f000c8) at libc-start.c:226
#15 0x00000000004088e9 in _start ()
Process 1:
#0 0x00007fff78dfba11 in clock_gettime ()
#1 0x00002b1932575e46 in clock_gettime (clock_id=4, tp=0x7fff78d55a50)
at ../sysdeps/unix/clock_gettime.c:116
#2 0x00002b19333621ce in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002b1932dca394 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002b1932ce968f in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002b1932cd9950 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002b1932ccd95f in ?? () from /usr/lib64/libcuda.so.1
#7 0x00002b19329a4d73 in ?? ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#8 0x00002b19329c283d in cudaDeviceSynchronize ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#9 0x000000000045d059 in cudadevicesynchronize_ ()
#10 0x000000000041abca in pcgmp () at ./nextg.f90:1452
#11 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#12 0x00000000004089ee in main ()
#13 0x00002b1933bfbd1d in __libc_start_main (main=0x4089b0 <main>, argc=8,
    ubp_av=0x7fff78d56cc8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fff78d56cb8) at libc-start.c:226
#14 0x00000000004088e9 in _start ()
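
For context, both ranks are inside pcgmp at the same source line (nextg.f90:1452), so the traces imply that this spot is essentially a device synchronization immediately followed by a barrier. A minimal sketch of that pattern (reconstructed from the backtraces; the routine and variable names here are illustrative, not the real code):

! Sketch of what the backtraces imply sits at/near nextg.f90:1452.
! Names are illustrative; only the cudaDeviceSynchronize -> MPI_Barrier
! sequence is taken from the traces.
subroutine pcgmp_sync(ierr)
  use cudafor                ! PGI CUDA Fortran runtime
  implicit none
  include 'mpif.h'           ! MPICH 1.2.7 Fortran bindings
  integer :: ierr, istat

  ! ... GPU kernels launched earlier in the iteration ...

  istat = cudaDeviceSynchronize()          ! Process 1 is blocked in here
  call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! Process 0 is blocked in here
end subroutine pcgmp_sync

Since MPI_Barrier cannot return until every rank has entered it, Process 0 sitting in the barrier is expected once Process 1 stalls; the real question seems to be why Process 1's cudaDeviceSynchronize never returns.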
PS: I have been using PGI Fortran 13.3 and 14.3 with MPICH 1.2.7, on both Windows 7 64-bit and RHEL 6.2 64-bit, with the same result on both configurations. I have also tried both 2x470 and 2x570 GPU setups with different CPUs (i5 and Xeon), so I think it is NOT hardware/OS/driver related!
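
In case it helps with diagnosis, this is the kind of instrumentation I can add around the suspect line to confirm which call stalls and at which iteration (a debugging sketch only; myrank and iter are assumed to exist in pcgmp):

! Hypothetical logging around nextg.f90:1452; flush so output survives a hang.
write(*,*) 'rank', myrank, ': before devsync, iter', iter
call flush(6)
istat = cudaDeviceSynchronize()
write(*,*) 'rank', myrank, ': before barrier, iter', iter
call flush(6)
call MPI_Barrier(MPI_COMM_WORLD, ierr)
write(*,*) 'rank', myrank, ': after barrier, iter', iter
call flush(6)

If the last line printed before the hang is rank 1's 'before devsync', that would confirm the barrier itself is behaving normally and the stall is on the GPU side.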