CUDA Fortran + MPI Debug help!

Dear community,
I have been struggling to debug a hanging MPI + CUDA Fortran code for the past 5 months. I am desperate, so I have decided to post here. Process 0 hangs waiting for Process 1 to produce a result (interruptible sleep vs. running in “ps aux” under RHEL 6.2). Here are the backtraces of the two processes (the sleeping one first, then the running one):

#0 0x00002aebe76485e3 in select () at …/sysdeps/unix/syscall-template.S:82
#1 0x000000000043c673 in socket_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_sock_sr.c:270
#2 0x000000000044ce9c in recv_message ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:181
#3 0x000000000044cd25 in p4_recv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/mpid/ch_p4/p4/lib/p4_tsr.c:115
#4 0x00000000004534ee in MPID_CH_Check_incoming () at ./chchkdev.c:73
#5 0x000000000044efc5 in MPID_RecvComplete () at ./adi2recv.c:185
#6 0x000000000044621b in PMPI_Waitall ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/waitall.c:190
#7 0x00000000004465c3 in PMPI_Sendrecv ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/pt2pt/sendrecv.c:95
#8 0x000000000042d154 in intra_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/intra_fns_new.c:248
#9 0x0000000000427fe7 in PMPI_Barrier ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/coll/barrier.c:66
#10 0x0000000000420e93 in pmpi_barrier_ ()
at /home/sw/cdk/cdk/mpich-1.2.7/mpich-1.2.7/src/fortran/src/barrierf.c:83
#11 0x000000000041abe0 in pcgmp () at ./nextg.f90:1452
#12 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#13 0x00000000004089ee in main ()
#14 0x00002aebe7585d1d in __libc_start_main (main=0x4089b0 , argc=5,
ubp_av=0x7fffe0f000d8, init=, fini=,
rtld_fini=, stack_end=0x7fffe0f000c8) at libc-start.c:226
#15 0x00000000004088e9 in _start ()

#0 0x00007fff78dfba11 in clock_gettime ()
#1 0x00002b1932575e46 in clock_gettime (clock_id=4, tp=0x7fff78d55a50)
at …/sysdeps/unix/clock_gettime.c:116
#2 0x00002b19333621ce in ?? () from /usr/lib64/libcuda.so.1
#3 0x00002b1932dca394 in ?? () from /usr/lib64/libcuda.so.1
#4 0x00002b1932ce968f in ?? () from /usr/lib64/libcuda.so.1
#5 0x00002b1932cd9950 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00002b1932ccd95f in ?? () from /usr/lib64/libcuda.so.1
#7 0x00002b19329a4d73 in ?? ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#8 0x00002b19329c283d in cudaDeviceSynchronize ()
from /opt/pgi/linux86-64/2013/cuda/4.2/lib64/libcudart.so.4
#9 0x000000000045d059 in cudadevicesynchronize_ ()
#10 0x000000000041abca in pcgmp () at ./nextg.f90:1452
#11 0x0000000000411c52 in MAIN () at ./nextg.f90:796
#12 0x00000000004089ee in main ()
#13 0x00002b1933bfbd1d in __libc_start_main (main=0x4089b0 , argc=8,
ubp_av=0x7fff78d56cc8, init=, fini=,
rtld_fini=, stack_end=0x7fff78d56cb8) at libc-start.c:226
#14 0x00000000004088e9 in _start ()
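
Both traces end up at the same source line, nextg.f90:1452, inside pcgmp. To show the shape of what is going on there, here is a minimal sketch of that pattern (NOT my actual code; the module, kernel name, sizes, and variables are placeholders): a kernel launch, then cudaDeviceSynchronize, then MPI_Barrier.

! Sketch only, not the real code: just the pattern the traces point at
! (kernel launch -> cudaDeviceSynchronize -> MPI_Barrier).
module sketch_kernels
  use cudafor
contains
  attributes(global) subroutine add_one(a, n)
    real :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) + 1.0
  end subroutine add_one
end module sketch_kernels

program pcgmp_sketch
  use cudafor
  use sketch_kernels
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1024
  real, device :: a_d(n)
  integer :: ierr, rank, istat

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  a_d = 0.0
  call add_one<<<(n + 255)/256, 256>>>(a_d, n)

  ! the running process sits here, polling inside the driver (second trace) ...
  istat = cudaDeviceSynchronize()

  ! ... while the other rank has reached the barrier and sleeps in select() (first trace)
  call MPI_Barrier(MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program pcgmp_sketch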

I cannot use Nsight for profiling since it is an MPI + CUDA run, so I am just attaching gdb to the processes. I have access to the hanging process right now, so any more suggested tools for gathering information would be appreciated!
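
For anyone who wants to reproduce this, attaching gdb to a rank and dumping the stacks looks roughly like this (the PIDs are whatever “ps aux” shows for the two ranks):

ps aux | grep nextg        # find the PIDs of the two ranks
gdb -p <pid>               # attach to one rank
(gdb) thread apply all bt  # dump all thread stacks
(gdb) detach
(gdb) quit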

I have been using PGI Fortran 13.3 and 14.3 with MPICH 1.2.7, on both Windows 7 64-bit and RHEL 6.2 64-bit. I get the same result on both configurations, and I have tried both 2x 470 and 2x 570 GPUs with different CPUs (i5 and Xeon), so I think it is NOT hardware/OS/driver related.