Deadlock in hybrid OpenMP/CUDA code

Hi,

I have a hybrid OpenMP/CUDA code that deadlocks very often. All CUDA calls are made from a single critical section in the code. Any thread may make the CUDA calls, and the CUDA kernels run in several streams. When the deadlock occurs, the gdb backtrace shows the following:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fb77c0 (LWP 6549))]
#0  0x00007ffff587250a in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
(gdb) bt
#0  0x00007ffff587250a in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#1  0x00007ffff586986d in GOMP_critical_start () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x0000000000443de6 in ComputeDensities (row_basis=..., col_basis=..., common_in_generator=..., ll0_jj0vec=std::vector of length 5, capacity 8 = {...}, k0max_LL0=std::vector of length 5, capacity 5 = {...}, 
    ntensors_LL0=std::vector of length 5, capacity 5 = {...}, a0max=1, ir0=..., ipjpindices=std::vector of length 14571, capacity 14571 = {...}, tensor_rmes=std::vector of length 22688, capacity 22688 = {...}, 
    vleft=std::vector of length 1, capacity 1 = {...}, vright=std::vector of length 1, capacity 1 = {...}, cuda_vleft=0x7014c0000, cuda_vright=0x702800000, n_vectors=1, 
    su3so3cgs=std::vector of length 5, capacity 5 = {...}, wig9j_cached=std::vector of length 5, capacity 5 = {...}, wig9s_cached=std::vector of length 7, capacity 7 = {...}, 
    global_densities=std::vector of length 1, capacity 1 = {...}, gpu_blocks=@0x7fffffffd104: 9860, cpu_blocks=@0x7fffffffd100: 22) at /home/oberhuber/workspace/su3dense/programs/su3dense/su3dense.h:195
#3  0x0000000000445eaa in main._omp_fn.0(void) () at /home/oberhuber/workspace/su3dense/programs/su3dense/su3dense.h:419
#4  0x00007ffff586ccbf in GOMP_parallel () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#5  0x00000000004451e6 in main (argc=13, argv=0x7fffffffdb68) at /home/oberhuber/workspace/su3dense/programs/su3dense/su3dense.h:417

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fffefc7f700 (LWP 6553))]
#0  0x00007ffff5389cd8 in accept4 (fd=9, addr=..., addr_len=0x7fffefc7ed98, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40
40      ../sysdeps/unix/sysv/linux/accept4.c: No such file or directory.
(gdb) bt
#0  0x00007ffff5389cd8 in accept4 (fd=9, addr=..., addr_len=0x7fffefc7ed98, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:40
#1  0x00007fffefe3e706 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fffefe3223d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffefe3f0f8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffff7bc16ba in start_thread (arg=0x7fffefc7f700) at pthread_create.c:333
#5  0x00007ffff538882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb) thread 3
[Switching to thread 3 (Thread 0x7fffef47e700 (LWP 6554))]
#0  0x00007ffff537cb5d in poll () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff537cb5d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fffefe3d853 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fffefea092e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffefe3f0f8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffff7bc16ba in start_thread (arg=0x7fffef47e700) at pthread_create.c:333
#5  0x00007ffff538882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

(gdb) thread 4
[Switching to thread 4 (Thread 0x7fffeec7d700 (LWP 6555))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185     ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fffefe3fbad in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fffefe04ae4 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffefe3f0f8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffff7bc16ba in start_thread (arg=0x7fffeec7d700) at pthread_create.c:333
#5  0x00007ffff538882d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The backtrace of thread 1 is clear to me: this thread is waiting to enter the critical section. However, I do not understand the backtraces of threads 2, 3 and 4. I do not see any function from my code in them; the threads appear to be stuck inside the CUDA library. As I said, all CUDA calls are made from within the critical section. Could anybody give me a hint?
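
To make the structure concrete, here is a minimal sketch of the pattern described above. It is not the actual su3dense code: run_blocks and submit_gpu_block are hypothetical names, and submit_gpu_block is only a stand-in for the real CUDA work (kernel launches on a stream, asynchronous copies and so on), so the sketch compiles without the CUDA toolkit.

```cpp
#include <vector>

// Hypothetical stand-in for the real CUDA calls (kernel launch on a stream,
// asynchronous memory copies, ...). Here it only records the block index.
static void submit_gpu_block(int block, std::vector<int>& submitted) {
    submitted.push_back(block);
}

// Assumed structure: every OpenMP thread processes blocks, and all GPU
// submission is funnelled through one critical section.
int run_blocks(int n_blocks) {
    std::vector<int> submitted;  // shared between threads, guarded by the critical section

    #pragma omp parallel for schedule(dynamic)
    for (int block = 0; block < n_blocks; ++block) {
        // ... CPU-side preparation for this block would go here ...

        #pragma omp critical
        {
            // Only one thread at a time reaches this point.
            submit_gpu_block(block, submitted);
        }
    }
    return static_cast<int>(submitted.size());
}
```

Calling run_blocks(8) submits eight blocks, one at a time, regardless of how many threads the parallel loop uses. Built without OpenMP support the pragmas are ignored and the loop simply runs serially.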

Thanks, Tomas.