Multi-GPU Linux app hangs when creating FFT plan

Hi,
I have a multi-threaded Linux app running on an 8-GPU machine with 36 CPU cores (hyperthreading disabled). Core isolation is enabled. The app creates 34 pthreads, each pinned to its own dedicated CPU core, leaving 2 cores for kernel scheduling. Of the 34 threads, 8 are tied to the GPUs, one per GPU. Each of these 8 threads, at the same time, launches FFT plan creation followed by some FFT computation. The app hangs at this point.
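
For reference, here is a minimal sketch of what each of the 8 GPU threads does (the FFT size, batch count, data types, and error handling below are placeholders I picked for illustration, not our actual code; thread-to-core pinning is also left out):

#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufftXt.h>

#define NUM_GPUS 8

/* Placeholder FFT size -- not our actual parameters. */
static long long fft_size[1] = { 4096 };

static void *gpu_worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);            /* bind this thread to its GPU */

    cufftHandle plan;
    cufftCreate(&plan);

    size_t work_size = 0;
    /* All 8 threads reach this call at roughly the same time;
       this is where the hang shows up. */
    cufftResult r = cufftXtMakePlanMany(plan, 1, fft_size,
                                        NULL, 1, 0, CUDA_C_32F,
                                        NULL, 1, 0, CUDA_C_32F,
                                        1, &work_size, CUDA_C_32F);
    if (r != CUFFT_SUCCESS)
        fprintf(stderr, "GPU %d: plan creation failed (%d)\n", dev, r);

    /* ... FFT computation via cufftXtExec() would follow here ... */

    cufftDestroy(plan);
    return NULL;
}

int main(void)
{
    pthread_t tids[NUM_GPUS];
    for (long i = 0; i < NUM_GPUS; i++)
        pthread_create(&tids[i], NULL, gpu_worker, (void *)i);
    for (int i = 0; i < NUM_GPUS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}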

Seven of the 8 threads have the following user stack trace:
#0 0x00007f409296637d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f409295feda in pthread_mutex_lock () from /lib64/libpthread.so.0
#2 0x00007f4086ef74ed in cufftXtMakePlanMany () from /usr/local/cuda/lib64/libcufft.so.9.1

and kernel stack trace (/proc/<pid>/stack):
[] futex_wait_queue_me+0xb4/0xe0
[] futex_wait+0x112/0x240
[] do_futex+0x310/0xb50
[] SyS_futex+0x74/0x150
[] entry_SYSCALL_64_fastpath+0x18/0xa8
[] 0xffffffffffffffff

The 8th thread has the following user stack trace:
#0 0x00007f40856646a7 in ioctl () from /lib64/libc.so.6
#1 0x00007f406368499a in ?? () from /lib64/libcuda.so.1
#2 0x00007f4063685ec2 in ?? () from /lib64/libcuda.so.1
#3 0x00007f4063689782 in ?? () from /lib64/libcuda.so.1
#4 0x00007f4063677712 in ?? () from /lib64/libcuda.so.1
#5 0x00007f4063682d62 in ?? () from /lib64/libcuda.so.1
#6 0x00007f406368464c in ?? () from /lib64/libcuda.so.1
#7 0x00007f4063684703 in ?? () from /lib64/libcuda.so.1
#8 0x00007f4063619174 in ?? () from /lib64/libcuda.so.1
#9 0x00007f4063625689 in ?? () from /lib64/libcuda.so.1
#10 0x00007f4063625e4a in ?? () from /lib64/libcuda.so.1
#11 0x00007f4063530256 in ?? () from /lib64/libcuda.so.1
#12 0x00007f406352ca39 in ?? () from /lib64/libcuda.so.1
#13 0x00007f40635479f9 in ?? () from /lib64/libcuda.so.1
#14 0x00007f4063493c9d in ?? () from /lib64/libcuda.so.1
#15 0x00007f4063493fb0 in ?? () from /lib64/libcuda.so.1
#16 0x00007f40870abd6d in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#17 0x00007f40870a11d0 in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#18 0x00007f40870af4a6 in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#19 0x00007f40870b2371 in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#20 0x00007f40870a534c in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#21 0x00007f408708cf3e in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#22 0x00007f40870c54c4 in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#23 0x00007f4086ee8d92 in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#24 0x00007f4086ee99b6 in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#25 0x00007f4086eeac4a in ?? () from /usr/local/cuda/lib64/libcufft.so.9.1
#26 0x00007f4086eeaf00 in cufftLockPlan () from /usr/local/cuda/lib64/libcufft.so.9.1
#27 0x00007f4086ef755a in cufftXtMakePlanMany () from /usr/local/cuda/lib64/libcufft.so.9.1

And kernel stack trace:
[] 0xffffffffffffffff

When I examine /proc/<pid>/status for those 8 threads, all are in the sleeping (S) state. These threads never leave this state, and the app hangs forever.

This happens roughly once every few runs; otherwise, the app completes with correct results.

It appears that the 8th thread acquired the mutex and then got stuck in an ioctl call into the driver, while the remaining 7 threads are blocked trying to acquire the same mutex and wait forever.
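
If that theory is right, serializing the plan-creation calls with an application-level mutex (leaving the FFT execution itself concurrent) might sidestep the hang. Here is a minimal sketch of the workaround I have in mind; make_plan_serialized and plan_lock are names I made up, and the plan parameters are the same placeholders as in the sketch above:

#include <pthread.h>
#include <cufftXt.h>

/* App-level lock so only one thread creates a plan at a time. */
static pthread_mutex_t plan_lock = PTHREAD_MUTEX_INITIALIZER;

static cufftResult make_plan_serialized(cufftHandle plan, long long *n,
                                        size_t *work_size)
{
    pthread_mutex_lock(&plan_lock);
    cufftResult r = cufftXtMakePlanMany(plan, 1, n,
                                        NULL, 1, 0, CUDA_C_32F,
                                        NULL, 1, 0, CUDA_C_32F,
                                        1, work_size, CUDA_C_32F);
    pthread_mutex_unlock(&plan_lock);
    return r;
}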

Has anyone seen this before?

Any help greatly appreciated!

Thanks

y_gpu