Multi-threaded CUDA program hangs on V100 with CUDA 9.0

When I run my CUDA program with multiple threads on a single V100 using CUDA 9.0, it consistently hangs after several iterations. The hang occurs at a similar place each time: during cuLaunchKernel.
The gdb backtraces are as follows:

#0  0x00007fed698d8827 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fed17b4fcfd in ?? () from /usr/lib64/libcuda.so.1
#2  0x00007fed17b50494 in ?? () from /usr/lib64/libcuda.so.1
#3  0x00007fed17d6e0e3 in ?? () from /usr/lib64/libcuda.so.1
#4  0x00007fed17d6e2db in ?? () from /usr/lib64/libcuda.so.1
#5  0x00007fed17d6e367 in ?? () from /usr/lib64/libcuda.so.1
#6  0x00007fed17c19c31 in ?? () from /usr/lib64/libcuda.so.1
#7  0x00007fed17d49126 in ?? () from /usr/lib64/libcuda.so.1
#8  0x00007fed17b33ab6 in ?? () from /usr/lib64/libcuda.so.1
#9  0x00007fed17b33ce3 in ?? () from /usr/lib64/libcuda.so.1
#10 0x00007fed17c8bb70 in cuLaunchKernel () from /usr/lib64/libcuda.so.1
#11 0x00007fed23f4ee5b in cudart::cudaApiLaunchCommon(void const*, bool) () from /pp/dev/python/pp/fluid/core.so
#12 0x00007fed23f6c638 in cudaLaunch () from /pp/dev/python/pp/fluid/core.so
#13 0x00007fed23647af9 in op::AdamOp::Compute(fw:Context const&) const () from /pp/dev/python/pp/fluid/core.so

-------------

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fed69bc1e42 in __GI___pthread_mutex_lock (mutex=0x3a85c60) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007fed17b4462e in ?? () from /usr/lib64/libcuda.so.1
#3  0x00007fed17b4479a in ?? () from /usr/lib64/libcuda.so.1
#4  0x00007fed17c8a1f7 in cuMemsetD8Async () from /usr/lib64/libcuda.so.1
#5  0x00007fecdfec71ad in ?? () from /usr/local/cuda/lib64/libcublas.so
#6  0x00007fecdfe986de in ?? () from /usr/local/cuda/lib64/libcublas.so
#7  0x00007fecdfede161 in ?? () from /usr/local/cuda/lib64/libcublas.so
#8  0x00007fecdfe6e04f in ?? () from /usr/local/cuda/lib64/libcublas.so
#9  0x00007fecdfd2a7a8 in ?? () from /usr/local/cuda/lib64/libcublas.so
#10 0x00007fecdfd2babd in ?? () from /usr/local/cuda/lib64/libcublas.so
#11 0x00007fecdfd2c44a in ?? () from /usr/local/cuda/lib64/libcublas.so
#12 0x00007fecdfac03d5 in cublasSgemm_v2 () from /usr/local/cuda/lib64/libcublas.so
#13 0x00007fed226742c7 in void op::math::GEMM<float>(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, float, float const*, float const*, float, float*) const ()

There are two threads. It looks like a deadlock occurs when launching a CUDA kernel.

However, when I run the same program with CUDA 8.0, it works fine.

Maybe you should try CUDA 9.1, 9.2, or 10.0 (10.0 would probably be best). Bugs get fixed all the time.
You could also file a bug at developer.nvidia.com. You would be asked for a complete test case.

Thanks!
Does this mean that CUDA 9.0 has bugs when it is called from multiple threads?

When I use four threads, the gdb output is:

Thread 7 (Thread 0x7fbe972b6700 (LWP 79492)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fc09e4f2e42 in __GI___pthread_mutex_lock (mutex=0x39b1680) at ../nptl/pthread_mutex_lock.c:115
#2 0x00007fc04c556c8d in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc04c6aeb70 in cuLaunchKernel () from /usr/lib64/libcuda.so.1
#4 0x00007fc058980ecb in cudart::cudaApiLaunchCommon(void const*, bool) () from /tb_dev/python/tb/core.so
#5 0x00007fc05899e6a8 in cudaLaunch () from /tb_dev/python/tb/core.so
...


Thread 8 (Thread 0x7fbe968b5700 (LWP 79493)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fc09e4f2e42 in __GI___pthread_mutex_lock (mutex=0x39b1680) at ../nptl/pthread_mutex_lock.c:115
#2 0x00007fc04c554a91 in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc04c69c783 in cuFuncGetAttribute () from /usr/lib64/libcuda.so.1
#4 0x00007fc05897f519 in cudart::cudaApiFuncGetAttributes(cudaFuncAttributes*, void const*) () from /tb_dev/python/tb/core.so
#5 0x00007fc0589ada9f in cudaFuncGetAttributes () from /tb_dev/python/tb/core.so
#6 0x00007fc057c50ba4 in cudaError cub::DeviceReduce::Reduce, float const*, long>, float*, cub::Sum, float>(void*, unsigned long&, cub::TransformInputIterator, float const*, long>, float*, int, cub::Sum, float, CUstream_st*, bool) () from /tb_dev/python/tb/core.so
...


Thread 9 (Thread 0x7fbe95eb4700 (LWP 79494)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fc09e4f2e42 in __GI___pthread_mutex_lock (mutex=0x39b1680) at ../nptl/pthread_mutex_lock.c:115
#2 0x00007fc04c554a91 in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc04c69c783 in cuFuncGetAttribute () from /usr/lib64/libcuda.so.1
#4 0x00007fc05897f519 in cudart::cudaApiFuncGetAttributes(cudaFuncAttributes*, void const*) () from /tb_dev/python/tb/core.so
#5 0x00007fc0589ada9f in cudaFuncGetAttributes () from /tb_dev/python/tb/core.so
#6 0x00007fc0583f052b in thrust::cuda_cub::core::AgentLauncher, thrust::device_ptr, thrust::device_ptr, thrust::cuda_cub::__transform::no_stencil_tag, tb::op::AddFunctor, thrust::cuda_cub::__transform::always_true_predicate>, long> >::get_plan(CUstream_st*, void*) [clone .isra.585] () from /tb_dev/python/tb/core.so
...

No, I wouldn’t say that. Since you’ve provided no test case, I have no idea what is happening in your case.

I admit the possibility of bugs, and a typical bug-isolation process might include testing newer software stacks.

Thanks @Robert_Crovella. I have solved this problem, but I am not sure whether it is a bug in cuBLAS.

When I used cuda-gdb to find where the program hangs, I found that it is in gemm, and when I take a lock before calling gemm, the program no longer hangs. So I suspect that cuBLAS may not be thread-safe.
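
A minimal sketch of that kind of workaround, assuming one process-wide mutex taken around every gemm call. The names gemm_stub, locked_gemm and run_workers are hypothetical, and gemm_stub is only a CPU stand-in for the real cublasSgemm call so that the snippet builds without CUDA:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for cublasSgemm: C = A * B for n x n row-major matrices.
void gemm_stub(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& c, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// One lock shared by every worker thread.
std::mutex gemm_mutex;

// Only one thread at a time can be inside the gemm call.
void locked_gemm(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c, int n) {
    std::lock_guard<std::mutex> guard(gemm_mutex);
    gemm_stub(a, b, c, n);
}

// Starts num_threads workers that each call locked_gemm iters times on
// their own buffers, and returns the total number of completed calls.
int run_workers(int num_threads, int iters) {
    std::atomic<int> completed{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&completed, iters] {
            const int n = 8;
            std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n, 0.0f);
            for (int i = 0; i < iters; ++i) {
                locked_gemm(a, b, c, n);
                ++completed;
            }
        });
    }
    for (auto& w : workers) w.join();
    return completed.load();
}
```

Note that a lock like this serializes the gemm calls across threads, so it trades concurrency for safety and does not by itself show where the underlying problem is.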

The cuBLAS API is intended to be thread-safe. One suggestion is to use a separate cuBLAS handle for each thread.
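
The handle-per-thread pattern can be sketched roughly as below. This is only an illustration of the structure: HandleStub, thread_handle and count_distinct_handles are hypothetical names, and HandleStub stands in for a cublasHandle_t that real code would create with cublasCreate and release with cublasDestroy, so that the snippet builds without CUDA:

```cpp
#include <atomic>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in for a cuBLAS handle. In real code the constructor
// would call cublasCreate and the destructor would call cublasDestroy.
struct HandleStub {
    static std::atomic<int> created;  // number of handles created so far
    int id;                           // unique id assigned at creation
    HandleStub() : id(created.fetch_add(1)) {}
};
std::atomic<int> HandleStub::created{0};

// Returns the calling thread's own handle, created lazily on first use.
HandleStub& thread_handle() {
    thread_local HandleStub handle;
    return handle;
}

// Starts num_threads workers, records which handle each one uses, and
// returns how many distinct handles were seen.
int count_distinct_handles(int num_threads) {
    std::vector<int> ids(num_threads, -1);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&ids, t] {
            thread_handle();              // first use creates this thread's handle
            ids[t] = thread_handle().id;  // later uses return the same handle
        });
    }
    for (auto& w : workers) w.join();
    std::set<int> distinct(ids.begin(), ids.end());
    return static_cast<int>(distinct.size());
}
```

With this layout no handle is ever shared between threads, so each thread would pass its own handle to every cuBLAS call it makes.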

Again, it might be a bug in CUDA/CUBLAS, but it is impossible to say from what you have shown here.

I’m hitting the same, or at least a very similar, issue with CUDA 10. Like the OP, I’m making CUDA calls from multiple threads. Unlike the OP, I’m using PyTorch. Here are the tops of my backtraces:

Thread #1
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f5d2f7eb098 in __GI___pthread_mutex_lock (mutex=0x559660afa6d0) at ../nptl/pthread_mutex_lock.c:113
#2  0x00007f5cb1d98d55 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f5cb1d98f2e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f5cb1f1b060 in cuLaunchKernel () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007f5d002d323d in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#6  0x00007f5d002d32c7 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#7  0x00007f5d0030746b in cudaLaunchKernel () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#8  0x00007f5cd216df6d in void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&), &(void at::native::gpu_kernel_with_scalars<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&)), 2u>, float (float), __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const, float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&), &(void at::native::gpu_kernel_with_scalars<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const&)), 2u>, float (float), __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, c10::Scalar), &at::native::add_kernel_cuda, 4u>, float (float, float), float> const, float> const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
Thread #2
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f5d2f7eb098 in __GI___pthread_mutex_lock (mutex=0x559660afa6d0) at ../nptl/pthread_mutex_lock.c:113
#2  0x00007f5cb1d98d55 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f5cb1d98f2e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f5cb1f1b060 in cuLaunchKernel () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007f5d002d323d in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#6  0x00007f5d002d32c7 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#7  0x00007f5d0030746b in cudaLaunchKernel () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#8  0x00007f5cd1a99ce7 in void THC_transformReduceInnermostDimIndex<float, long, c10::TensorImpl, c10::TensorImpl, MaxValuePair<float, long> >(THCState*, c10::TensorImpl*, c10::TensorImpl*, c10::TensorImpl*, thrust::pair<float, long> const&, MaxValuePair<float, long>) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so
Thread #3
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f5d2f7eb098 in __GI___pthread_mutex_lock (mutex=0x559660afa6d0) at ../nptl/pthread_mutex_lock.c:113
#2  0x00007f5cb1d98d55 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f5cb1d98f2e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f5cb1f1b060 in cuLaunchKernel () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007f5d002d323d in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#6  0x00007f5d002d32c7 in ?? () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#7  0x00007f5d0030746b in cudaLaunchKernel () from /opt/conda/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.10.0
#8  0x00007f5cd2239089 in void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &(void at::native::copy_kernel_impl<float, float>(at::TensorIterator&)), 1u>, float (float)> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &(void at::native::copy_kernel_impl<float, float>(at::TensorIterator&)), 1u>, float (float)> const&) () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch.so

This is happening in some fairly complex code, so I’m afraid I don’t have a minimal example. I also can’t use cuda-gdb due to a bug. I’ll try mutexes and separate handles (if PyTorch supports that), but I suspect the answer is going to be ‘don’t multithread PyTorch CUDA’.