CUDA introduces heavy locks?

I’m using tensorflow to do an online serving. There are multiple threads processing incoming queries and call session run separately. But when the number of incoming queries reaches a certain amount, the performance is not increasing anymore, even though there are more queries to process. The CPU is not full, the GPU is not full either. (GPU utilization ~50-60%). I try to pstack the process, and I found that it seems the cuda or the cuda driver introduces a very heavy lock. Here’s a snippet of what I found:

Thread 101 (Thread 0x7edd5af5d700 (LWP 121193)):
#0 0x00007f2fd1dd642d in __lll_lock_wait () from /lib64/
#1 0x00007f2fd1dd1de6 in _L_lock_870 () from /lib64/
#2 0x00007f2fd1dd1cdf in pthread_mutex_lock () from /lib64/
#3 0x00007f2fa7a29446 in ?? () from /usr/lib64/nvidia/
#4 0x00007f2fa7a29478 in ?? () from /usr/lib64/nvidia/
#5 0x00007f2fa79484e0 in ?? () from /usr/lib64/nvidia/
#6 0x00007f2fa794b545 in ?? () from /usr/lib64/nvidia/
#7 0x00007f2fa7a867c2 in cuMemcpyHtoDAsync_v2 () from /usr/lib64/nvidia/
#8 0x00007f2fb97ff8f3 in perftools::gputools::cuda::CUDADriver::AsynchronousMemcpyH2D (context=, gpu_dst=1108569878528, host_src=0x102160dde00, size=4320, stream=0x7ee102c847d0) at tensorflow/stream_executor/cuda/

There are “in ?? () from /usr/lib64/nvidia/” all over the place. And half of the threads are blocked by the lock.
So I’m wondering if cuda driver introduces a lock while launching kernel and not release it until the kernel returns.
And I’d like to know if there’s anyway I can alleviate the side-effect of the lock if it really exists? Since the GPU is far from being well loaded, there’s a lot more computing power that I can squeeze, but hindered by the lock.

The hardware is P100. The software is cuda 8.0, driver version 375.26.

Thanks for any comments.

is every operation you are requesting happening in a dedicated stream, or is something occuring in the default stream also? The default stream would enforce synchronization as far as I know - and that could block the other streams temporarily. The CUDA profiler could show you what is going on.

Are you using by chance non page locked host buffers for cudaMemcpyAsync()? This could make an async memcpy synchronous, in fact.

I am not very familiar with P100 in detail, but there may be restrictions that could apply

  • the number of asynchronous copy engines
  • the length of the command and kernel launch queues to the device (there may be hardware or driver limits)

I would expect blocking to occur when such a limit is hit.


Thanks @cbuchner1 for very insightful comments! A tensorflow session is using only one compute stream for all compute kernels, I assume. In fact, I can further increase the performance and the GPU utilization by launching multiple GPU contexts (modify the tf source code to bind each session to a different GPU context and use multiple sessions) and switching between them. Does this fact confirm that we are hitting some hardware/driver limits?

The problem with using multiple contexts is that, the cost switching contexts is high, and we can not actually benefit from overlapping the data transfer with computation. Because when one context is active, the others are inactive, for pascal architectures if I get it right. In my case, the data transfer from host to device is heavy and may reach 20-30% of GPU time.

Do you have any suggestions for optimizing my case?

Thanks again for your help.

You can raise your issue with the Tensorflow developers (maybe there is a mailing list for discussing such topics).

What I would find excellent (from a developer’s perspective) is a change to the CUDA API so that you can issue a cudaSetCurrentStream() so all successive API calls default to operating within the given stream. This would be ideal to make non-stream aware libraries run on the desired stream without having to change the library code (which sometimes isn’t even possible, for closed source products at least).