I’m using TensorFlow for online serving. Multiple threads process incoming queries, and each thread calls session.run() separately. But once the number of incoming queries reaches a certain level, throughput stops increasing, even though there are more queries to process. The CPU is not saturated, and neither is the GPU (utilization ~50-60%). When I pstack the process, it looks like CUDA, or the CUDA driver, introduces a very heavy lock. Here’s a snippet of what I found:
Thread 101 (Thread 0x7edd5af5d700 (LWP 121193)):
#0 0x00007f2fd1dd642d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f2fd1dd1de6 in _L_lock_870 () from /lib64/libpthread.so.0
#2 0x00007f2fd1dd1cdf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f2fa7a29446 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4 0x00007f2fa7a29478 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#5 0x00007f2fa79484e0 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#6 0x00007f2fa794b545 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#7 0x00007f2fa7a867c2 in cuMemcpyHtoDAsync_v2 () from /usr/lib64/nvidia/libcuda.so.1
#8 0x00007f2fb97ff8f3 in perftools::gputools::cuda::CUDADriver::AsynchronousMemcpyH2D (context=, gpu_dst=1108569878528, host_src=0x102160dde00, size=4320, stream=0x7ee102c847d0) at tensorflow/stream_executor/cuda/cuda_driver.cc:1228
There are frames like “in ?? () from /usr/lib64/nvidia/libcuda.so.1” all over the place, and half of the threads are blocked on this lock.
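To be concrete, the serving side is shaped roughly like the sketch below. All names here are illustrative stand-ins (fake_session_run plays the role of sess.run(); the real model, graph, and feed construction are omitted) — the point is just that several threads hit the shared session concurrently:

```python
import threading
import queue

def fake_session_run(query):
    # Stand-in for sess.run(fetches, feed_dict={...}); in the real
    # service this is what triggers the H2D copies in the stack trace.
    return query * 2

requests = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # Each worker pulls a query and calls the session independently.
    while True:
        q = requests.get()
        if q is None:  # shutdown sentinel
            break
        out = fake_session_run(q)
        with results_lock:
            results.append(out)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for q in range(8):
    requests.put(q)
for _ in threads:
    requests.put(None)
for t in threads:
    t.join()
print(sorted(results))
```

With this pattern, every in-flight query is a separate run() call, so the number of concurrent CUDA driver calls grows with the number of worker threads.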
So I’m wondering whether the CUDA driver takes a lock while launching a kernel and doesn’t release it until the call returns.
And I’d like to know whether there’s any way to alleviate the side effects of this lock, if it really exists. Since the GPU is far from fully loaded, there’s a lot more compute I could squeeze out of it, but the lock is getting in the way.
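One workaround I’m considering is to stop calling run() from every thread and instead funnel requests through a single dispatcher that micro-batches them, so there’s only one caller hitting the driver at a time. A rough sketch (batched_run is a hypothetical stand-in for sess.run() on a batched feed; the doubling is a placeholder for the model’s forward pass):

```python
import queue
import threading
from concurrent.futures import Future

def batched_run(batch):
    # Placeholder for one sess.run() over a stacked/batched input.
    return [x * 2 for x in batch]

requests = queue.Queue()
STOP = object()

def dispatcher(max_batch=32):
    while True:
        item = requests.get()
        if item is STOP:
            break
        batch = [item]
        # Drain whatever else is already queued, up to max_batch,
        # so concurrent requests collapse into one run() call.
        while len(batch) < max_batch:
            try:
                nxt = requests.get_nowait()
            except queue.Empty:
                break
            if nxt is STOP:
                requests.put(STOP)  # re-queue sentinel, flush this batch
                break
            batch.append(nxt)
        inputs = [q for q, _ in batch]
        outputs = batched_run(inputs)  # single call for the whole batch
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)

t = threading.Thread(target=dispatcher)
t.start()

# Client threads enqueue a (query, Future) pair and wait on the Future.
futs = []
for q in range(5):
    f = Future()
    requests.put((q, f))
    futs.append(f)
print([f.result() for f in futs])
requests.put(STOP)
t.join()
```

I haven’t verified this actually avoids the driver contention — it only guarantees that at most one thread is inside run() at a time — but it seems like the obvious thing to try if the lock is per-call.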
The hardware is a P100. The software is CUDA 8.0, driver version 375.26.
Thanks for any comments.