CUDA introduces heavy locks?

I’m using TensorFlow for online serving. Multiple threads process incoming queries and call Session.run() separately. But once the number of incoming queries reaches a certain level, throughput stops increasing, even though there are more queries to process. Neither the CPU nor the GPU is saturated (GPU utilization is ~50-60%). I ran pstack on the process, and it looks like CUDA or the CUDA driver is taking a very heavy lock. Here’s a snippet of what I found:

Thread 101 (Thread 0x7edd5af5d700 (LWP 121193)):
#0 0x00007f2fd1dd642d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f2fd1dd1de6 in _L_lock_870 () from /lib64/libpthread.so.0
#2 0x00007f2fd1dd1cdf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f2fa7a29446 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4 0x00007f2fa7a29478 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#5 0x00007f2fa79484e0 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#6 0x00007f2fa794b545 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#7 0x00007f2fa7a867c2 in cuMemcpyHtoDAsync_v2 () from /usr/lib64/nvidia/libcuda.so.1
#8 0x00007f2fb97ff8f3 in perftools::gputools::cuda::CUDADriver::AsynchronousMemcpyH2D (context=, gpu_dst=1108569878528, host_src=0x102160dde00, size=4320, stream=0x7ee102c847d0) at tensorflow/stream_executor/cuda/cuda_driver.cc:1228

There are “in ?? () from /usr/lib64/nvidia/libcuda.so.1” frames all over the place, and half of the threads are blocked on the lock.
So I’m wondering whether the CUDA driver takes a lock while launching a kernel and doesn’t release it until the kernel returns.
I’d also like to know whether there is any way to alleviate the side effects of this lock, if it really exists. Since the GPU is far from fully loaded, there is a lot more computing power I could squeeze out, but the lock gets in the way.

The hardware is a P100. The software is CUDA 8.0, driver version 375.26.

Thanks for any comments.

Is every operation you are requesting happening in a dedicated stream, or is something also occurring in the default stream? As far as I know, the default stream enforces synchronization, and that could temporarily block the other streams. The CUDA profiler could show you what is going on; see the sketch below for the difference.
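A minimal sketch of what I mean (my own example, not TensorFlow internals): work submitted to the legacy default stream serializes against every other stream in the same context, while work submitted to explicitly created streams can overlap.

// Minimal sketch: default stream vs. explicit stream.
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    // Explicit (non-default) stream: work on it can run concurrently
    // with work on other explicit streams.
    cudaStream_t s;
    cudaStreamCreate(&s);
    dummyKernel<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);

    // The same launch without a stream argument goes to the default stream,
    // which implicitly synchronizes with every other stream in the context.
    dummyKernel<<<(n + 255) / 256, 256>>>(d_buf, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s);
    cudaFree(d_buf);
    return 0;
}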

Are you by chance using non-page-locked host buffers with cudaMemcpyAsync()? That can in fact make an async memcpy synchronous.
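For completeness, a minimal sketch of the difference (buffer names are mine): cudaMemcpyAsync() is only truly asynchronous when the host buffer is page-locked; with a plain malloc’d buffer the driver falls back to a synchronous staging path.

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 4 * 1024 * 1024;
    float *d_buf, *h_pinned, *h_pageable;

    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_pinned, bytes);          // page-locked (pinned) host memory
    h_pageable = (float *)malloc(bytes);       // ordinary pageable host memory

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Truly asynchronous: returns immediately and can overlap with compute.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);

    // Effectively synchronous despite the name, because the source buffer
    // is pageable and has to be staged by the driver.
    cudaMemcpyAsync(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}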

I am not very familiar with the P100 in detail, but there may be restrictions that apply here, for example:

  • the number of asynchronous copy engines
  • the length of the command and kernel launch queues to the device (there may be hardware or driver limits)

I would expect blocking to occur when such a limit is hit. At least the copy-engine count can be checked directly, as sketched below.
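A quick sketch of that query through the runtime API (the command/launch queue depths are not exposed this way, as far as I know):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev = 0;
    int asyncEngines = 0, concurrentKernels = 0;

    // Number of asynchronous copy (DMA) engines on the device.
    cudaDeviceGetAttribute(&asyncEngines, cudaDevAttrAsyncEngineCount, dev);
    // Whether the device can run multiple kernels concurrently.
    cudaDeviceGetAttribute(&concurrentKernels, cudaDevAttrConcurrentKernels, dev);

    printf("async copy engines: %d\n", asyncEngines);
    printf("concurrent kernels: %s\n", concurrentKernels ? "yes" : "no");
    return 0;
}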

Christian

Thanks @cbuchner1 for the very insightful comments! I assume a TensorFlow session uses only one compute stream for all compute kernels. In fact, I can further increase performance and GPU utilization by launching multiple GPU contexts (modifying the TF source code to bind each session to a different GPU context and using multiple sessions) and switching between them. Does this confirm that we are hitting some hardware/driver limit?

The problem with using multiple contexts is that context switching is expensive, and we cannot actually benefit from overlapping data transfer with computation, because when one context is active the others are inactive (on the Pascal architecture, if I understand correctly). In my case, the host-to-device data transfer is heavy and may take 20-30% of GPU time. Within a single context, the kind of overlap I’d like looks roughly like the sketch below.
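A rough sketch of that single-context, two-stream pipeline (buffer and kernel names are placeholders, not TensorFlow code): while one stream computes on batch i, the other copies batch i+1 to the device, which requires pinned host memory as noted above.

#include <cuda_runtime.h>

__global__ void processBatch(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, numBatches = 8;
    const size_t bytes = n * sizeof(float);

    float *h_in, *d_buf[2];
    cudaMallocHost(&h_in, bytes);              // pinned, so copies can overlap
    cudaMalloc(&d_buf[0], bytes);
    cudaMalloc(&d_buf[1], bytes);

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int b = 0; b < numBatches; ++b) {
        int slot = b % 2;                      // ping-pong between buffers/streams
        cudaMemcpyAsync(d_buf[slot], h_in, bytes,
                        cudaMemcpyHostToDevice, streams[slot]);
        processBatch<<<(n + 255) / 256, 256, 0, streams[slot]>>>(d_buf[slot], n);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
    cudaFreeHost(h_in);
    return 0;
}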

Do you have any suggestions for optimizing my case?

Thanks again for your help.

You can raise your issue with the TensorFlow developers (maybe there is a mailing list for discussing such topics).

What I would find excellent (from a developer’s perspective) is an addition to the CUDA API such as a cudaSetCurrentStream() call, so that all subsequent API calls default to operating on the given stream. This would be ideal for making non-stream-aware libraries run on the desired stream without having to change the library code (which sometimes isn’t even possible, at least for closed-source products).
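To make the idea concrete, here is an illustration; note that cudaSetCurrentStream() does not exist in the CUDA API today, so the proposed usage appears only in comments.

#include <cuda_runtime.h>

__global__ void someKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Today: the stream has to be threaded through every call explicitly, and
// code that doesn't take a stream parameter silently uses the default stream.
void today(cudaStream_t s, float *d_buf, float *h_buf, size_t bytes, int n) {
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, s);
    someKernel<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);
}

// Proposed (hypothetical, NOT part of CUDA):
//
//   cudaSetCurrentStream(s);   // make s the implicit stream for this thread
//   // Subsequent calls, including those inside closed-source libraries,
//   // would then run on s without any code changes:
//   cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
//   someKernel<<<(n + 255) / 256, 256>>>(d_buf, n);

int main() { return 0; }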

Christian