cudnnCreate() / cublasCreate() blocked while CUDA kernels run in parallel (irrespective of process)

I am running a GPU-based algorithm that executes a CUDA kernel on a GPU. At the same time, I am running another cuDNN/cuBLAS based algorithm in parallel on the same GPU. I see that cudnnCreate() / cublasCreate() is blocked until the GPU kernel in the same or another process completes.

From the cuDNN/cuBLAS documentation it is clear that these functions call cudaDeviceSynchronize() internally, and hence they block until the GPU completes all the tasks in the queue. But cudaDeviceSynchronize() should wait only for the tasks from the current context to complete, right? So why do these functions block even when the kernels run in a different context, and even when they run in a different process?
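A minimal repro sketch of what I’m describing (simplified, not my actual application code; the spin kernel and sizes here are just placeholders for the real GPU work): one thread keeps the GPU busy with a long-running kernel while another thread times cublasCreate().

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Placeholder kernel that spins for roughly the requested number of clock cycles.
__global__ void busyKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* keep the GPU busy */ }
}

int main() {
    // Thread A: launch a kernel that runs for a second or two.
    std::thread worker([] {
        busyKernel<<<1, 1>>>(2000000000LL);
        cudaDeviceSynchronize();
    });

    // Give the kernel time to start before timing the library init.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    // Thread B (main): time cublasCreate() while the kernel is still running.
    auto t0 = std::chrono::steady_clock::now();
    cublasHandle_t handle;
    cublasCreate(&handle);   // observed to block until the kernel finishes
    auto t1 = std::chrono::steady_clock::now();
    printf("cublasCreate took %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());

    worker.join();
    cublasDestroy(handle);
    return 0;
}
```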

I don’t know the precise reason, but the behavior doesn’t surprise me. Let’s take an example. If you attempt to do a cudaMalloc operation, it blocks, waiting (as far as I know) until the GPU is completely idle (all contexts). One reason for this is that it affects the overall memory map of the GPU, and likewise you could imagine that some kind of serialization is also needed as you use up GPU resources. You can’t pretend that GPU memory (which is obviously shared amongst all contexts) can be managed as if only one context were using it.

Given that, my observation is that a cudaMalloc is very often a synchronizing operation, and I suspect that can affect any context on that particular GPU.

Now, what happens if the cudnnCreate or cublasCreate call needs to allocate device memory?
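One quick way to check that (a sketch only, not an authoritative statement about what the library does internally): compare free device memory before and after cublasCreate() with cudaMemGetInfo(). A drop in free memory would suggest the call allocates device memory (workspaces, etc.) behind the scenes.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t freeBefore, freeAfter, total;
    cudaFree(0);                          // establish the CUDA context first
    cudaMemGetInfo(&freeBefore, &total);

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaMemGetInfo(&freeAfter, &total);
    // If this prints a nonzero value, cublasCreate allocated device memory.
    printf("device memory consumed by cublasCreate: %zu bytes\n",
           freeBefore - freeAfter);

    cublasDestroy(handle);
    return 0;
}
```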

Thanks Robert,

I wrote a small prototype where I ran 2 kernels in parallel across different processes and different threads, and I could see them run in parallel. These processes/threads were also doing a cudaMalloc, a host-to-device copy, running the kernel, and then a device-to-host copy, and the parallelism was still not inhibited by any other factor. So I am just wondering what the pressing need is to device-synchronize when initializing the cuDNN/cuBLAS library using cudnnCreate() / cublasCreate()!
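Roughly, each process/thread in the prototype did something like the following (simplified sketch; the actual kernel and buffer sizes were different):

```cpp
#include <cuda_runtime.h>
#include <vector>

// Placeholder for the prototype's real kernel.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));                 // allocate device memory
    cudaMemcpy(dev, host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);                  // host -> device copy
    scaleKernel<<<(n + 255) / 256, 256>>>(dev, n);       // run the kernel
    cudaMemcpy(host.data(), dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);                  // device -> host copy
    cudaFree(dev);
    return 0;
}
```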

Because for me this constraint implicitly means that we cannot run 2 cuDNN/cuBLAS based algorithms (across processes or across threads) in parallel on the same GPU!

In the same way that your small prototype worked as desired (i.e. kernels could run in parallel), the cuDNN and cuBLAS based algorithms can run in parallel. Both processes do all their cublasCreate or cudnnCreate calls first (just as both of your prototype processes did cudaMalloc first), and then both processes can run whatever work they wish with those already-created handles. If you are constantly creating and destroying handles, that is probably a bad pattern and you should seek to refactor it, just as in ordinary CUDA code you would seek to reuse device memory allocations rather than constantly doing cudaMalloc/cudaFree.
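A sketch of that pattern with cuBLAS (illustrative only; the saxpy call and the sizes are arbitrary placeholders for whatever work your algorithm actually does):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Reuse the same handle for every call instead of re-creating it each time.
void runManySaxpy(cublasHandle_t handle, float *d_x, float *d_y, int n, int iterations) {
    const float alpha = 1.0f;
    for (int i = 0; i < iterations; ++i) {
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    }
}

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);                 // pay this cost once, at initialization

    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));   // likewise, allocate device memory once and reuse it
    cudaMalloc(&d_y, n * sizeof(float));

    runManySaxpy(handle, d_x, d_y, n, 1000);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    cublasDestroy(handle);                 // once, at shutdown
    return 0;
}
```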

Furthermore, witnessing kernel concurrency generally requires “small” kernels that execute for some reasonable/visible duration and have low resource utilization, so that 2 of them can actually run concurrently. The same ideas hold for cuDNN/cuBLAS. If you issue large enough work, you will not witness concurrency in the kernels issued by cuDNN or cuBLAS in 2 separate processes, because the GPU does not have enough compute resources to support both running concurrently.

For cuBLAS we can quickly estimate the “size” of work that will saturate a GPU, preventing any ability to witness concurrency. Simply take whatever matrix or array size you are using, convert that into threads, and compare it to the thread-carrying capacity of the GPU you are running on. If the number of threads is equal to or greater than the thread-carrying capacity of the GPU, it’s unrealistic to expect to witness concurrency, whether with ordinary CUDA kernels or with cuBLAS or cuDNN calls.
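A sketch of that estimate (the one-thread-per-output-element assumption is just a rough approximation; the actual thread mapping used by cuBLAS kernels varies):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum number of threads the GPU can hold resident at once.
    long long capacity =
        (long long)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    // Example: a 4096 x 4096 output, assuming roughly one thread per output element.
    long long workThreads = 4096LL * 4096LL;

    printf("thread-carrying capacity:   %lld\n", capacity);
    printf("threads implied by the work: %lld\n", workThreads);
    if (workThreads >= capacity)
        printf("this work saturates the GPU; don't expect to witness concurrency\n");
    else
        printf("this work may leave room for concurrent kernels\n");
    return 0;
}
```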