I am running a GPU-based algorithm whose kernel is executing on a GPU. At the same time, I am running another cuDNN/cuBLAS-based algorithm in parallel. I see that cudnnCreate() / cublasCreate() blocks until the GPU kernel completes, whether that kernel was launched from the same process or from another one.
From my reading of the cuDNN/cuBLAS documentation, these functions call cudaDeviceSynchronize() internally, so they would block until the GPU completes all queued tasks. But cudaDeviceSynchronize() should wait only for tasks from the current context to complete, right? So why do these functions block even when the kernels run in a different context, and even when they run in a different process?
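
A minimal sketch of how the same-process case could be reproduced and timed. The kernel `busy_kernel` and its cycle count are hypothetical stand-ins for the real workload, and the file name `repro.cu` is only illustrative. The context is created up front with cudaFree(0) so that its cost is not counted as part of cublasCreate().

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical stand-in for the long-running kernel:
// spins until roughly `cycles` GPU clock ticks have elapsed.
__global__ void busy_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

static double ms_between(std::chrono::steady_clock::time_point a,
                         std::chrono::steady_clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
}

int main() {
    // Force context creation now so it is not measured below.
    if (cudaFree(0) != cudaSuccess) {
        std::fprintf(stderr, "failed to initialize the CUDA context\n");
        return 1;
    }

    // Kernel launches are asynchronous: control returns to the host at once.
    busy_kernel<<<1, 1>>>(2000000000LL);  // assumed to run for about a second
    if (cudaGetLastError() != cudaSuccess) {
        std::fprintf(stderr, "kernel launch failed\n");
        return 1;
    }

    // Time cublasCreate() while the kernel is still running.
    auto t0 = std::chrono::steady_clock::now();
    cublasHandle_t handle;
    cublasStatus_t status = cublasCreate(&handle);
    auto t1 = std::chrono::steady_clock::now();
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // Time how much longer the kernel needs after cublasCreate() returned.
    cudaDeviceSynchronize();
    auto t2 = std::chrono::steady_clock::now();

    std::printf("cublasCreate took      : %.1f ms\n", ms_between(t0, t1));
    std::printf("remaining kernel time  : %.1f ms\n", ms_between(t1, t2));

    cublasDestroy(handle);
    return 0;
}
```

This can be built with `nvcc repro.cu -lcublas -o repro`. If cublasCreate() really waits for the kernel, the first number should be close to the kernel's run time and the second close to zero. For the cross-process case, one way to check is to start a second copy of the program while the first one's kernel is still running and compare the timings.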