cuBLAS and dynamic parallelism

Does anyone know if the cuBLAS functions that are called inside kernels running on the device (as in Dynamic Parallelism) cause some sort of device synchronization?

What if I want to run a couple of kernels concurrently on the same device using multiple streams? Will the cuBLAS calls inside the kernels cause a device sync, so that no concurrency is achieved?

Is there a way around this?


With the exception of a few routines (those returning a scalar value or involving CPU<->GPU transfers), most cuBLAS routines are asynchronous when called from the host. However, I have never called them from within a kernel, so I do not know whether they keep their asynchronous behavior there. If nobody can answer your question definitively, it is worth profiling your code yourself with the Visual Profiler to see whether you observe concurrency…
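To illustrate the host-side point about scalar-returning routines, here is a minimal sketch. It assumes the host cuBLAS API (cublas_v2.h). The helper name idamax_device_pointer_mode and the test vector are made up for illustration. As far as I understand the host API, with the default host pointer mode the result is written to host memory, so the call has to wait for it. With CUBLAS_POINTER_MODE_DEVICE the result stays in device memory and the call can be queued on the stream. Whether the same holds for cuBLAS called from inside a kernel is not something this sketch shows.

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: runs cublasIdamax on a small vector with the result
// kept in device memory. Returns the 1-based index reported by cuBLAS,
// or -1 if any call fails.
int idamax_device_pointer_mode() {
    const int n = 5;
    const double h_x[n] = {1.0, -7.5, 3.0, 2.0, -0.5};
    double* d_x = nullptr;
    int* d_result = nullptr;
    int h_result = -1;
    cudaStream_t stream = nullptr;
    cublasHandle_t handle = nullptr;

    if (cudaStreamCreate(&stream) != cudaSuccess) return -1;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        cudaStreamDestroy(stream);
        return -1;
    }

    bool ok = true;
    // Queue all cuBLAS work for this handle on our stream.
    ok = ok && cublasSetStream(handle, stream) == CUBLAS_STATUS_SUCCESS;
    // Scalar results are written through device pointers, not host pointers.
    ok = ok && cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE)
                   == CUBLAS_STATUS_SUCCESS;

    ok = ok && cudaMalloc((void**)&d_x, n * sizeof(double)) == cudaSuccess;
    ok = ok && cudaMalloc((void**)&d_result, sizeof(int)) == cudaSuccess;
    ok = ok && cudaMemcpyAsync(d_x, h_x, n * sizeof(double),
                               cudaMemcpyHostToDevice, stream) == cudaSuccess;

    // The index of the element with the largest |x[i]| goes to d_result.
    ok = ok && cublasIdamax(handle, n, d_x, 1, d_result)
                   == CUBLAS_STATUS_SUCCESS;

    // Fetch the result only when the host actually needs it.
    ok = ok && cudaMemcpyAsync(&h_result, d_result, sizeof(int),
                               cudaMemcpyDeviceToHost, stream) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;

    cudaFree(d_result);
    cudaFree(d_x);
    cublasDestroy(handle);
    cudaStreamDestroy(stream);
    return ok ? h_result : -1;
}
```

For the vector above the largest magnitude is -7.5 at position 2 (cuBLAS uses 1-based indexing here). Link against cuBLAS (for example with -lcublas) when building.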

Thanks. Yes, I’ve run it through the profiler. The profiler output suggests that the device is being synchronized. I wanted to be sure, to see if there’s a workaround, and to understand why it happens.

I just took the cdpLUDecomposition sample and ran it on two streams concurrently; the profiler shows the device synchronizing.

The LU sample code contains a call to cublasIzamax. This function caused the synchronization because it returns a value. Replacing it with a custom kernel that finds the max solved the problem, and the LU decomposition can now run on multiple streams concurrently.
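For anyone wondering what such a replacement kernel might look like, here is a minimal sketch. It is not the code I actually used in the sample. The names argmax_abs_kernel and argmax_two_streams are made up, the data is real double rather than complex, and the kernel assumes a single block with a power-of-two thread count. It returns a 0-based index, unlike the cuBLAS routine, which returns a 1-based one.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define ARGMAX_THREADS 256  // must be a power of two for the reduction below

// Single-block reduction: writes the 0-based index of the element with the
// largest absolute value into *result. Ties go to the lowest index.
__global__ void argmax_abs_kernel(const double* x, int n, int* result) {
    __shared__ double s_val[ARGMAX_THREADS];
    __shared__ int s_idx[ARGMAX_THREADS];
    int tid = threadIdx.x;

    // Each thread scans a strided slice of the vector.
    double best = -1.0;
    int best_idx = -1;
    for (int i = tid; i < n; i += blockDim.x) {
        double v = fabs(x[i]);
        if (v > best) { best = v; best_idx = i; }
    }
    s_val[tid] = best;
    s_idx[tid] = best_idx;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            double other = s_val[tid + stride];
            int other_idx = s_idx[tid + stride];
            bool better = other > s_val[tid] ||
                          (other == s_val[tid] && other_idx >= 0 &&
                           other_idx < s_idx[tid]);
            if (better) { s_val[tid] = other; s_idx[tid] = other_idx; }
        }
        __syncthreads();
    }
    if (tid == 0) *result = s_idx[0];
}

// Hypothetical host helper: runs the kernel on two vectors, one per stream.
// Nothing here forces a device-wide sync; each stream is waited on separately.
bool argmax_two_streams(const double* h_in[2], int n, int h_out[2]) {
    cudaStream_t streams[2] = {nullptr, nullptr};
    double* d_x[2] = {nullptr, nullptr};
    int* d_idx[2] = {nullptr, nullptr};
    bool ok = true;

    for (int s = 0; s < 2; ++s) {
        ok = ok && cudaStreamCreate(&streams[s]) == cudaSuccess;
        ok = ok && cudaMalloc((void**)&d_x[s], n * sizeof(double)) == cudaSuccess;
        ok = ok && cudaMalloc((void**)&d_idx[s], sizeof(int)) == cudaSuccess;
    }
    for (int s = 0; ok && s < 2; ++s) {
        ok = ok && cudaMemcpyAsync(d_x[s], h_in[s], n * sizeof(double),
                                   cudaMemcpyHostToDevice, streams[s]) == cudaSuccess;
        argmax_abs_kernel<<<1, ARGMAX_THREADS, 0, streams[s]>>>(d_x[s], n, d_idx[s]);
        ok = ok && cudaGetLastError() == cudaSuccess;
        ok = ok && cudaMemcpyAsync(&h_out[s], d_idx[s], sizeof(int),
                                   cudaMemcpyDeviceToHost, streams[s]) == cudaSuccess;
    }
    for (int s = 0; s < 2; ++s) {
        if (streams[s]) {
            ok = (cudaStreamSynchronize(streams[s]) == cudaSuccess) && ok;
            cudaStreamDestroy(streams[s]);
        }
        cudaFree(d_idx[s]);
        cudaFree(d_x[s]);
    }
    return ok;
}
```

The sketch launches from the host to keep it short. Inside a parent kernel the same launch syntax would apply, with the index left in device memory for the next step of the factorization instead of being copied back.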