When I profiled my CUDA code, I observed that copy operations and kernel executions on different streams were not overlapping.
The streams were created by defining cuBLAS operations on thread 1 and then executing them on thread 2, while the copy operation was launched after the cuBLAS operation on thread 2.
The streams were implicitly created using the compilation flag: --default-stream per-thread.
That doesn’t sound like a valid recipe. The cuBLAS library is precompiled, so it is unaffected by your compilation settings for default stream behavior. Your profiler picture makes it fairly evident that the cuBLAS calls are launched into the legacy default stream.
My suggestion would be to explicitly create streams, and explicitly set those streams for use by cuBLAS using cublasSetStream.
Thank you for the answer on the stream issue.
But I still do not understand why there is no overlap between the copy and the kernel execution on those streams.
That is the behavior I want, but it seems like undefined behavior that could change at any time. Am I wrong?
If the cuBLAS activities (i.e. the dark blue items on the “Kernels” row and the “Default stream 7” row) are being launched into the legacy default stream (which is the way it appears to me), then there is no possibility for anything launched into the legacy default stream to overlap with any other activity on that GPU. That is a fundamental principle of stream semantics.