Why is there no overlap between CUDA kernel execution and copy operations issued in parallel?

When I profiled my CUDA code, I observed that copy operations and kernel executions issued on different streams were not running in parallel.

The streams were created by defining cuBLAS operations on thread 1 and then executing them on thread 2, while the copy operation was launched after the cuBLAS operation on thread 2.

The streams were implicitly created using the compilation flag: --default-stream per-thread.

The profile can be seen in the picture below.

What prevents the operations from overlapping?
And is there a guarantee that the overlap will not occur on versions above CUDA 11.8?

That doesn’t sound like a valid recipe. The cuBLAS code is precompiled as a library, and is unaffected by your compilation settings for default stream behavior. Your profiler picture makes it fairly evident that the cuBLAS calls are launched into the legacy default stream.
My suggestion would be to explicitly create streams, and explicitly set those streams for use by cublas using cublasSetStream.
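
As a minimal sketch of that suggestion (not code from this thread): the helper name, buffer names, and matrix size below are hypothetical, and whether the copy actually overlaps the GEMM still depends on the GPU and on the host buffer being pinned. The idea is to create both streams explicitly, bind the cuBLAS handle to one of them with cublasSetStream, and issue the copy on the other.

```cuda
#include <cstring>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): issues one SGEMM on computeStream
// and an independent host-to-device copy on copyStream. Returns 0 on success.
int run_gemm_and_copy_on_explicit_streams(int n = 1024) {
    const size_t bytes = sizeof(float) * n * n;
    const float alpha = 1.0f, beta = 0.0f;
    cudaStream_t computeStream = nullptr, copyStream = nullptr;
    cublasHandle_t handle = nullptr;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr, *dBuf = nullptr, *hBuf = nullptr;

    bool ok = true;
    // Non-blocking streams do not synchronize implicitly with the legacy default stream.
    ok = ok && cudaStreamCreateWithFlags(&computeStream, cudaStreamNonBlocking) == cudaSuccess;
    ok = ok && cudaStreamCreateWithFlags(&copyStream, cudaStreamNonBlocking) == cudaSuccess;
    ok = ok && cudaMalloc(&dA, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&dB, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&dC, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&dBuf, bytes) == cudaSuccess;
    // Pinned host memory, so that cudaMemcpyAsync can run asynchronously.
    ok = ok && cudaMallocHost(&hBuf, bytes) == cudaSuccess;
    ok = ok && cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS;
    // Route every subsequent cuBLAS call on this handle to computeStream.
    ok = ok && cublasSetStream(handle, computeStream) == CUBLAS_STATUS_SUCCESS;

    if (ok) std::memset(hBuf, 0, bytes);
    ok = ok && cudaMemsetAsync(dA, 0, bytes, computeStream) == cudaSuccess;
    ok = ok && cudaMemsetAsync(dB, 0, bytes, computeStream) == cudaSuccess;

    // GEMM on computeStream; the copy below is independent and goes to copyStream.
    ok = ok && cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                           &alpha, dA, n, dB, n, &beta, dC, n) == CUBLAS_STATUS_SUCCESS;
    ok = ok && cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice,
                               copyStream) == cudaSuccess;

    ok = ok && cudaStreamSynchronize(computeStream) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(copyStream) == cudaSuccess;

    // Release everything that was successfully created.
    if (handle) cublasDestroy(handle);
    cudaFreeHost(hBuf);
    cudaFree(dBuf); cudaFree(dC); cudaFree(dB); cudaFree(dA);
    if (copyStream) cudaStreamDestroy(copyStream);
    if (computeStream) cudaStreamDestroy(computeStream);
    return ok ? 0 : 1;
}
```

Calling the helper from main and linking against cuBLAS (for example with -lcublas) should be enough to inspect in the profiler which stream each operation lands on.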

Thank you for the answer regarding the stream issue.
But I still do not understand why there is no overlap between the copy and the kernel execution on those streams.
That is the behavior I want, but it seems like undefined behavior that could change at any time. Am I wrong?

If the cuBLAS activities (i.e. the dark blue items on the “Kernels” row and the “Default stream 7” row) are being launched into the legacy default stream (which is the way it appears to me), then there is no possibility for anything launched into the legacy default stream to overlap with any other activity on that GPU. That is a fundamental principle of stream semantics.
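
The serialization described above can be probed with a rough sketch like the one below. It is an illustration under stated assumptions, not a measurement from this thread: the spin kernel, the helper name, and the cycle count are hypothetical, and it uses two kernels rather than a kernel and a copy so that the effect is easier to see in wall time. It launches one kernel into the legacy default stream (cudaStreamLegacy) and one into a second stream created with the given flags. Ordinary (blocking) streams, including per-thread default streams, synchronize implicitly with the legacy default stream; streams created with cudaStreamNonBlocking are documented as exempt from that implicit synchronization.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Busy-waits on the GPU for roughly `cycles` clock ticks.
__global__ void spin(long long cycles) {
    const long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Hypothetical helper: launches one spin kernel into the legacy default stream
// and one into a stream created with `flags`, then returns the wall time in
// milliseconds, or -1.0 on error.
double time_legacy_plus_other(unsigned int flags, long long cycles = 50000000LL) {
    cudaStream_t other = nullptr;
    if (cudaStreamCreateWithFlags(&other, flags) != cudaSuccess) return -1.0;

    spin<<<1, 1, 0, other>>>(0);  // warm-up so that module loading is not timed
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    const auto t0 = std::chrono::steady_clock::now();
    spin<<<1, 1, 0, cudaStreamLegacy>>>(cycles);  // legacy default stream
    spin<<<1, 1, 0, other>>>(cycles);             // second stream under test
    ok = ok && cudaGetLastError() == cudaSuccess;
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    const auto t1 = std::chrono::steady_clock::now();

    cudaStreamDestroy(other);
    return ok ? std::chrono::duration<double, std::milli>(t1 - t0).count() : -1.0;
}
```

Comparing time_legacy_plus_other(cudaStreamDefault) with time_legacy_plus_other(cudaStreamNonBlocking) on a device that supports concurrent kernels would be expected to show the blocking case taking noticeably longer, since the second kernel has to wait for the one in the legacy default stream. The exact numbers depend on the hardware.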

