I have a case where kernels in different streams fail to overlap once the streams also contain copy operations.
e.g.:
kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host
Both kernel1 and kernel2 are small enough to run concurrently, but according to the profiling result they are executed serially.
If I delete the copy operations or assign them to other streams, e.g.:
then kernel1 and kernel2, and even the first copy operation, execute concurrently.
I’d like to know how to overlap kernels in different streams with copy operations, so that kernel1 and kernel2 run in parallel and each kernel's results are transferred to the host as soon as that kernel completes.
kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host
Streams only express dependencies and the ordering of operations. Operations in different streams are independent and may run concurrently, but the CUDA driver gives no guarantee that they actually will.
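To make the ordering semantics concrete, here is a minimal sketch of the two-stream pattern from the question. The kernel `fill` and the function `run_two_streams` are hypothetical stand-ins for kernel1/kernel2, not the original poster's code. The sketch assumes pinned host buffers (`cudaMallocHost`), because a device-to-host `cudaMemcpyAsync` into pageable memory returns only after the copy has completed. It only demonstrates that each copy is ordered after the kernel in its own stream; whether the two streams overlap is still up to the driver and hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing bool function on any CUDA error (sketch only: no cleanup on failure).
#define CUDA_OK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Hypothetical stand-in for kernel1/kernel2: writes `value` into every element.
__global__ void fill(int *out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Submits kernel + copy to each of two streams; returns true if both host buffers are correct.
bool run_two_streams() {
    const int n = 1 << 10;
    const size_t bytes = n * sizeof(int);
    cudaStream_t s[2];
    int *d[2];
    int *h[2];

    for (int i = 0; i < 2; ++i) {
        CUDA_OK(cudaStreamCreate(&s[i]));
        CUDA_OK(cudaMalloc(&d[i], bytes));
        CUDA_OK(cudaMallocHost(&h[i], bytes));  // pinned host memory (assumption, see above)
    }

    for (int i = 0; i < 2; ++i) {
        // Within stream s[i] the copy is ordered after the kernel.
        // Nothing orders s[0] against s[1], so the two streams MAY overlap.
        fill<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n, i + 1);
        CUDA_OK(cudaGetLastError());
        CUDA_OK(cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]));
    }
    CUDA_OK(cudaDeviceSynchronize());  // wait for both streams before reading h[]

    bool ok = true;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < n; ++j)
            ok = ok && (h[i][j] == i + 1);

    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(h[i]);
        cudaFree(d[i]);
        cudaStreamDestroy(s[i]);
    }
    return ok;
}

int main() {
    printf("results %s\n", run_two_streams() ? "correct" : "incorrect or CUDA error");
    return 0;
}
```

The results are correct whether or not the streams overlap, which is the point: stream order is a correctness guarantee, concurrency is an optimization the driver may or may not apply.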
Maybe I should make it clear that there is no dependency between kernel1 and kernel2.
As the code snippet shows
kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host
kernel1, kernel2 and their copy operations are submitted to their respective streams.
If there is no copy operation in stream1 or stream2, kernel1 and kernel2 run in parallel.
But when a copy operation is added to stream1, kernel2 in stream2 is blocked even while that copy operation is not running, as the figure depicts.
As I have stated, the programmer can only hint at which operations may run in parallel. The driver is free to ignore those hints, for example when not enough resources are available.
Can you share a minimal runnable code example which shows your observation?
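A reproducer along these lines might look like the sketch below. It is not the original poster's code: the busy-wait kernel `spin`, the helper `time_pairs`, and the cycle count are all hypothetical. It submits two kernel + copy pairs, once to a single stream (forced serial) and once to two streams, and compares host wall-clock time. If the two-stream time is close to the single-stream time, the kernels did not overlap on that system. A profiler timeline gives a more reliable picture than this timing.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-wait kernel: a single small block spins for about `cycles` clock ticks.
__global__ void spin(long long cycles, int *out) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
    if (threadIdx.x == 0) *out = 1;
}

// Submits two kernel + copy pairs, to one stream or to two.
// Returns elapsed wall-clock milliseconds, or -1.0 on any error (sketch only: no cleanup on failure).
double time_pairs(bool useTwoStreams, long long cycles) {
    cudaStream_t s[2];
    int *d[2];
    int *h[2];
    for (int i = 0; i < 2; ++i) {
        if (cudaStreamCreate(&s[i]) != cudaSuccess) return -1.0;
        if (cudaMalloc(&d[i], sizeof(int)) != cudaSuccess) return -1.0;
        if (cudaMallocHost(&h[i], sizeof(int)) != cudaSuccess) return -1.0;  // pinned
        *h[i] = 0;
    }
    cudaDeviceSynchronize();  // make sure setup work is not included in the timing

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 2; ++i) {
        cudaStream_t st = useTwoStreams ? s[i] : s[0];
        spin<<<1, 32, 0, st>>>(cycles, d[i]);
        cudaMemcpyAsync(h[i], d[i], sizeof(int), cudaMemcpyDeviceToHost, st);
    }
    cudaError_t err = cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    bool ok = (err == cudaSuccess) && (*h[0] == 1) && (*h[1] == 1);
    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(h[i]);
        cudaFree(d[i]);
        cudaStreamDestroy(s[i]);
    }
    if (!ok) return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const long long cycles = 100000000LL;  // arbitrary; tune so each kernel runs for tens of ms
    time_pairs(true, 1000);                // warm-up run to absorb context creation cost
    double oneStream  = time_pairs(false, cycles);
    double twoStreams = time_pairs(true, cycles);
    printf("one stream:  %.2f ms\ntwo streams: %.2f ms\n", oneStream, twoStreams);
    return 0;
}
```

Posting the GPU model, driver version, and operating system alongside such a reproducer would help, since the observed behavior can differ between systems.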