Dear all,
According to the CUDA Programming Guide, section 3.2.8.5.3 (Implicit Synchronization):
Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
- …
- a device memory set,
- …
- any CUDA command to the NULL stream,
I assume “a device memory set” includes cudaMemsetAsync. From my understanding, scenarios 1 and 2 (cudaMemsetAsync overlapping with kernel1 or kernel2) can happen, while scenario 3 (kernel1 overlapping with kernel2) cannot.
In other words, if a cudaMemsetAsync is issued by the host between the launches of two kernels in different streams, the execution of the two kernels must be serialized (no overlap between kernel1 and kernel2), but the execution of the cudaMemsetAsync itself can still run concurrently with a kernel.
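To make the issue order concrete, here is a minimal sketch of the pattern I am describing (the kernel names, buffer sizes, and streams are my own, not from the guide). Per my reading of section 3.2.8.5.3, the memset issued between the two launches should prevent k1 and k2 from overlapping, while the memset itself may overlap with either kernel:

```cpp
// Two kernels in different streams with a device memory set issued in
// between by the host thread. All names and sizes here are mine.
#include <cuda_runtime.h>

__global__ void k1(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;   // dummy work for kernel 1
}

__global__ void k2(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;   // dummy work for kernel 2
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    char *scratch;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&scratch, n);

    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);

    k1<<<(n + 255) / 256, 256, 0, s1>>>(a, n);   // kernel 1 in stream s1
    cudaMemsetAsync(scratch, 0, n, s3);          // device memory set, issued in between
    k2<<<(n + 255) / 256, 256, 0, s2>>>(b, n);   // kernel 2 in stream s2

    cudaDeviceSynchronize();
    return 0;
}
```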
However, I am confused after profiling an inference service that assigns each request-handler thread a dedicated CUDA stream: I observed scenario 3 happening, i.e., two kernels from different streams overlapping even though a cudaMemsetAsync was issued between their launches.
My environment: CUDA 11, V100S GPU.
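For context, the service's stream usage looks roughly like the sketch below (my simplification, not the actual handler code): each request-handler thread owns a dedicated stream, so memsets and kernel launches from different threads interleave in host issue order, and a cudaMemsetAsync from one thread routinely lands between two kernel launches from different streams.

```cpp
// Simplified model of the service (my reconstruction): one dedicated
// stream per request-handler thread.
#include <cuda_runtime.h>
#include <thread>

__global__ void handle(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;   // stand-in for the real per-request work
}

void handler(float *buf, int n) {
    cudaStream_t s;
    cudaStreamCreate(&s);                              // dedicated stream for this thread
    cudaMemsetAsync(buf, 0, n * sizeof(float), s);     // clear the request buffer
    handle<<<(n + 255) / 256, 256, 0, s>>>(buf, n);    // run the request's kernel
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}

int main() {
    const int n = 1 << 20;
    float *b1, *b2;
    cudaMalloc(&b1, n * sizeof(float));
    cudaMalloc(&b2, n * sizeof(float));
    std::thread t1(handler, b1, n), t2(handler, b2, n);  // two concurrent "requests"
    t1.join();
    t2.join();
    return 0;
}
```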
Also, I found a cudaMemsetAsync on the default stream executing in parallel with a kernel in a different stream. This seems to conflict with this post, which says:
no operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.
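For reference, a minimal version of that second pattern would look like the sketch below (again my own reconstruction, not the service code). It assumes the legacy default stream; I have not built with nvcc --default-stream per-thread, and the other stream is created without cudaStreamNonBlocking, so my understanding of the quoted post is that the two operations should not overlap:

```cpp
// Kernel in a user-created (blocking) stream next to a memset on the
// default (NULL) stream. Names and sizes are mine, for illustration.
#include <cuda_runtime.h>

__global__ void k(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a;
    char *scratch;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&scratch, n);

    cudaStream_t s;
    cudaStreamCreate(&s);                      // plain stream, no cudaStreamNonBlocking

    k<<<(n + 255) / 256, 256, 0, s>>>(a, n);   // kernel in stream s
    cudaMemsetAsync(scratch, 0, n, 0);         // memset on the default (NULL) stream

    cudaDeviceSynchronize();
    return 0;
}
```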
Could you please explain why the behaviors I observed are expected, and clarify section 3.2.8.5.3 a bit more? Any help would be appreciated.
Yang