I was under the impression that cudaMemcpyAsync calls can only overlap if
- host memory is page-locked
- the transfers go in different directions (D2H vs H2D)
- they are issued on different CUDA streams.
I was profiling a unit test and observed that Nsight Systems shows two D2H memcpy operations that overlap in time.
Is my impression incorrect?
ref: How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog
At a high level, your interpretation and rubric are correct, and they are the right ones for CUDA developers to keep in mind (IMO). However, there is considerable complexity in the details, and there are facts that aren’t fully reflected in your high-level rubric. For example, in certain cases using pageable memory, the tail end of one transfer can overlap with the head end of another, as discussed here. That may or may not be applicable to your case. You haven’t provided a complete example, so I can’t give a definitive answer about what precisely is happening in your case (and I don’t know if I would, anyway). With respect to your question about the rubric: yes, it is not a perfectly accurate, complete, bullet-proof formula. But it is a good guide, and apart from curiosity about what appears to be an oddity, I consider it useful and sufficient.
(Also, not applicable to your case: two transfers in the same direction can overlap in some cases if they are targeting different devices, depending on system topology.)
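For anyone who wants to reproduce the "textbook" case from the rubric, here is a minimal sketch, not a definitive test: pinned host buffers, one H2D and one D2H copy, each on its own stream. The helper name `run_overlap_sketch` and the buffer size are my own choices for illustration. Whether the two copies actually overlap still depends on the GPU (it needs at least two copy engines, reported as `asyncEngineCount`), so the timeline should be checked in Nsight Systems rather than assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with `false` on any CUDA error.
// (Resources are not released on the error path; acceptable for a sketch.)
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            return false;                                          \
        }                                                          \
    } while (0)

// Hypothetical helper: issues one H2D and one D2H copy that satisfy all
// three conditions of the rubric, then checks the D2H result.
bool run_overlap_sketch(size_t n) {
    const size_t bytes = n * sizeof(float);
    float *h_in = nullptr, *h_out = nullptr;  // pinned host buffers
    float *d_a = nullptr, *d_b = nullptr;     // device buffers

    // Condition 1: host memory is page-locked (pinned).
    CHECK(cudaMallocHost(&h_in, bytes));
    CHECK(cudaMallocHost(&h_out, bytes));
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaMalloc(&d_b, bytes));

    for (size_t i = 0; i < n; ++i) {
        h_in[i] = 1.0f;
        h_out[i] = -1.0f;  // sentinel, overwritten by the D2H copy
    }
    CHECK(cudaMemset(d_b, 0, bytes));  // D2H source: all zero bytes
    CHECK(cudaDeviceSynchronize());

    // Condition 3: each transfer gets its own non-default stream.
    cudaStream_t s_h2d, s_d2h;
    CHECK(cudaStreamCreate(&s_h2d));
    CHECK(cudaStreamCreate(&s_d2h));

    // Condition 2: the two transfers go in opposite directions.
    CHECK(cudaMemcpyAsync(d_a, h_in, bytes, cudaMemcpyHostToDevice, s_h2d));
    CHECK(cudaMemcpyAsync(h_out, d_b, bytes, cudaMemcpyDeviceToHost, s_d2h));

    CHECK(cudaStreamSynchronize(s_h2d));
    CHECK(cudaStreamSynchronize(s_d2h));

    // The D2H copy should have replaced every sentinel with 0.0f.
    bool ok = true;
    for (size_t i = 0; i < n; ++i) {
        if (h_out[i] != 0.0f) { ok = false; break; }
    }

    CHECK(cudaStreamDestroy(s_h2d));
    CHECK(cudaStreamDestroy(s_d2h));
    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return ok;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        // Fewer than 2 copy engines means H2D and D2H cannot run concurrently.
        std::printf("asyncEngineCount = %d\n", prop.asyncEngineCount);
    }
    const bool ok = run_overlap_sketch(1 << 24);  // 64 MiB per buffer
    std::printf("%s\n", ok ? "transfers completed" : "transfers failed");
    return ok ? 0 : 1;
}
```

Running this under `nsys profile` and looking at the memcpy rows in the timeline is the easiest way to see whether the two copies overlap on a given system. Swapping `cudaMallocHost` for plain `malloc` gives a pageable-memory variant to compare against, though what that shows is subject to the details mentioned above.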