The cudaMemcpyAsync API function reference (http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79) states:
“If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and the stream is non-zero, the copy may overlap with operations in other streams”
This seems strange, because I was able to overlap a kernel with a device-to-device copy running in different (non-default) streams. Why does the documentation mention only host-to-device and device-to-host copies? Sorry if this is a silly question, but I can’t understand what the quote above means, and I’m used to thinking that every word in documentation is meaningful.
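
For reference, here is a minimal sketch of the kind of setup described above: a kernel launched in one non-default stream and a cudaMemcpyAsync with cudaMemcpyDeviceToDevice issued in another. The kernel `busyKernel`, the helper `runOverlapSketch`, the buffer names, and the sizes are all hypothetical and chosen only for illustration; this is not the exact code from my test. Whether the two operations actually overlap depends on the device and on available resources, so the timeline has to be checked in a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a failing CUDA call and return 1 from the enclosing function.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel whose only purpose is to keep the GPU busy for a while.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Issues a kernel in stream s1 and a device-to-device copy in stream s2.
// Returns 0 if every CUDA call succeeded, 1 otherwise.
int runOverlapSketch(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *work = nullptr, *src = nullptr, *dst = nullptr;
    cudaStream_t s1, s2;

    CHECK(cudaMalloc(&work, bytes));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaMemset(work, 0, bytes));
    CHECK(cudaMemset(src, 0, bytes));

    // Two non-default streams, as in the scenario described above.
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaStreamCreate(&s2));

    // Kernel in s1; it touches only `work`.
    busyKernel<<<(n + 255) / 256, 256, 0, s1>>>(work, n);
    CHECK(cudaGetLastError());

    // Device-to-device copy in s2; it touches only `src` and `dst`,
    // so there is no data dependency on the kernel.
    CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, s2));

    CHECK(cudaStreamSynchronize(s1));
    CHECK(cudaStreamSynchronize(s2));

    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaStreamDestroy(s2));
    CHECK(cudaFree(work));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return 0;
}

int main()
{
    // 1 << 20 floats per buffer is an arbitrary illustrative size.
    return runOverlapSketch(1 << 20);
}
```

Built with nvcc and run under a timeline profiler, a program like this shows whether the kernel and the device-to-device copy execute concurrently on a given GPU.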