cudaMemcpyAsync behavior

Hi.

In the cudaMemcpyAsync API function reference (http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79) it is written that:

“If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and the stream is non-zero, the copy may overlap with operations in other streams”

This seems strange, because I was able to achieve overlap between a kernel and a device-to-device copy in different (non-default) streams. Sorry if it is a silly question, but I can't understand the meaning of the above quote, and I'm used to thinking that each and every word in the documentation is meaningful.
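
For reference, this is roughly what I did (a minimal sketch; the kernel and buffer names are just placeholders, not my real code): a kernel in one non-default stream and a device-to-device copy issued into another, and the profiler showed them overlapping.

```cpp
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 22;
    float *a, *src, *dst;
    cudaMalloc((void**)&a,   n * sizeof(float));
    cudaMalloc((void**)&src, n * sizeof(float));
    cudaMalloc((void**)&dst, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // kernel in stream s1 ...
    myKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    // ... device-to-device copy in stream s2, issued without waiting for s1
    cudaMemcpyAsync(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice, s2);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(src); cudaFree(dst);
    return 0;
}
```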

Many thanks.

“operations in other streams”

i suppose one should then define ‘operations’

in plain words, i think the paragraph intends to point out that a) the copy engines of a device may operate independently of the SMs of the device, and b) a device may have more than one copy engine and may use both simultaneously.
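
something like the following sketch exercises both points (assuming a device with two copy engines; the kernel and buffer names are made up, and the host buffers must be pinned for the copies to actually run asynchronously):

```cpp
#include <cuda_runtime.h>

__global__ void busyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i] + 1.0f;
}

int main()
{
    const int n = 1 << 24;
    float *h_in, *h_out, *d_in, *d_out, *d_work;
    cudaHostAlloc((void**)&h_in,  n * sizeof(float), cudaHostAllocDefault); // pinned
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocDefault); // pinned
    cudaMalloc((void**)&d_in,   n * sizeof(float));
    cudaMalloc((void**)&d_out,  n * sizeof(float));
    cudaMalloc((void**)&d_work, n * sizeof(float));

    cudaStream_t sH2D, sD2H, sKern;
    cudaStreamCreate(&sH2D);
    cudaStreamCreate(&sD2H);
    cudaStreamCreate(&sKern);

    // a) the kernel runs on the SMs while the copy engines move data;
    // b) with two copy engines, the H2D and D2H transfers can also overlap
    //    with each other
    cudaMemcpyAsync(d_in, h_in,  n * sizeof(float), cudaMemcpyHostToDevice, sH2D);
    busyKernel<<<(n + 255) / 256, 256, 0, sKern>>>(d_work, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, sD2H);

    cudaDeviceSynchronize();
    // cleanup of streams and buffers omitted for brevity
    return 0;
}
```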