cudaMemcpyAsync behavior


In the cudaMemcpyAsync API function reference it is written that:

“If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and the stream is non-zero, the copy may overlap with operations in other streams”

This seems strange, because I was able to achieve overlap between a kernel and a device-to-device copy in different (non-default) streams. Sorry if it is a silly question, but I can't understand what the quote above means, given that it only mentions host-to-device and device-to-host copies. And I'm used to thinking that each and every word in documentation is meaningful.
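For reference, here is a minimal sketch of the kind of test I mean. The kernel, the buffer names, and the sizes are made up for illustration. Whether the two operations actually overlap depends on the device and on free resources, so it has to be checked in a profiler timeline:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: does enough arithmetic per element to stay
// visible on a profiler timeline.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 22;                 // illustrative size
    const size_t bytes = n * sizeof(float);

    // Three device buffers: one for the kernel, two for the D2D copy.
    float *work = nullptr, *src = nullptr, *dst = nullptr;
    cudaMalloc(&work, bytes);
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemset(work, 0, bytes);
    cudaMemset(src, 0, bytes);

    // Two non-default streams, so neither operation implicitly
    // serializes against the other.
    cudaStream_t kernelStream, copyStream;
    cudaStreamCreate(&kernelStream);
    cudaStreamCreate(&copyStream);

    // Kernel in one stream, device-to-device copy in the other.
    busyKernel<<<(n + 255) / 256, 256, 0, kernelStream>>>(work, n);
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToDevice, copyStream);

    // Wait for both streams and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(kernelStream);
    cudaStreamDestroy(copyStream);
    cudaFree(work);
    cudaFree(src);
    cudaFree(dst);
    return err == cudaSuccess ? 0 : 1;
}
```

The program only issues the work and reports the final status. It does not measure overlap itself.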

Many thanks.

“operations in other streams”

I suppose one should then define 'operations'.

In plain words, I think the paragraph intends to point out that (a) the copy engines of a device may operate independently of the SMs of the device, so a host/device transfer can run while a kernel executes, and (b) a device may have more than one copy engine, and may use both simultaneously.
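To see how many copy engines a given device reports, one can query its properties. This is a minimal sketch, assuming device 0 is the GPU of interest. As far as I know, the `asyncEngineCount` field of `cudaDeviceProp` gives the number of asynchronous copy engines, and `concurrentKernels` says whether kernels from different streams may run at the same time:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: inspect the first GPU

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // asyncEngineCount: number of asynchronous copy engines.
    // concurrentKernels: nonzero if kernels from different streams
    // can execute concurrently.
    printf("%s: asyncEngineCount=%d, concurrentKernels=%d\n",
           prop.name, prop.asyncEngineCount, prop.concurrentKernels);
    return 0;
}
```

A count of 2 would suggest the device can copy in both directions at once while a kernel runs. A count of 1 would suggest only one transfer direction at a time.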