Does cudaMemcpyDeviceToDevice use both copy and execution engines?

The documentation for cudaMemcpy*() says that cudaMemcpyDeviceToDevice copies never overlap with kernel execution. Does this mean that the GPU’s execution engine (by way of specialized NVIDIA-coded kernels) actually performs the memory copy? Does this mean that the GPU’s copy engines are left idle, or does a cudaMemcpyDeviceToDevice occupy both a copy engine and an execution engine?

Huh, I should check on that. I think that's leftover from the days before concurrent kernels. DtoD copies are usually (but not always) implemented with a kernel, and when they are, I think they can overlap with other kernels.
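A minimal sketch of how one might look for that overlap, assuming two streams and a deliberately long-running dummy kernel (`busyKernel` is an invented name; whether the DtoD copy and the kernel actually overlap would need to be confirmed with a profiler such as Nsight Systems):

```cuda
#include <cuda_runtime.h>

__global__ void busyKernel(float *p, int n) {
    // Long-running kernel so any overlap window is visible in a profiler.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            p[i] = p[i] * 1.0001f + 0.0001f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernel in one stream, DtoD copy in another. If the copy is itself
    // implemented as a kernel, the profiler timeline should show whether
    // the two kernels run concurrently.
    busyKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    cudaMemcpyAsync(c, b, n * sizeof(float), cudaMemcpyDeviceToDevice, s2);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```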

Is this true for cudaArrays, including when the source or destination is in linear memory?

I wrote a quick test program that did DtoD copies using cudaMemcpy2D(), cudaMemcpy2DFromArray(), cudaMemcpy2DToArray(), and cudaMemcpy2DArrayToArray(), then ran it in cuda-gdb with "set cuda break_on_launch system". Only cudaMemcpy2D() triggered a breakpoint (as tmurry said it would); the array-based copies triggered none, so I assume those were carried out on the GPU without running kernels. I suppose the copy engine performed that work?
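The original test program wasn't posted; a rough reconstruction, assuming pitched linear memory and float cudaArrays (sizes and element type are my own choices), might look like:

```cuda
#include <cuda_runtime.h>

int main() {
    const int w = 256, h = 256;
    size_t pitch;
    float *src, *dst;
    cudaMallocPitch(&src, &pitch, w * sizeof(float), h);
    cudaMallocPitch(&dst, &pitch, w * sizeof(float), h);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arrA, arrB;
    cudaMallocArray(&arrA, &desc, w, h);
    cudaMallocArray(&arrB, &desc, w, h);

    // Linear -> linear DtoD: the only call that hit a kernel-launch
    // breakpoint under "set cuda break_on_launch system".
    cudaMemcpy2D(dst, pitch, src, pitch, w * sizeof(float), h,
                 cudaMemcpyDeviceToDevice);

    // Array-based DtoD copies: no breakpoints triggered, suggesting
    // no kernel launch was involved.
    cudaMemcpy2DToArray(arrA, 0, 0, src, pitch, w * sizeof(float), h,
                        cudaMemcpyDeviceToDevice);
    cudaMemcpy2DFromArray(dst, pitch, arrA, 0, 0, w * sizeof(float), h,
                          cudaMemcpyDeviceToDevice);
    cudaMemcpy2DArrayToArray(arrA, 0, 0, arrB, 0, 0, w * sizeof(float), h,
                             cudaMemcpyDeviceToDevice);

    cudaDeviceSynchronize();
    cudaFree(src); cudaFree(dst);
    cudaFreeArray(arrA); cudaFreeArray(arrB);
    return 0;
}
```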