Concurrency of device-to-device copies

The CUDA docs state that memory copies between two addresses in the same device memory are always concurrent.
Programming Guide :: CUDA Toolkit Documentation

So my question is: can device-to-device copies run concurrently with each other, the way several independent kernels can (that is, can two or more device-to-device copies be in flight at the same time)? Or is this still governed by the asynchronous copy engine?
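
To make the question concrete, here is a minimal sketch of the case I mean: two independent device-to-device copies issued on two different streams. The buffer names, the 64 MiB size, and the `CHECK` macro are placeholders I made up for illustration. The sketch only sets up the situation; it does not show whether the copies actually overlap on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: report a CUDA runtime error and exit main.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per buffer (arbitrary choice)

    // Two independent source/destination pairs, all in the same device memory.
    void *srcA = nullptr, *dstA = nullptr, *srcB = nullptr, *dstB = nullptr;
    CHECK(cudaMalloc(&srcA, bytes));
    CHECK(cudaMalloc(&dstA, bytes));
    CHECK(cudaMalloc(&srcB, bytes));
    CHECK(cudaMalloc(&dstB, bytes));
    CHECK(cudaMemset(srcA, 1, bytes));
    CHECK(cudaMemset(srcB, 2, bytes));

    // One non-default stream per copy, so neither copy is ordered after the other.
    cudaStream_t s0, s1;
    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaStreamCreate(&s1));

    // Both calls are issued without waiting for the previous copy to finish.
    // Whether the two copies then execute at the same time is the open question.
    CHECK(cudaMemcpyAsync(dstA, srcA, bytes, cudaMemcpyDeviceToDevice, s0));
    CHECK(cudaMemcpyAsync(dstB, srcB, bytes, cudaMemcpyDeviceToDevice, s1));

    // Wait for all outstanding work on the device before cleaning up.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(s0));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaFree(srcA));
    CHECK(cudaFree(dstA));
    CHECK(cudaFree(srcB));
    CHECK(cudaFree(dstB));
    std::printf("both copies completed\n");
    return 0;
}
```

I assume that looking at the timeline of a run like this in a profiler would show whether the two copies overlap, but I would like to know what behavior is actually expected.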

thanks