Concurrency of device-to-device copies

The CUDA docs state that memory copies between two addresses in the same device memory are always concurrent.

So my question is: can device-to-device copies run concurrently with each other, the way several independent kernels can? In other words, can two or more device-to-device copies be in flight at the same time, or is that still dictated by the asynchronous copy engine?
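
To make the question concrete, here is a minimal sketch of the situation I have in mind. The buffer names, the 64 MiB size, and the use of exactly two streams are arbitrary assumptions for illustration. It only issues the two copies in separate streams; whether they actually overlap is the open question, and that would have to be checked on a timeline in a profiler such as Nsight Systems.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per buffer (arbitrary choice)

    // Two independent source/destination pairs, all on the same device.
    char *srcA, *dstA, *srcB, *dstB;
    CHECK(cudaMalloc(&srcA, bytes));
    CHECK(cudaMalloc(&dstA, bytes));
    CHECK(cudaMalloc(&srcB, bytes));
    CHECK(cudaMalloc(&dstB, bytes));
    CHECK(cudaMemset(srcA, 1, bytes));
    CHECK(cudaMemset(srcB, 2, bytes));

    // One non-default stream per copy, so neither copy is ordered
    // after the other by stream semantics.
    cudaStream_t s0, s1;
    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaStreamCreate(&s1));

    // Both calls return to the host without waiting for the copy.
    // Whether the device executes them at the same time is the question.
    CHECK(cudaMemcpyAsync(dstA, srcA, bytes, cudaMemcpyDeviceToDevice, s0));
    CHECK(cudaMemcpyAsync(dstB, srcB, bytes, cudaMemcpyDeviceToDevice, s1));

    // Wait for both copies to finish before cleaning up.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(s0));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaFree(srcA));
    CHECK(cudaFree(dstA));
    CHECK(cudaFree(srcB));
    CHECK(cudaFree(dstB));
    return 0;
}
```

This should build with something like `nvcc d2d.cu -o d2d` (the file name is just a placeholder).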