“Copy engine” refers to a DMA mechanism. DMA allows data to be transferred between host and device while a kernel is executing on the GPU. The benefit of dual copy engines, coupled with the fact that PCIe is a full-duplex interconnect, is that you can build a “perfect” pipeline, where the following can happen simultaneously:
(1) Upload results from data chunk n-1 from device to host
(2) Run kernel that operates on data chunk n
(3) Download data chunk n+1 from host to device
In the ideal case, all data copies are completely overlapped by kernel execution, i.e. the transfers are “free”. With a single copy engine, steps (1) and (3) above cannot run concurrently, because both transfers have to share the one engine and are serialized. Note that the simultaneous upload / download of large amounts of data across PCIe can use up a significant portion of available system memory bandwidth. This shouldn’t be a problem for modern host systems.
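For illustration, here is a minimal sketch of how such a pipeline is commonly set up with CUDA streams. It is not taken from any particular application: the kernel `scale`, the `CHECK` macro, and the chunk and stream counts are placeholders I made up for the example. It assumes device 0 and pinned host memory (allocated with cudaMallocHost), which asynchronous copies need in order to overlap with kernel execution. The asyncEngineCount field of cudaDeviceProp reports how many copy engines the device exposes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel: doubles every element of one chunk in place.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int kChunks = 8;            // number of data chunks (assumed)
    const int kChunkElems = 1 << 20;  // elements per chunk (assumed)
    const int kStreams = 3;           // one stream per pipeline stage in flight
    const size_t chunkBytes = kChunkElems * sizeof(float);
    const size_t totalElems = (size_t)kChunks * kChunkElems;

    // 1 = copies in one direction at a time, 2 = both directions at once.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    // Pinned host buffer holding all chunks.
    float *h = nullptr;
    CHECK(cudaMallocHost(&h, kChunks * chunkBytes));
    for (size_t i = 0; i < totalElems; ++i) h[i] = 1.0f;

    // One device buffer and one stream per slot.
    float *d[kStreams];
    cudaStream_t s[kStreams];
    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaMalloc(&d[k], chunkBytes));
        CHECK(cudaStreamCreate(&s[k]));
    }

    // Each chunk is copied in, processed, and copied back in its own stream.
    // Work in different streams may overlap; work within a stream stays
    // ordered, so reusing d[k] for chunk c + kStreams is safe.
    for (int c = 0; c < kChunks; ++c) {
        int k = c % kStreams;
        float *hc = h + (size_t)c * kChunkElems;
        CHECK(cudaMemcpyAsync(d[k], hc, chunkBytes,
                              cudaMemcpyHostToDevice, s[k]));
        scale<<<(kChunkElems + 255) / 256, 256, 0, s[k]>>>(d[k], kChunkElems);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hc, d[k], chunkBytes,
                              cudaMemcpyDeviceToHost, s[k]));
    }
    CHECK(cudaDeviceSynchronize());

    // Verify that every element was doubled exactly once.
    size_t bad = 0;
    for (size_t i = 0; i < totalElems; ++i) bad += (h[i] != 2.0f);
    printf("%s\n", bad == 0 ? "all chunks processed" : "mismatch detected");

    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaStreamDestroy(s[k]));
        CHECK(cudaFree(d[k]));
    }
    CHECK(cudaFreeHost(h));
    return bad == 0 ? 0 : 1;
}
```

How much overlap you actually get depends on the hardware and on how long the kernel runs relative to the copies, so it is worth checking the timeline in a profiler rather than assuming the transfers are hidden.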
You may also want to look at http://www.nvidia.com/object/why-choose-tesla.html