Need an easy explanation of Dual Copy Engine.

I am working on a proposal for setting up a CUDA Linux box for my research lab, and I was hoping someone could give me a concise rundown of the advantages of the dual copy engines available on the Teslas/Quadros vs. the lack thereof in the cheaper GeForce cards. This will be for extensive floating-point calculations for computational chemistry.


“Copy engine” refers to a DMA mechanism. DMA allows the transfer of data between host and device while a kernel is executing on the GPU. The benefit of dual copy engines, coupled with the fact that PCIe is a full-duplex interconnect, is that you can build a “perfect” pipeline, where the following can happen simultaneously:

(1) Upload results from data chunk n-1 from device to host
(2) Run kernel that operates on data chunk n
(3) Download data chunk n+1 from host to device

In the ideal case, all data copies are completely overlapped by kernel execution, i.e. the transfers are “free”. With a single copy engine, steps (1) and (3) above cannot run concurrently, because both transfers have to share the one DMA engine. Note that the simultaneous upload / download of large amounts of data across PCIe can use up a significant portion of available system memory bandwidth. This shouldn’t be a problem for a modern host system.
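To make the three-stage pipeline above a bit more concrete, here is a minimal sketch of how it is commonly set up with CUDA streams and cudaMemcpyAsync(). The kernel scale(), the helper run_pipeline(), the choice of three streams, and the chunk sizes are all hypothetical, picked only for illustration; they stand in for your real computation. The sketch assumes device 0 and pinned host memory (needed for the copies to be truly asynchronous). Whether the two copy directions actually overlap depends on the hardware; the asyncEngineCount device property reports how many copy engines the GPU exposes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real per-chunk computation.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: pushes numChunks chunks of chunkElems floats
// through a copy-in / kernel / copy-out pipeline. Returns 0 on success.
int run_pipeline(int numChunks, int chunkElems) {
    const int kStreams = 3;  // assumption: one stream per pipeline stage in flight
    const size_t chunkBytes = (size_t)chunkElems * sizeof(float);
    const size_t total = (size_t)numChunks * chunkElems;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    // 2 means host-to-device and device-to-host copies can run concurrently.
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    // Pinned (page-locked) host buffer so the async copies can use DMA.
    float *hostBuf = nullptr;
    if (cudaMallocHost(&hostBuf, total * sizeof(float)) != cudaSuccess) return 1;
    for (size_t i = 0; i < total; ++i) hostBuf[i] = 1.0f;

    cudaStream_t streams[kStreams];
    float *devBuf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&devBuf[s], chunkBytes);
    }

    // Work within one stream runs in order; work in different streams may
    // overlap. While chunk n computes, chunk n+1 can be copied in and
    // chunk n-1 copied out, if the hardware has the copy engines for it.
    for (int c = 0; c < numChunks; ++c) {
        int s = c % kStreams;
        float *h = hostBuf + (size_t)c * chunkElems;
        cudaMemcpyAsync(devBuf[s], h, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunkElems + 255) / 256, 256, 0, streams[s]>>>(devBuf[s], chunkElems, 2.0f);
        cudaMemcpyAsync(h, devBuf[s], chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams to drain, then verify every element was doubled.
    int bad = (cudaDeviceSynchronize() != cudaSuccess);
    for (size_t i = 0; i < total && !bad; ++i) bad = (hostBuf[i] != 2.0f);

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(devBuf[s]);
        cudaStreamDestroy(streams[s]);
    }
    cudaFreeHost(hostBuf);
    return bad;
}

int main() {
    int rc = run_pipeline(8, 1 << 20);
    printf(rc == 0 ? "pipeline OK\n" : "pipeline FAILED\n");
    return rc;
}
```

The code produces the same results on a GPU with one copy engine; the difference is only in how much of the transfer time ends up hidden. A profiler timeline is the easiest way to check whether the copies in both directions really overlap with the kernels on a given card.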

You may also want to look at

Does cudaThreadSynchronize() have any effect on these copies (assuming a Tesla K20)?

In other words, let's say I have a host loop which first copies memory from host to device, calls a couple of kernels with cudaThreadSynchronize(), then copies the new results back to the host.

How does the order of the kernels and copies affect the overlapping transfers?

Thanks, njuffa, that will hopefully put it on a level my boss can understand.