I am working on a proposal for setting up a CUDA Linux box for my research lab, and I was hoping someone could give me a concise rundown of the advantages of the dual copy engines available on the Teslas/Quadros versus the lack thereof on the cheaper GeForce cards. The machine will be used for extensive floating-point calculations for computational chemistry.
“Copy engine” refers to a DMA mechanism. DMA allows the transfer of data between host and device while a kernel is executing on the GPU. The benefit of dual copy engines, coupled with the fact that PCIe is a full-duplex interconnect, is that you can build a “perfect” pipeline, where the following can happen simultaneously:
(1) Upload results from data chunk n-1 from device to host
(2) Run kernel that operates on data chunk n
(3) Download data chunk n+1 from host to device
In the ideal case, all data copies are completely overlapped by kernel execution, i.e. the transfers are “free”. With a single copy engine, steps (1) and (3) above cannot run concurrently. Note that the simultaneous upload/download of large amounts of data across PCIe can consume a significant portion of the available system memory bandwidth, but this shouldn’t be a problem on a modern host system.
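In code, a pipeline like this is typically built with cudaMemcpyAsync() and per-chunk streams. Here is a minimal sketch; the chunk count, sizes, and the kernel process_chunk are made up for illustration. Note that the host buffer must be pinned (allocated with cudaMallocHost()) for the async copies to actually overlap:

```
#include <cuda_runtime.h>

#define NUM_CHUNKS 8
#define CHUNK_ELEMS (1 << 20)

__global__ void process_chunk(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   /* placeholder computation */
}

int main(void)
{
    size_t bytes = CHUNK_ELEMS * sizeof(float);
    float *h_buf, *d_buf[NUM_CHUNKS];
    cudaStream_t stream[NUM_CHUNKS];

    cudaMallocHost((void **)&h_buf, NUM_CHUNKS * bytes);  /* pinned host memory */
    for (int c = 0; c < NUM_CHUNKS; ++c) {
        cudaMalloc((void **)&d_buf[c], bytes);
        cudaStreamCreate(&stream[c]);
    }

    /* Issue each chunk's H2D copy, kernel, and D2H copy into its own stream.
       The hardware is then free to overlap one chunk's copies with another
       chunk's kernel, as described in steps (1)-(3) above. */
    for (int c = 0; c < NUM_CHUNKS; ++c) {
        float *h = h_buf + c * CHUNK_ELEMS;
        cudaMemcpyAsync(d_buf[c], h, bytes, cudaMemcpyHostToDevice, stream[c]);
        process_chunk<<<(CHUNK_ELEMS + 255) / 256, 256, 0, stream[c]>>>(d_buf[c], CHUNK_ELEMS);
        cudaMemcpyAsync(h, d_buf[c], bytes, cudaMemcpyDeviceToHost, stream[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < NUM_CHUNKS; ++c) {
        cudaFree(d_buf[c]);
        cudaStreamDestroy(stream[c]);
    }
    cudaFreeHost(h_buf);
    return 0;
}
```

On a dual-copy-engine card, the device-to-host copy of one chunk can run concurrently with the host-to-device copy of a later chunk; with a single copy engine those two copies serialize, although each one can still overlap kernel execution.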
Does cudaThreadSynchronize() have any effect on these copies (assuming a Tesla K20)?
In other words, let’s say I have a host loop which first copies memory from host to device, calls a couple of kernels followed by cudaThreadSynchronize(), and then copies the new results back to the host.
How does the order of the kernels and copies affect the overlapping of transfers?
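For concreteness, the loop I have in mind looks roughly like this (kernel names, sizes, and iteration count are made up):

```
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void kernel_a(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernel_b(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);   /* note: not pinned */
    float *d;
    cudaMalloc((void **)&d, bytes);

    for (int iter = 0; iter < 10; ++iter) {
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   /* blocking H2D */
        kernel_a<<<(n + 255) / 256, 256>>>(d, n);
        kernel_b<<<(n + 255) / 256, 256>>>(d, n);
        cudaThreadSynchronize();  /* deprecated; cudaDeviceSynchronize() is the modern equivalent */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   /* blocking D2H */
    }

    cudaFree(d);
    free(h);
    return 0;
}
```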