Parallelizing data transfer with kernel execution

mjacobson · January 11, 2014, 8:45pm

I have an application where I run a kernel several hundred times sequentially. Is there some way to cudaMemcpy data generated in a previous iteration of the kernel back to the host while the current iteration is in progress? In other words, can I pipeline kernel execution with device-to-host data transfer?

CudaaduC · January 11, 2014, 10:00pm

Which GPU are you using? I believe that the dual-copy engine on the Tesla line can do exactly that.

mjacobson · January 12, 2014, 1:18am

Thanks, but I’m using the GTX 580. So only Teslas can do it? I have ordered a newer card, GTX 700 something…

CudaaduC · January 12, 2014, 3:23am

I believe only Teslas and the higher-end Quadro GPUs offer this feature.

The GTX 700 series is very fast in terms of 32-bit GFLOPS and has fast PCIe 3.0 host-to-device and device-to-host copy speeds (assuming the motherboard also supports PCIe 3.0), but as far as I know it does not have the dual copy engine.
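Rather than inferring this from the product line, the number of copy engines can be read from the device itself. The following is a minimal sketch, assuming a CUDA runtime recent enough (4.0 or later) for `cudaDeviceProp` to expose `asyncEngineCount`; the output format is only illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        // asyncEngineCount:
        //   0 = copies cannot overlap with kernel execution
        //   1 = one copy (in either direction) can overlap with kernels
        //   2 = copies in both directions can overlap with kernels
        std::printf("device %d: %s\n", dev, prop.name);
        std::printf("  asyncEngineCount  = %d\n", prop.asyncEngineCount);
        std::printf("  concurrentKernels = %d\n", prop.concurrentKernels);
        std::printf("  canMapHostMemory  = %d\n", prop.canMapHostMemory);
    }
    return 0;
}
```

A value of 1 is already enough to overlap a single device-to-host copy with a running kernel; 2 is only needed when copies in both directions should overlap at the same time.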

allanmac · January 12, 2014, 4:27am

Additional suggestions…

If you don’t have a dual-copy-capable device, you should be able to achieve a similar result by leveraging concurrent kernel execution: create two or more streams, then launch your compute kernel followed by a custom device-to-host copy kernel in each stream.

The copy kernel would write from the device back to the host into mapped pinned memory.

It’s a more complex solution though.
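A rough sketch of that approach follows. The kernel names `compute` and `copyToHost`, the buffer sizes, and the iteration count are all hypothetical, and the sketch assumes the iterations are independent of one another and that the device supports both concurrent kernels and mapped host memory. Whether the two kernels actually run side by side depends on how many resources the compute kernel leaves free.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real compute kernel.
__global__ void compute(float *out, int n, int iter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<float>(i + iter);
}

// Copy kernel: dst is the device-side alias of mapped pinned host memory,
// so the stores go to host memory without using a copy engine.
__global__ void copyToHost(float *dst, const float *src, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

int main() {
    const int n = 1 << 20, iters = 8, nStreams = 2;
    const size_t bytes = n * sizeof(float);

    // Must precede context creation on older runtimes; harmless otherwise.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    cudaStream_t stream[nStreams];
    float *dBuf[nStreams], *hBuf[nStreams], *hBufDev[nStreams];
    for (int k = 0; k < nStreams; ++k) {
        CHECK(cudaStreamCreate(&stream[k]));
        CHECK(cudaMalloc((void **)&dBuf[k], bytes));
        CHECK(cudaHostAlloc((void **)&hBuf[k], bytes, cudaHostAllocMapped));
        CHECK(cudaHostGetDevicePointer((void **)&hBufDev[k], hBuf[k], 0));
    }

    bool ok = true;
    for (int it = 0; it < iters + nStreams; ++it) {
        int k = it % nStreams;
        // Wait for this stream's previous compute + copy pair, then consume
        // its result on the host. The other stream keeps running meanwhile.
        CHECK(cudaStreamSynchronize(stream[k]));
        if (it >= nStreams)
            ok = ok && hBuf[k][n - 1] ==
                           static_cast<float>(n - 1 + it - nStreams);
        if (it < iters) {
            compute<<<(n + 255) / 256, 256, 0, stream[k]>>>(dBuf[k], n, it);
            // A small grid leaves room for the other stream's compute kernel.
            copyToHost<<<8, 256, 0, stream[k]>>>(hBufDev[k], dBuf[k], n);
            CHECK(cudaGetLastError());
        }
    }

    std::printf("%s\n", ok ? "results verified" : "MISMATCH");

    for (int k = 0; k < nStreams; ++k) {
        CHECK(cudaFree(dBuf[k]));
        CHECK(cudaFreeHost(hBuf[k]));
        CHECK(cudaStreamDestroy(stream[k]));
    }
    return ok ? 0 : 1;
}
```

If each iteration consumes the previous iteration's output, the streams would additionally need to be ordered against each other with events.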

mjacobson · January 12, 2014, 3:04pm

That looks promising. Thanks. Would there be a good tutorial anywhere on the topic of streams?

pasoleatis · January 12, 2014, 5:27pm

Overlapping copying with calculation is well documented in the Programming Guide. Take a look here (though there is more about this in the guide): Programming Guide :: CUDA Toolkit Documentation

seibert · January 13, 2014, 1:10am

It was my understanding that the GeForce devices had 1 DMA engine, allowing 1 device-to-host, host-to-device, or device-to-device copy to proceed in parallel with kernel execution. You just need to make sure that you use cudaMemcpyAsync in a different CUDA stream from the one the kernel is executing in.
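As a sketch of what that could look like for the original question: the kernel runs in one stream and alternates between two device buffers, while a second stream copies the previous iteration's buffer back with `cudaMemcpyAsync`. The kernel `step`, the sizes, and the event-based ordering are assumptions made for illustration. Note that the host destination has to be pinned (here via `cudaMallocHost`) for the copy to be asynchronous.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for one iteration of the real kernel.
__global__ void step(float *out, int n, int iter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = static_cast<float>(i + iter);
}

int main() {
    const int n = 1 << 18, iters = 8;
    const size_t bytes = n * sizeof(float);

    cudaStream_t computeStream, copyStream;
    CHECK(cudaStreamCreate(&computeStream));
    CHECK(cudaStreamCreate(&copyStream));

    float *dBuf[2], *hOut;
    cudaEvent_t computed[2], copied[2];
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaMalloc((void **)&dBuf[b], bytes));
        CHECK(cudaEventCreateWithFlags(&computed[b], cudaEventDisableTiming));
        CHECK(cudaEventCreateWithFlags(&copied[b], cudaEventDisableTiming));
    }
    // Pinned host memory, one slot per iteration.
    CHECK(cudaMallocHost((void **)&hOut, bytes * iters));

    for (int it = 0; it < iters; ++it) {
        int b = it % 2;
        // Do not overwrite dBuf[b] until its previous contents were copied.
        // Waiting on a never-recorded event is a no-op, so the first two
        // iterations pass straight through.
        CHECK(cudaStreamWaitEvent(computeStream, copied[b], 0));
        step<<<(n + 255) / 256, 256, 0, computeStream>>>(dBuf[b], n, it);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(computed[b], computeStream));

        // The copy waits only for this iteration's kernel; the next kernel
        // (which writes the other buffer) can run while the copy proceeds.
        CHECK(cudaStreamWaitEvent(copyStream, computed[b], 0));
        CHECK(cudaMemcpyAsync(hOut + (size_t)it * n, dBuf[b], bytes,
                              cudaMemcpyDeviceToHost, copyStream));
        CHECK(cudaEventRecord(copied[b], copyStream));
    }
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int it = 0; it < iters && ok; ++it)
        for (int i = 0; i < n && ok; ++i)
            ok = hOut[(size_t)it * n + i] == static_cast<float>(i + it);
    std::printf("%s\n", ok ? "results verified" : "MISMATCH");

    for (int b = 0; b < 2; ++b) {
        CHECK(cudaFree(dBuf[b]));
        CHECK(cudaEventDestroy(computed[b]));
        CHECK(cudaEventDestroy(copied[b]));
    }
    CHECK(cudaFreeHost(hOut));
    CHECK(cudaStreamDestroy(computeStream));
    CHECK(cudaStreamDestroy(copyStream));
    return ok ? 0 : 1;
}
```

The two events per buffer are what keep the pipeline correct: `computed` stops the copy from starting early, and `copied` stops the kernel from overwriting a buffer that is still being transferred.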

The Tesla and Quadro series have 2 DMA engines, which lets you implement a triple-buffered data-processing pipeline. You can then overlap the transfer of one buffer from host-to-device, the processing of a second buffer in a kernel, and the transfer of a third buffer from device-to-host.
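One common way to write such a pipeline is to split the input into chunks and cycle them through three streams, each with its own device buffer. The sketch below assumes a hypothetical kernel `scale` and illustrative chunk sizes. On a device with a single copy engine it still runs correctly, but the two copy directions would not overlap with each other.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical processing kernel: doubles each element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int nBuf = 3, nChunks = 12, chunk = 1 << 18;
    const size_t chunkBytes = chunk * sizeof(float);
    const size_t total = (size_t)nChunks * chunk;

    // Pinned host buffers so cudaMemcpyAsync can run asynchronously.
    float *hIn, *hOut;
    CHECK(cudaMallocHost((void **)&hIn, total * sizeof(float)));
    CHECK(cudaMallocHost((void **)&hOut, total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) hIn[i] = static_cast<float>(i);

    cudaStream_t stream[nBuf];
    float *dBuf[nBuf];
    for (int k = 0; k < nBuf; ++k) {
        CHECK(cudaStreamCreate(&stream[k]));
        CHECK(cudaMalloc((void **)&dBuf[k], chunkBytes));
    }

    for (int c = 0; c < nChunks; ++c) {
        int k = c % nBuf;
        size_t off = (size_t)c * chunk;
        // Operations within one stream run in order, so reusing dBuf[k]
        // for chunk c + 3 is safe. Different streams may overlap:
        // upload of one chunk, kernel on another, download of a third.
        CHECK(cudaMemcpyAsync(dBuf[k], hIn + off, chunkBytes,
                              cudaMemcpyHostToDevice, stream[k]));
        scale<<<(chunk + 255) / 256, 256, 0, stream[k]>>>(dBuf[k], chunk);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hOut + off, dBuf[k], chunkBytes,
                              cudaMemcpyDeviceToHost, stream[k]));
    }
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (size_t i = 0; i < total && ok; ++i) ok = hOut[i] == 2.0f * hIn[i];
    std::printf("%s\n", ok ? "results verified" : "MISMATCH");

    for (int k = 0; k < nBuf; ++k) {
        CHECK(cudaFree(dBuf[k]));
        CHECK(cudaStreamDestroy(stream[k]));
    }
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return ok ? 0 : 1;
}
```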