I have an application where I run a kernel several hundred times sequentially. Is there some way to cudaMemcpy data generated in a previous iteration of the kernel back to the host while the current iteration is in progress? In other words, to pipeline kernel execution with device-to-host data transfer?
Which GPU are you using? I believe that the dual-copy engine on the Tesla line can do exactly that.
Thanks, but I’m using the GTX 580. So only Teslas can do it? I have ordered a newer card, GTX 700 something…
I believe only Teslas and the higher-end Quadro GPUs offer this feature.
The GTX 700 series is very fast in terms of 32-bit Gflops and has quick PCIe 3.0 host-to-device and device-to-host copy speeds (assuming the motherboard also supports PCIe 3.0), but (as far as I know) it does not have the dual copy engine.
If you don’t have a dual-copy-capable device, you should be able to achieve a similar result by leveraging concurrent kernel execution: create two or more streams, then in each stream launch your compute kernel followed by a custom device-to-host copy kernel.
The copy kernel would write the results from device memory directly into mapped pinned host memory.
It’s a more complex solution, though.
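A minimal sketch of that copy-kernel approach might look like the following. All kernel and variable names here are hypothetical, and error checking is omitted for brevity; the key points are enabling mapped host memory before any CUDA calls, allocating the host buffers with cudaHostAllocMapped, and launching the copy kernel in the same stream as its compute kernel so it drains the correct results:

```cuda
#include <cuda_runtime.h>

__global__ void computeKernel(float *d_out, int n) { /* ... your computation ... */ }

// Simple device-to-host "copy kernel": each thread writes one element
// into host memory that has been mapped into the device address space.
__global__ void copyToHostKernel(const float *d_src, float *h_mapped_dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        h_mapped_dst[i] = d_src[i];
}

int main()
{
    const int n = 1 << 20;
    const int nStreams = 2;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation

    float *h_result[nStreams], *d_result[nStreams], *d_mapped[nStreams];
    cudaStream_t stream[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_result[s], n * sizeof(float));
        cudaHostAlloc(&h_result[s], n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer(&d_mapped[s], h_result[s], 0);
    }

    dim3 block(256), grid((n + block.x - 1) / block.x);
    for (int iter = 0; iter < 400; ++iter) {
        int s = iter % nStreams;
        // Within one stream the two launches serialize, but the copy kernel
        // of one stream can overlap the compute kernel of the other stream
        // on devices that support concurrent kernel execution.
        computeKernel<<<grid, block, 0, stream[s]>>>(d_result[s], n);
        copyToHostKernel<<<grid, block, 0, stream[s]>>>(d_result[s], d_mapped[s], n);
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Note that writes through a mapped pointer cross the PCIe bus directly, so the copy kernel should use coalesced accesses as above to get reasonable transfer bandwidth.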
That looks promising. Thanks. Would there be a good tutorial anywhere on the topic of streams?
Overlapping copying with calculation is well documented in the Programming Guide. Take a look here (though there is more about this in the guide): http://docs.nvidia.com/cuda/cuda-c-programming-guide/#streams
It was my understanding that the GeForce devices had 1 DMA engine, allowing 1 device-to-host, host-to-device, or device-to-device copy to proceed in parallel with kernel execution. You just need to make sure that you use cudaMemcpyAsync in a different cudaStream than the kernel is executing in.
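For the original question, a single DMA engine is enough: a double-buffered loop where each iteration's kernel overlaps the copy-back of the previous iteration's results. This sketch uses hypothetical names, omits error checking, and assumes pinned host memory, which cudaMemcpyAsync requires in order to overlap at all:

```cuda
#include <cuda_runtime.h>

__global__ void computeKernel(float *d_buf, int n) { /* ... your computation ... */ }

int main()
{
    const int n = 1 << 20;
    float *d_buf[2], *h_buf[2];
    cudaStream_t stream[2];

    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_buf[i], n * sizeof(float));
        cudaMallocHost(&h_buf[i], n * sizeof(float));  // pinned host memory
    }

    for (int iter = 0; iter < 400; ++iter) {
        int cur = iter % 2, prev = 1 - cur;
        // Launch this iteration's kernel in one stream...
        computeKernel<<<(n + 255) / 256, 256, 0, stream[cur]>>>(d_buf[cur], n);
        if (iter > 0) {
            // ...while the previous iteration's results copy back in the other.
            cudaMemcpyAsync(h_buf[prev], d_buf[prev], n * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[prev]);
            cudaStreamSynchronize(stream[prev]);
            // h_buf[prev] is now safe to consume on the host.
        }
    }
    cudaDeviceSynchronize();
    return 0;
}
```

The host-side synchronize only blocks on the copy stream, so the current kernel keeps running on the GPU while the host waits for (and then processes) the previous iteration's data.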
The Tesla and Quadro series have 2 DMA engines, which lets you implement a triple-buffered data-processing pipeline. You can then overlap the transfer of one buffer from host-to-device, the processing of a second buffer in a kernel, and the transfer of a third buffer from device-to-host.
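The triple-buffered pipeline can be sketched roughly as follows (hypothetical names, no error checking). Each chunk's upload, kernel, and download are queued into one of three streams; within a stream they serialize, but in steady state the H2D copy of chunk i+1, the kernel for chunk i, and the D2H copy of chunk i-1 all run concurrently on a device with two DMA engines:

```cuda
#include <cuda_runtime.h>

__global__ void processKernel(float *d_buf, int n) { /* ... your computation ... */ }

int main()
{
    const int n = 1 << 20, nChunks = 300, nBuf = 3;
    float *h_in[nBuf], *h_out[nBuf], *d_buf[nBuf];
    cudaStream_t stream[nBuf];

    for (int b = 0; b < nBuf; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaMalloc(&d_buf[b], n * sizeof(float));
        cudaMallocHost(&h_in[b], n * sizeof(float));   // pinned
        cudaMallocHost(&h_out[b], n * sizeof(float));  // pinned
    }

    for (int c = 0; c < nChunks; ++c) {
        int b = c % nBuf;
        cudaStreamSynchronize(stream[b]);  // wait until buffer b is free again
        // (fill h_in[b] with chunk c's input here)
        cudaMemcpyAsync(d_buf[b], h_in[b], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        processKernel<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], n);
        cudaMemcpyAsync(h_out[b], d_buf[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();
    return 0;
}
```

On a GeForce card with a single DMA engine the same code still works; the two copy directions just won't overlap each other, though each can still overlap kernel execution.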