I’m trying to do some streaming-like signal processing on my Tesla C870 and therefore have some memcopy questions:
I would like to use ping pong buffering.
Is it possible to “hide” memcopies (at least host <-> device) behind processing algorithms? If a memcopy sets up a DMA transfer, the copy should be able to proceed on its own.
If so, how can I make sure a memcopy is finished before I use the copied data?
Maybe the keyword here is “asynchronous” memcopy. But I don’t really know how to deal with streams. Could anyone please explain briefly how to use them?
In the profiler output of my program so far, there is a lot of idle time before the device-to-host memcopy. The delay depends on the size of the data copied back to host memory: for 280MB I see an idle time of about 0.31s, which is more than the complete “busy” time of my program. I have already moved the malloc into the program’s initialization, but this didn’t solve the problem. Could this be a latency connected to the DRAM on my host?
One additional note here: The GPU in the Tesla cards cannot overlap GPU execution and host<->device memory transfer. Newer GPUs have this capability, though. All GPUs, including Tesla, can overlap GPU execution and memory copies with CPU execution as MisterAnderson42 describes.
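To make the ping-pong idea concrete, here is a minimal sketch of double buffering with two streams. It is only an illustration: the `scale` kernel, the chunk size, and the chunk count are made-up stand-ins for the real processing, and the copy-engine query uses `cudaDevAttrAsyncEngineCount` from current toolkits (older toolkits exposed a `deviceOverlap` device property for the same purpose). The host buffer is page-locked with `cudaMallocHost`, which asynchronous copies need in order to actually run asynchronously. On a GPU without a copy engine the code still gives correct results; the copies and kernels simply run one after another.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the real signal-processing kernel.
__global__ void scale(float *d, int n, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= gain;
}

int main() {
    const int chunk = 1 << 20;   // floats per chunk (assumed size)
    const int nChunks = 8;       // number of chunks in the "stream" of data
    const size_t bytes = chunk * sizeof(float);

    // 0 copy engines means copies cannot overlap kernel execution.
    int engines = 0;
    CHECK(cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0));
    printf("async copy engines: %d\n", engines);

    // Page-locked host memory, required for truly asynchronous copies.
    float *h = NULL;
    CHECK(cudaMallocHost(&h, nChunks * bytes));
    for (size_t i = 0; i < (size_t)nChunks * chunk; ++i) h[i] = 1.0f;

    // Two device buffers (ping and pong), each with its own stream.
    float *d[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaMalloc(&d[b], bytes));
        CHECK(cudaStreamCreate(&s[b]));
    }

    for (int c = 0; c < nChunks; ++c) {
        int b = c & 1;                        // alternate between buffers
        float *hc = h + (size_t)c * chunk;
        // Work within one stream runs in order, so buffer b is never
        // overwritten before its previous result has been copied back.
        CHECK(cudaMemcpyAsync(d[b], hc, bytes, cudaMemcpyHostToDevice, s[b]));
        scale<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d[b], chunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(hc, d[b], bytes, cudaMemcpyDeviceToHost, s[b]));
    }

    // The host may only touch the results after the streams have finished.
    for (int b = 0; b < 2; ++b) CHECK(cudaStreamSynchronize(s[b]));

    int bad = 0;
    for (size_t i = 0; i < (size_t)nChunks * chunk; ++i)
        if (h[i] != 2.0f) ++bad;
    printf("mismatches: %d\n", bad);

    for (int b = 0; b < 2; ++b) {
        CHECK(cudaStreamDestroy(s[b]));
        CHECK(cudaFree(d[b]));
    }
    CHECK(cudaFreeHost(h));
    return bad != 0;
}
```

The `cudaStreamSynchronize` calls answer the "how can I make sure a memcopy is finished" question: everything queued in a stream before the call is complete once it returns. The CPU is free to do other work between queueing the chunks and synchronizing.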
Your code does not need cudaThreadSynchronize to be correct. Adding it could lower your performance, especially if you want to use streams (though I’m not sure if you can control the stream that cufft uses…)
The issue with the apparent overlaps you noticed is that the timestamp in the profiler appears to record the time when the operation was added to the queue, not when the operation actually executes on the GPU.
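If you want timings that reflect when work actually runs on the GPU, CUDA events are one option, since they are recorded in the GPU's own queue. The sketch below is only an illustration under assumed sizes: the `busy` kernel and its iteration count are hypothetical placeholders for the real processing. It times the kernel and the device-to-host copy separately. A plain `cudaMemcpy` also waits for the preceding kernel to finish, so host-side timing around the copy would include the kernel's run time, which is one plausible reading of the "idle" time before the copy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel that just burns some GPU time per element.
__global__ void busy(float *d, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        d[i] = v;
    }
}

int main() {
    const int n = 1 << 22;                 // assumed problem size
    const size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    if (h == NULL) return 1;
    float *d = NULL;
    CHECK(cudaMalloc(&d, bytes));
    CHECK(cudaMemset(d, 0, bytes));

    cudaEvent_t t0, t1, t2;
    CHECK(cudaEventCreate(&t0));
    CHECK(cudaEventCreate(&t1));
    CHECK(cudaEventCreate(&t2));

    CHECK(cudaEventRecord(t0, 0));
    busy<<<(n + 255) / 256, 256>>>(d, n, 1000);  // returns to the CPU at once
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(t1, 0));
    // Blocking copy: starts only after the kernel above has finished.
    CHECK(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(t2, 0));
    CHECK(cudaEventSynchronize(t2));

    float msKernel = 0.0f, msCopy = 0.0f;
    CHECK(cudaEventElapsedTime(&msKernel, t0, t1));
    CHECK(cudaEventElapsedTime(&msCopy, t1, t2));
    printf("kernel: %.3f ms, device-to-host copy: %.3f ms\n", msKernel, msCopy);

    CHECK(cudaEventDestroy(t0));
    CHECK(cudaEventDestroy(t1));
    CHECK(cudaEventDestroy(t2));
    CHECK(cudaFree(d));
    free(h);
    return 0;
}
```

No explicit cudaThreadSynchronize is needed here: the blocking copy and `cudaEventSynchronize` already guarantee that the data and the timings are valid when they are read.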