Overlapping data transfers with kernel execution

Mark mentioned a long time ago that CUDA might introduce support for overlapping data transfers to and from the GPU with the execution of a kernel on the GPU (I’m not referring to overlapping CPU and GPU execution, btw). Does anyone know if this is supported yet?

You can do this with async memcopies and streams. Check the simpleStreams project in the SDK for some hints.
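
For anyone who wants a starting point before digging through simpleStreams, here is a minimal sketch of the async memcopy plus streams pattern. It is not the SDK sample itself: the `scale` kernel, the buffer names and the chunk size are made up for illustration, and error checking is left out to keep it short. Note that the host buffer has to be page-locked (allocated with `cudaMallocHost`) for `cudaMemcpyAsync` to be truly asynchronous.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration only: doubles every element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int nStreams = 2;               // assumed number of streams
    const int chunk = 1 << 20;            // assumed elements per stream
    const size_t bytes = chunk * sizeof(float);

    float *h_buf = 0, *d_buf = 0;
    // Async copies need page-locked (pinned) host memory.
    cudaMallocHost((void **)&h_buf, nStreams * bytes);
    cudaMalloc((void **)&d_buf, nStreams * bytes);
    for (int i = 0; i < nStreams * chunk; ++i) h_buf[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream gets its own copy-in, kernel and copy-out.
    // Work queued in different streams is allowed to overlap.
    for (int s = 0; s < nStreams; ++s) {
        float *h = h_buf + s * chunk;
        float *d = d_buf + s * chunk;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for every stream to finish before touching the results.
    for (int s = 0; s < nStreams; ++s) cudaStreamSynchronize(streams[s]);
    printf("h_buf[0] = %f\n", h_buf[0]);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Whether anything actually overlaps depends on the card and the driver, and on some hardware the order in which the operations are issued can affect how much overlap you get, so treat this as a skeleton to time on your own setup.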

Great, thanks!

One other thing. The documentation states clearly that kernel execution and data transfers can overlap. However, it doesn’t seem to specifically state whether transfers to the device and from the device can overlap. Is this possible (physically the pcie bus supports it…right?)?

This is a hardware feature and only supported in a few cards. Check the “Device properties” explanation in the programming guide. Only if the device property indicates it…
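
A quick way to check is to query the device properties at runtime. The sketch below assumes the `deviceOverlap` field of `cudaDeviceProp` as described in the programming guide for the toolkit versions discussed in this thread; later toolkits may report this capability through a different field.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // deviceOverlap is nonzero if the device can copy memory
        // between host and device while a kernel is executing.
        printf("Device %d (%s): concurrent copy and execution: %s\n",
               dev, prop.name, prop.deviceOverlap ? "yes" : "no");
    }
    return 0;
}
```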

Yeah cool, I got that. My card doesn’t support it… maybe an excuse to buy a new one ;) Just wondering now about overlapping the upload to the device with the readback from the device.

Oh! I would love to read more up on exactly which cards support this – I tried to google up some info but I didn’t find anything. Are we talking only super-high-end cards or G92-based cards or later or G200? Any hint would be useful! Thanks.

Sarnath was exaggerating a bit by saying “only a few”. In truth, “only a few” don’t support it.

Specifically, any G80 based card will not support concurrent copy and execution (8800 GTX, Tesla C870, and 8800 GTS (original)). ALL other cards G92 and newer do support it.

However, one important caveat (at least currently with CUDA 2.1), Vista users will never get concurrent copy and execution. From the release notes:

That’s a good catch. Thanks.

Ouch! Just as well I use Linux :D Ooo, so tempted to buy a GTX 285… a sweet beast.

On the concurrency of data transfers thing, it’s surprising that the docs don’t specifically cover it. The example given in the programming guide sets up 2 streams, each with a simple host-to-device copy, a kernel call and a device-to-host copy. They kinda gloss over exactly which operations will run concurrently. If data transfers in both directions do happen concurrently, then you could add another stream and reduce running time to 1/3 (theoretically). Not having a card that supports this, I can’t test it ;(
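
A rough back-of-envelope for the 1/3 figure, assuming (hypothetically) that the upload, the kernel and the readback for one chunk each take the same time t, that the work is split into n chunks, and that the hardware really can run all three stages at once:

\[
T_{\text{serial}} = 3nt, \qquad
T_{\text{pipelined}} = (n + 2)\,t, \qquad
\frac{T_{\text{pipelined}}}{T_{\text{serial}}} = \frac{n + 2}{3n} \to \frac{1}{3} \quad \text{as } n \to \infty
\]

So 1/3 would be the limit for many chunks rather than something you hit with three; with n = 3 the ratio is 5/9. If only copy and kernel can overlap, and the two copy directions cannot, the copies alone take 2nt, so under the same assumptions the best case is roughly 2/3.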

Pay attention: Vista doesn’t support async data transfers or kernel calls. It cost me 12 hours to figure that out.