Hi,
I can't get kernel execution and memory transfers to overlap effectively on a GTX Titan card. It's a real shame: every memory operation seems to be executed in order (doesn't this card have a GK110 chip?).
Anyway, I'm looking for a good strategy to get the best overlap in a typical scenario like this:
stream1 → memcpy_HtoD1; kernel_exec1; memcpy_DtoH1; …
stream2 → memcpy_HtoD2; kernel_exec2; memcpy_DtoH2; …
I have tried many different approaches, but none of them has worked.
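To make the scenario concrete, here is a minimal sketch of the pattern I mean. It is not my actual code: the kernel `scale` and the helpers `run_overlapped` and `copy_engine_count` are placeholder names made up for illustration, and the sketch assumes pinned host memory, non-default streams, and a breadth-first issue order (all HtoD copies, then all kernels, then all DtoH copies). Which issue order overlaps best may depend on the device, so treat this as one variant to try rather than the answer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): doubles each element in place.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Reports how many copy engines the device exposes.
// asyncEngineCount == 1: a copy can overlap a kernel.
// asyncEngineCount == 2: HtoD and DtoH copies can also overlap each other.
int copy_engine_count(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

// Splits n floats into one chunk per stream and runs
// HtoD copy -> kernel -> DtoH copy on each chunk.
// Returns true if every element came back doubled.
// Most error checks are omitted for brevity.
bool run_overlapped(int n, int nStreams) {
    if (n <= 0 || nStreams <= 0 || n % nStreams != 0) return false;
    const int chunk = n / nStreams;
    const size_t chunkBytes = (size_t)chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    // Async copies need page-locked (pinned) host memory to overlap with kernels.
    if (cudaMallocHost(&h, (size_t)n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d, (size_t)n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    std::vector<cudaStream_t> streams(nStreams);
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Breadth-first issue order: all uploads, then all kernels, then all downloads.
    for (int s = 0; s < nStreams; ++s)
        cudaMemcpyAsync(d + s * chunk, h + s * chunk, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
    for (int s = 0; s < nStreams; ++s)
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + s * chunk, chunk);
    for (int s = 0; s < nStreams; ++s)
        cudaMemcpyAsync(h + s * chunk, d + s * chunk, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);

    // Wait for all streams, then verify the results on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f * (float)i);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

Calling something like `run_overlapped(1 << 22, 2)` and looking at the timeline in a profiler should show whether the copies and kernels actually overlap. `copy_engine_count(0)` indicates whether the card can overlap HtoD and DtoH copies with each other at all, or only a copy with a kernel.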
Link to the Stack Overflow question: "What is the best strategy to overlap kernel execution and data transfers in a GTX Titan card?"