I cannot effectively overlap kernel execution and memory transfers on a GTX Titan card. It's really a shame: it seems that every memory operation gets executed in order, one after another (doesn't this card have a GK110 chip?).
Anyway, I'm looking for a good strategy to achieve the best overlap in a typical scenario like:
stream1 -> memcpy_HtoD1; kernel_exec1; memcpy_DtoH1; …
stream2 -> memcpy_HtoD2; kernel_exec2; memcpy_DtoH2; …
I have tried many different approaches to this but haven't had success with any of them.
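For reference, here is a minimal sketch of the two-stream pattern above. It is only an illustration under some assumptions, not the exact code from my project: the kernel `scale`, the buffer size, and the `CHECK` macro are hypothetical placeholders. It uses pinned host memory (`cudaMallocHost`), since `cudaMemcpyAsync` generally needs page-locked host buffers to run asynchronously, and it prints `asyncEngineCount` so the number of copy engines the device reports is visible.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for kernel_exec1 / kernel_exec2.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper macro: report the error and leave main().
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main()
{
    const int kStreams = 2;                 // stream1 and stream2
    const int n = 1 << 22;                  // elements per stream (assumed size)
    const size_t bytes = n * sizeof(float);

    // Report how many copy engines the device exposes.
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);

    float *h[kStreams];
    float *d[kStreams];
    cudaStream_t s[kStreams];

    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaMallocHost((void **)&h[i], bytes));  // pinned host buffer
        CHECK(cudaMalloc((void **)&d[i], bytes));
        CHECK(cudaStreamCreate(&s[i]));
        for (int j = 0; j < n; ++j) h[i][j] = 1.0f;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Issue HtoD copy, kernel, and DtoH copy per stream, one stream after the other.
    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]));
        scale<<<blocks, threads, 0, s[i]>>>(d[i], n, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]));
    }

    // Wait for all work in both streams to finish.
    CHECK(cudaDeviceSynchronize());
    printf("h[0][0] = %f, h[1][0] = %f\n", h[0][0], h[1][0]);

    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaStreamDestroy(s[i]));
        CHECK(cudaFree(d[i]));
        CHECK(cudaFreeHost(h[i]));
    }
    return 0;
}
```

Whether the copies and kernels in this sketch actually overlap can be checked on a timeline in the profiler. The issue order shown here (all operations of one stream, then all of the next) is just one option; issuing all HtoD copies first, then all kernels, then all DtoH copies is another ordering worth comparing.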
Link to the Stack Overflow question: