Could someone help me achieve overlap between computation and transfers on a GTX Titan card?

I cannot effectively overlap kernel execution and memory transfers on a GTX Titan card. It's really a shame; it seems all memory operations are executed in order (doesn't this card have a GK110 chip?).

Anyway, I'm asking for a good strategy for achieving the best overlap in a typical scenario like:

stream1 -> memcpy_HtoD1; kernel_exec1; memcpy_DtoH1; …
stream2 -> memcpy_HtoD2; kernel_exec2; memcpy_DtoH2; …

I have tried many different approaches for this but haven't succeeded with any of them.

link to stackoverflow question:

I think your Titan can only overlap memcopies with kernels.
Try using one stream for transfers and another for kernels?
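
A minimal sketch of that suggestion, in case it helps: one stream carries all copies, a second carries all kernels, and events link each chunk's upload, kernel, and download. The kernel `scaleChunk`, the function `processChunks`, and the chunking scheme are placeholders invented for illustration, not taken from the original code, and the host buffer is assumed to be pinned. Whether this actually overlaps on a given Titan would need to be confirmed with a profiler.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical placeholder kernel: doubles each element in place.
__global__ void scaleChunk(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns from the enclosing function on the first CUDA error.
// Cleanup on the error path is omitted to keep the sketch short.
#define CHECK(call) \
    do { cudaError_t e_ = (call); if (e_ != cudaSuccess) return e_; } while (0)

// h_data is assumed to be pinned host memory holding nChunks * chunkLen floats,
// with nChunks > 0 and chunkLen > 0.
cudaError_t processChunks(float *h_data, int nChunks, int chunkLen) {
    size_t chunkBytes = (size_t)chunkLen * sizeof(float);
    float *d_data = nullptr;
    CHECK(cudaMalloc((void **)&d_data, chunkBytes * nChunks));

    cudaStream_t copyStream, kernelStream;
    CHECK(cudaStreamCreate(&copyStream));
    CHECK(cudaStreamCreate(&kernelStream));

    std::vector<cudaEvent_t> uploaded(nChunks), computed(nChunks);
    for (int i = 0; i < nChunks; ++i) {
        CHECK(cudaEventCreateWithFlags(&uploaded[i], cudaEventDisableTiming));
        CHECK(cudaEventCreateWithFlags(&computed[i], cudaEventDisableTiming));
    }

    int threads = 256;
    int blocks = (chunkLen + threads - 1) / threads;

    // Software pipeline: the upload of chunk i is queued before the download
    // of chunk i-1, so the copy stream is not stalled waiting for kernel i-1.
    for (int i = 0; i <= nChunks; ++i) {
        if (i < nChunks) {
            size_t off = (size_t)i * chunkLen;
            CHECK(cudaMemcpyAsync(d_data + off, h_data + off, chunkBytes,
                                  cudaMemcpyHostToDevice, copyStream));
            CHECK(cudaEventRecord(uploaded[i], copyStream));
        }
        if (i > 0) {
            int k = i - 1;
            size_t off = (size_t)k * chunkLen;
            // Kernel k waits only for its own upload.
            CHECK(cudaStreamWaitEvent(kernelStream, uploaded[k], 0));
            scaleChunk<<<blocks, threads, 0, kernelStream>>>(d_data + off, chunkLen);
            CHECK(cudaGetLastError());
            CHECK(cudaEventRecord(computed[k], kernelStream));
            // Download k waits only for kernel k.
            CHECK(cudaStreamWaitEvent(copyStream, computed[k], 0));
            CHECK(cudaMemcpyAsync(h_data + off, d_data + off, chunkBytes,
                                  cudaMemcpyDeviceToHost, copyStream));
        }
    }
    CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < nChunks; ++i) {
        cudaEventDestroy(uploaded[i]);
        cudaEventDestroy(computed[i]);
    }
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(kernelStream);
    return cudaFree(d_data);
}
```

With this issue order, the copy stream sees HtoD0, HtoD1, DtoH0, HtoD2, DtoH1, and so on, so kernel k can run while chunk k+1 is being uploaded.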


When I run the deviceQuery example from the SDK on my Titan, I get this:

Concurrent copy and kernel execution:          Yes with 1 copy engine(s)

My guess is that this means you cannot overlap two copies.
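
The same number can be read programmatically. The sketch below uses the runtime attribute `cudaDevAttrAsyncEngineCount`; the helper name `copyEngineCount` is made up for illustration. As far as I know, a value of 1 means a copy can overlap a kernel but only one copy direction runs at a time, while 2 means an upload and a download can also run concurrently.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of asynchronous copy engines
// reported for the given device, or -1 if the query fails.
int copyEngineCount(int device) {
    int count = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&count, cudaDevAttrAsyncEngineCount, device);
    return (err == cudaSuccess) ? count : -1;
}
```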

Since your CPU memory is already pinned, you could try implementing explicit copies in your kernels, that is, having the kernels read and write the pinned host memory directly, and stop using host-initiated memcopies. This assumes your kernels can overlap.
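
One way to read that suggestion is mapped pinned memory (zero-copy), where the kernel accesses host buffers through device pointers. This is only a sketch under that interpretation: `scaleMapped` and `scaleZeroCopy` are invented names, and the host buffers are assumed to have been allocated with `cudaHostAlloc` and the `cudaHostAllocMapped` flag (older setups may also need `cudaSetDeviceFlags(cudaDeviceMapHost)` before any other CUDA call).

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reads input from and writes output to mapped host
// memory, so the data crosses the bus as part of kernel execution.
__global__ void scaleMapped(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// h_in and h_out are assumed to be pinned, mapped host buffers of n floats.
cudaError_t scaleZeroCopy(float *h_in, float *h_out, int n, cudaStream_t stream) {
    float *d_in = nullptr, *d_out = nullptr;

    // Obtain device-side aliases of the mapped host buffers.
    cudaError_t err = cudaHostGetDevicePointer((void **)&d_in, h_in, 0);
    if (err != cudaSuccess) return err;
    err = cudaHostGetDevicePointer((void **)&d_out, h_out, 0);
    if (err != cudaSuccess) return err;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleMapped<<<blocks, threads, 0, stream>>>(d_in, d_out, n);
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Results in h_out are only safe to read after the kernel has finished.
    return cudaStreamSynchronize(stream);
}
```

Launching such kernels in different streams lets the transfers of one overlap the computation of another, provided the kernels themselves can run concurrently. Each element should be touched only once with coalesced accesses, since every access goes over PCIe.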