Could someone help me achieve overlap between computation and transfers on a GTX Titan card?

Hi,
I cannot effectively overlap kernel execution and memory transfers on a GTX Titan card. It's really a shame; it seems every memory operation is executed in order (doesn't this card have a GK110 chip?).

Anyway, I'm asking for a good strategy to achieve the best overlap in a typical scenario like:

stream1 → memcpy_HtoD1; kernel_exec1; memcpy_DtoH1; …
stream2 → memcpy_HtoD2; kernel_exec2; memcpy_DtoH2; …

I have tried many different approaches but haven't succeeded with any of them.

Link to the Stack Overflow question:
cuda - What is the best strategy to overlap kernel execution and data transfers in a GTX Titan card? - Stack Overflow

I think your Titan can only overlap memcopies with kernels.
Try using one stream for transfers and another for kernels?
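
A minimal sketch of that suggestion, in case it helps. Everything here is an assumption for illustration and not taken from the original code: the `scale` kernel stands in for the real kernels, the chunk count and size are arbitrary, and the function name is made up. One stream carries all copies, the other all kernels, and events express the dependencies. The upload of the next chunk is queued before the download of the current one, so the single copy engine has work to do while a kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kChunks = 4;   // hypothetical number of independent work items
constexpr int kN = 1 << 20;  // hypothetical number of floats per chunk

// Placeholder kernel standing in for kernel_exec1, kernel_exec2, ...
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the number of wrong results, or -1 if CUDA setup or execution failed.
int run_two_stream_pipeline() {
    const size_t bytes = kN * sizeof(float);
    float *h = nullptr, *d = nullptr;
    // Pinned host memory is required for cudaMemcpyAsync to be truly asynchronous.
    if (cudaMallocHost(&h, kChunks * bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d, kChunks * bytes) != cudaSuccess) { cudaFreeHost(h); return -1; }
    for (int i = 0; i < kChunks * kN; ++i) h[i] = 1.0f;

    cudaStream_t copy, compute;
    cudaStreamCreate(&copy);
    cudaStreamCreate(&compute);
    cudaEvent_t uploaded[kChunks], computed[kChunks];
    for (int c = 0; c < kChunks; ++c) {
        cudaEventCreateWithFlags(&uploaded[c], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&computed[c], cudaEventDisableTiming);
    }

    // Upload chunk 0 before entering the loop.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, copy);
    cudaEventRecord(uploaded[0], copy);

    for (int c = 0; c < kChunks; ++c) {
        // Queue the next upload first so it can run while kernel c executes.
        if (c + 1 < kChunks) {
            cudaMemcpyAsync(d + (c + 1) * kN, h + (c + 1) * kN, bytes,
                            cudaMemcpyHostToDevice, copy);
            cudaEventRecord(uploaded[c + 1], copy);
        }
        // Kernel c may start only after chunk c has been uploaded.
        cudaStreamWaitEvent(compute, uploaded[c], 0);
        scale<<<(kN + 255) / 256, 256, 0, compute>>>(d + c * kN, kN, 2.0f);
        cudaEventRecord(computed[c], compute);
        // The download of chunk c may start only after kernel c has finished.
        cudaStreamWaitEvent(copy, computed[c], 0);
        cudaMemcpyAsync(h + c * kN, d + c * kN, bytes,
                        cudaMemcpyDeviceToHost, copy);
    }

    int bad = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : -1;
    if (bad == 0)
        for (int i = 0; i < kChunks * kN; ++i)
            if (h[i] != 2.0f) ++bad;

    for (int c = 0; c < kChunks; ++c) {
        cudaEventDestroy(uploaded[c]);
        cudaEventDestroy(computed[c]);
    }
    cudaStreamDestroy(copy);
    cudaStreamDestroy(compute);
    cudaFree(d);
    cudaFreeHost(h);
    return bad;
}

int main() {
    printf("result: %d\n", run_two_stream_pipeline());
    return 0;
}
```

Whether the copies and kernels actually overlap is best checked in a profiler timeline; the sketch only sets up an issue order that permits it.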

Hello,

When I run the deviceQuery example from the SDK on my Titan, I get this:

Concurrent copy and kernel execution:          Yes with 1 copy engine(s)

My guess is that this means you cannot overlap two copies with each other.
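
The same information can be read programmatically from `cudaDeviceProp::asyncEngineCount`. A small sketch follows; the helper name `copy_engine_count` is made up for illustration. As far as the runtime documentation goes, a value of 1 means a copy can overlap with a kernel, and a value of 2 means copies in both directions can additionally overlap with each other.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the number of copy engines reported for `device`, or -1 on error.
int copy_engine_count(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.asyncEngineCount;
}

int main() {
    int engines = copy_engine_count(0);
    if (engines < 0) {
        printf("could not query device 0\n");
    } else {
        printf("copy engines: %d\n", engines);
        printf("copy/kernel overlap: %s\n", engines >= 1 ? "yes" : "no");
        printf("HtoD/DtoH overlap:   %s\n", engines >= 2 ? "yes" : "no");
    }
    return 0;
}
```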

Since your CPU memory is already pinned, you could try implementing explicit copies in your kernels and stop using host-initiated memcopies. This assumes your kernels can overlap.
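
Roughly what I have in mind, as a sketch only. It assumes the pinned buffers are allocated as mapped (zero-copy) memory with `cudaHostAlloc` and `cudaHostAllocMapped`, which may differ from how your buffers are allocated today. The kernel `scale_mapped`, the buffer size, and the function name are placeholders. The kernel reads its input from host memory and writes its output to host memory directly over PCIe, so no `cudaMemcpy` is issued at all.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: reads and writes mapped host memory directly.
__global__ void scale_mapped(const float *in, float *out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Returns the number of wrong results, or -1 if CUDA setup or execution failed.
int run_zero_copy_demo() {
    const int n = 1 << 20;  // hypothetical buffer size
    const size_t bytes = n * sizeof(float);

    // Allow mapping of pinned host memory into the device address space.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return -1;

    float *h_in = nullptr, *h_out = nullptr;
    if (cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocMapped) != cudaSuccess)
        return -1;
    if (cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocMapped) != cudaSuccess) {
        cudaFreeHost(h_in);
        return -1;
    }
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Device-side aliases of the mapped host buffers.
    float *d_in = nullptr, *d_out = nullptr;
    cudaHostGetDevicePointer((void **)&d_in, h_in, 0);
    cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

    scale_mapped<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);

    // The host may read h_out only after the kernel has finished.
    int bad = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : -1;
    if (bad == 0)
        for (int i = 0; i < n; ++i)
            if (h_out[i] != 2.0f) ++bad;

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return bad;
}

int main() {
    printf("result: %d\n", run_zero_copy_demo());
    return 0;
}
```

Each element here is touched exactly once with coalesced accesses, which is the friendly case for this approach. If a kernel reads the same data several times, staging it in device memory first is likely to be faster.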