asyncEngineCount and peer-to-peer copies

If I have two devices (A and B) in a Tesla S2050 box, with asyncEngineCount == 2, and these devices have peer access mutually enabled, can I perform simultaneous copies from A to B and from B to A?

Maybe I’m missing it in the documentation…

TIA, Vedran

The P2P interface supports that, although we intentionally make this kind of overlap difficult so that P2P transfers behave more sanely the rest of the time. Basically, the rule for P2P is:

  • one of the copy engines will always be used for P2P transfers (I forget which), regardless of whether you are reading from or writing to a remote GPU
  • we will never ever use the other copy engine for a P2P transfer

so, you probably have to do

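// destination comes first; the 0/1 suffix is the device that owns the buffer or stream
cudaMemcpyAsync(dstBuf1, srcBuf0, size, cudaMemcpyDefault, stream0);  // device 0 -> device 1
cudaMemcpyAsync(dstBuf0, srcBuf1, size, cudaMemcpyDefault, stream1);  // device 1 -> device 0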

in order to get overlap
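
For anyone who wants to try this, here is a minimal, self-contained sketch of the two calls above with the setup around them. It assumes the two GPUs are devices 0 and 1, that unified virtual addressing is available (needed for cudaMemcpyDefault), and an arbitrary 64 MB buffer size; the CHECK macro is a hypothetical helper, not part of CUDA. Whether the two copies actually overlap depends on the hardware and driver, so treat it as something to profile rather than a guarantee.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

int main() {
    const size_t size = 64u << 20;  // assumed buffer size: 64 MB

    // Both directions of peer access must be possible.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    if (!can01 || !can10) {
        fprintf(stderr, "peer access between devices 0 and 1 is not available\n");
        return EXIT_FAILURE;
    }

    void *srcBuf0, *dstBuf0, *srcBuf1, *dstBuf1;
    cudaStream_t stream0, stream1;

    // Device 0: enable access to device 1, allocate buffers, create its stream.
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaMalloc(&srcBuf0, size));
    CHECK(cudaMalloc(&dstBuf0, size));
    CHECK(cudaStreamCreate(&stream0));

    // Device 1: same setup, mirrored.
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));
    CHECK(cudaMalloc(&srcBuf1, size));
    CHECK(cudaMalloc(&dstBuf1, size));
    CHECK(cudaStreamCreate(&stream1));

    // One copy per direction, each issued in a stream owned by a different device.
    CHECK(cudaMemcpyAsync(dstBuf1, srcBuf0, size, cudaMemcpyDefault, stream0));  // 0 -> 1
    CHECK(cudaMemcpyAsync(dstBuf0, srcBuf1, size, cudaMemcpyDefault, stream1));  // 1 -> 0

    // Wait for both copies to finish.
    CHECK(cudaStreamSynchronize(stream0));
    CHECK(cudaStreamSynchronize(stream1));
    printf("both P2P copies completed\n");

    // Clean up per device.
    CHECK(cudaSetDevice(0));
    CHECK(cudaStreamDestroy(stream0));
    CHECK(cudaFree(srcBuf0));
    CHECK(cudaFree(dstBuf0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaFree(srcBuf1));
    CHECK(cudaFree(dstBuf1));
    return 0;
}
```

If UVA is not available, cudaMemcpyPeerAsync(dst, dstDevice, src, srcDevice, size, stream) takes the device IDs explicitly and can be used in place of the cudaMemcpyDefault calls.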

Thanks tmurray, that solves my problem!
And if I understand correctly the copy engines part of your answer, two simultaneous transfers from a device - one to a peer and one to the host - should also be possible.