Copying a cudaArray from one device to another

Hi!
My program has the following structure:

  1. copying some rendering data (a cudaArray*) from GPU 1 to GPU 2 with cudaMemcpyArrayToArray
  2. executing kernels in parallel on both GPUs
  3. copying the output from GPU 2 back to GPU 1
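
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual program: the array dimensions (2048 x 1600 float4 texels, which comes to 50 MiB), the kernel `placeholderKernel`, the linear output buffers, and the use of cudaMemcpyPeer for step 3 are all assumptions made up for the example. It also assumes the cross-device cudaMemcpyArrayToArray call is accepted by the runtime, as described in this post; recent CUDA toolkits mark that function as deprecated, so the compiler may warn about it.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Hypothetical stand-in for the real rendering kernels.
__global__ void placeholderKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        std::printf("This sketch needs two GPUs.\n");
        return 0;
    }

    // Assumed sizes: 2048 * 1600 * sizeof(float4) = 50 MiB of array data.
    const size_t width = 2048, height = 1600;
    const size_t arrayBytes = width * height * sizeof(float4);
    const int n = 1 << 20;  // assumed number of output elements
    const size_t outBytes = n * sizeof(float);

    // One cudaArray and one output buffer per device.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray_t arr[2];
    float *out[2];
    for (int d = 0; d < 2; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMallocArray(&arr[d], &desc, width, height));
        CHECK(cudaMalloc(&out[d], outBytes));
    }
    // Buffer on device 0 that receives device 1's output in step 3.
    float *gathered = nullptr;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&gathered, outBytes));

    // Step 1: copy the array from device 0 to device 1 and time it on the host.
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpyArrayToArray(arr[1], 0, 0, arr[0], 0, 0, arrayBytes,
                                 cudaMemcpyDeviceToDevice));
    for (int d = 0; d < 2; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaDeviceSynchronize());
    }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("array copy took %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    // Step 2: launch on both devices, then wait for both to finish.
    for (int d = 0; d < 2; ++d) {
        CHECK(cudaSetDevice(d));
        placeholderKernel<<<(n + 255) / 256, 256>>>(out[d], n);
        CHECK(cudaGetLastError());
    }
    for (int d = 0; d < 2; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaDeviceSynchronize());
    }

    // Step 3: copy device 1's output back to device 0.
    CHECK(cudaMemcpyPeer(gathered, 0, out[1], 1, outBytes));

    // Release everything on the device that owns it.
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(gathered));
    for (int d = 0; d < 2; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaFree(out[d]));
        CHECK(cudaFreeArray(arr[d]));
    }
    return 0;
}
```

The host-side timer around step 1 is only there to put a number on the copy; it includes the synchronization of both devices.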

The problem is that cudaMemcpyArrayToArray is extremely slow, even though the data is only about 50 MB.
Why is it so slow, and is there a way to copy a cudaArray from one device to another without cudaMemcpyArrayToArray?
Thanks!