I've got the following program structure:
- copying some rendering data (cudaArray*) from GPU 1 to GPU 2 with cudaMemcpyArrayToArray
- parallel execution of kernels on both GPUs
- copying the output from the 2nd GPU back to the 1st
The problem is that cudaMemcpyArrayToArray is extremely slow (the data is only about 50 MB).
Why is it so slow, and is there a way to copy a cudaArray from one device to another without cudaMemcpyArrayToArray?