As I understand it, the new NVIDIA 790i chipset supports broadcasting. So on a multi-GPU platform, a host->device memcpy of the same data to all devices should be possible with a single memcpy instead of one memcpy per GPU. Does CUDA support that?
And my second question:
What is the most efficient way to synchronize memory blocks across multiple GPUs?
For example, take a 3-GPU platform running a particle simulation where each GPU calculates one third of the particles: the first GPU works on the first third, the second GPU on the second third, and the third GPU on the last third. After each simulation frame they need to synchronize their memory blocks (because every particle depends on every other particle). By definition (if nothing has changed in this beta2 version), CUDA requires that each device be accessed from a different host thread. That means the host thread which communicates with, e.g., the first device cannot communicate with the other two, and so on. This leads to the conclusion that after every frame, when the devices send their partial memory blocks to their own host threads, the host threads must synchronize over a shared memory block and then copy the proper thirds of the block back to their devices.
How can that be optimized?
Is it possible for the host threads to update that data while the kernels are running?
If the kernels must be stopped, does that mean they must be loaded again when they have to be started for the next frame?
Can someone help?