How can I share data between two GPUs?

In my previous version, I copied data from each GPU to the host, merged it there, and copied the result back to the two devices.
Is there a better way to share data/memory directly between the two devices?

Unfortunately, there is no better way.

With CUDA 2.2, you can improve the performance of this operation by allocating the merge buffer on the host with cudaHostAlloc() and the cudaHostAllocPortable flag, which gives you full pinned-memory transfer performance to both GPUs.
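A minimal sketch of that suggestion, under CUDA 2.2-era assumptions: the merge buffer is allocated as *portable* pinned memory, so every CUDA context (not just the allocating one) gets pinned-transfer bandwidth to it. The buffer size and variable names are illustrative; error checking is omitted for brevity, and in a real multi-GPU program each host thread would do its own cudaSetDevice() and transfers.

```cpp
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)  /* illustrative element count */

int main(void) {
    float *h_merge;  /* host merge buffer shared by both GPUs */

    /* cudaHostAllocPortable makes this pinned allocation usable at
       full pinned-memory speed from every CUDA context, not only the
       context that allocated it. */
    cudaHostAlloc((void **)&h_merge, N * sizeof(float),
                  cudaHostAllocPortable);

    /* Per-device work: bind to a GPU, allocate a device buffer, and
       copy the merged data down. The thread bound to the other GPU
       would do the same against the same h_merge buffer. */
    float *d_buf;
    cudaSetDevice(0);
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    cudaMemcpy(d_buf, h_merge, N * sizeof(float),
               cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_merge);
    return 0;
}
```

The point of the flag is that without it, pinned memory allocated in one context is treated as ordinary pageable memory by the other GPU's context, so one of the two copies falls back to the slower path.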

Can cudaMemcpyDeviceToDevice be used to copy data from one GPU’s global memory to another?

If yes, how fast is it? Is it faster than copying from device to host?

Unfortunately, it cannot. Device pointers are only valid within the thread context that called cudaMalloc(), and a host thread can only be associated with one device at a time, so there is no semantically correct way to hold pointers into two different devices in the same host thread.
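To make the constraint concrete, here is a sketch of the one-host-thread-per-GPU pattern this answer implies (pre-CUDA 4.0 rules). The thread function and variable names are illustrative, not from the original posts: each host thread binds itself to one device, and the pointer returned by its cudaMalloc() is meaningful only inside that thread's context.

```cpp
#include <cuda_runtime.h>
#include <pthread.h>
#include <stddef.h>

/* Worker run by one host thread per GPU (illustrative). */
static void *gpu_worker(void *arg) {
    int dev = *(int *)arg;
    cudaSetDevice(dev);      /* bind this host thread to one device */

    float *d_ptr;            /* valid ONLY in this thread's context */
    cudaMalloc((void **)&d_ptr, 1024 * sizeof(float));

    /* ... launch kernels and copy to/from a shared portable pinned
       host buffer here; passing d_ptr to the other thread and using
       it there would be invalid ... */

    cudaFree(d_ptr);
    return NULL;
}

int main(void) {
    pthread_t threads[2];
    int ids[2] = {0, 1};

    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], NULL, gpu_worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```

So the exchange between GPUs necessarily goes through a host buffer: each worker copies its data up, one side merges, and both copy back down, which is exactly the scheme the original question describes.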