CUDA & OpenMP: Synchronizing asynchronous operations on multiple GPUs

Dear all,

I am trying to write an FDTD code for multiple GPUs using CUDA and OpenMP, in a manner similar to the OpenMP+CUDA example in the SDK: I spawn one CPU thread per GPU and assign each GPU a chunk of the total computational domain based on the ID of its controlling CPU thread.
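To make the setup concrete, here is a minimal sketch of the thread-per-GPU binding and the 1D domain split (the domain size `nz` and the z-slab decomposition are illustrative assumptions, not taken from my actual code):

```cuda
#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int nz = 512;                 // illustrative total domain size in z
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    omp_set_num_threads(num_gpus);      // one CPU thread per GPU
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);             // bind this thread to its own GPU

        // Decompose along z: each GPU owns nz/num_gpus slices plus ghosts.
        int nz_local = nz / num_gpus;   // assume nz divisible for simplicity
        int z0 = tid * nz_local;        // first slice owned by this GPU
        printf("thread %d -> GPU %d, slices [%d, %d)\n",
               tid, tid, z0, z0 + nz_local);
        // ... allocate device buffers for this slab plus ghost layers ...
    }
    return 0;
}
```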
To minimize the overhead of exchanging ghost nodes between GPUs, I proceed as follows on each GPU: first I compute the ghost nodes, then I start an asynchronous memory transfer of those nodes from the GPU to the host, and finally I launch the kernel that computes the rest of the subdomain. This hides the slow PCI-Express transfer behind the kernel's work on the large remaining domain.
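The per-step overlap pattern I described could be sketched like this (kernel names, buffers, and launch configurations are placeholders; `h_ghost` would need to be page-locked via `cudaMallocHost` for the copy to actually be asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void ghost_kernel(float *field);   // updates the ghost layers
__global__ void bulk_kernel(float *field);    // updates the interior

// One time step on one GPU, executed inside the per-thread region.
void step(float *d_field, float *d_ghost, float *h_ghost,
          size_t ghost_bytes, cudaStream_t ghost_stream,
          cudaStream_t bulk_stream, dim3 ghost_grid, dim3 bulk_grid,
          dim3 block)
{
    // 1. Update the ghost layers first, in their own stream.
    ghost_kernel<<<ghost_grid, block, 0, ghost_stream>>>(d_field);

    // 2. Start the device-to-host copy of the ghosts; it runs in
    //    stream order, i.e. after the ghost kernel has finished.
    cudaMemcpyAsync(h_ghost, d_ghost, ghost_bytes,
                    cudaMemcpyDeviceToHost, ghost_stream);

    // 3. Launch the bulk update in a second stream so the big kernel
    //    overlaps with the PCI-Express transfer in step 2.
    bulk_kernel<<<bulk_grid, block, 0, bulk_stream>>>(d_field);

    // 4. The owning thread must eventually wait for its own transfer
    //    before the ghost data can be exchanged with neighbours.
    cudaStreamSynchronize(ghost_stream);
}
```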

The above is a standard practice for getting the most out of a multi-GPU code and has been described in various papers (if people are interested, I can give references later).

The problem is that I use streams for these asynchronous operations, and each CPU thread controlling a GPU has its own streams. When I need to exchange the data coming from different GPUs, I must synchronize the streams to make sure the memory transfers are complete. Here is the difficulty: how can I synchronize a stream that was created on another GPU and is controlled by another CPU thread?
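To illustrate the constraint (a sketch of one commonly used workaround, not a claim about my actual code): since a stream can only be waited on from the thread that owns its context, each thread could synchronize its own ghost stream and then all threads could meet at an OpenMP barrier, after which every host-side ghost buffer is known to be complete:

```cuda
#pragma omp parallel
{
    // ... per-GPU ghost kernel + cudaMemcpyAsync into this thread's
    //     page-locked host buffer, as in the step described above ...

    cudaStreamSynchronize(ghost_stream);  // wait only for MY transfer

    #pragma omp barrier                   // all threads' transfers done

    // After the barrier it is safe to read the neighbours' host
    // buffers and upload the received ghosts to this GPU.
}
```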
Is there a way around this? If someone can help, I can post some pseudo-code to give a better idea.

Any help would be greatly appreciated!!!

Thanks in advance,