There is no implicit barrier at the end of a kernel function or in a cudaMemcpy(). Instead, kernels, cudaMemcpy()s and cudaMemcpyAsync()s in the same stream are executed sequentially with respect to each other, but (with the exception of cudaMemcpy()) asynchronously to the host code.
So you cannot assume all kernels have finished at the end of a host function (which, by the way, has no special meaning to CUDA); you need to explicitly put a cudaStreamSynchronize() or cudaDeviceSynchronize() there.
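A minimal sketch of that pattern (the kernel name myKernel is made up for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(int *d_out) { *d_out = 42; }

void hostFunction(int *d_out)
{
    // The launch returns immediately; the kernel runs asynchronously.
    myKernel<<<1, 1>>>(d_out);

    // Without this, the host code continues before the kernel has finished.
    cudaDeviceSynchronize();
}
```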
Yes, that looks like a good barrier assuming you run CUDA 4.1, each device is handled by one OpenMP thread, and you want to sync all devices and all threads. If you just want to sync the GPUs without caring about the host threads, you can use cudaStreamWaitEvent().
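A sketch of how cudaStreamWaitEvent() chains two streams without blocking the host (kernel and stream names are illustrative, single device assumed):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[0] = 1.0f; }
__global__ void kernelB(float *d) { d[0] += 1.0f; }

int main(void)
{
    float *d;
    cudaMalloc(&d, sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

    kernelA<<<1, 1, 0, s1>>>(d);
    cudaEventRecord(ev, s1);

    // s2 will not start kernelB until ev has completed on s1;
    // the host thread is never blocked by this call.
    cudaStreamWaitEvent(s2, ev, 0);
    kernelB<<<1, 1, 0, s2>>>(d);

    cudaDeviceSynchronize();
    cudaEventDestroy(ev);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d);
    return 0;
}
```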
I am a little bit confused: if there is no implicit barrier when cudaMemcpy() is executed, it means I would have to call some other sync command to be sure that the data has arrived on the host. But cudaMemcpy() is a blocking call; nothing else in the host code can execute until it is done, so it does act as a barrier.
This means it should work something like this:
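(The original snippet is missing from the post; a minimal sketch of the presumed pattern, with a hypothetical kernel myKernel and one OpenMP thread per device:)

```cuda
#include <cuda_runtime.h>
#include <omp.h>
#include <stdlib.h>

__global__ void myKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);

    #pragma omp parallel num_threads(numDevices)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        float *h_data = (float *)calloc(n, sizeof(float));
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

        // The blocking cudaMemcpy() waits for the kernel in the same
        // (default) stream to finish before copying back.
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        // All host threads stop here; every device has finished its work.
        #pragma omp barrier

        cudaFree(d_data);
        free(h_data);
    }
    return 0;
}
```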
and it will have all threads stop at the barrier, with all the kernels completed by the time the cudaMemcpy() has finished.
cudaMemcpy() also blocks all execution of the code, even in streams, so it should never be used together with streams; one should rather use cudaMemcpyAsync() with the streams.
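A sketch of the stream-friendly variant (hypothetical kernel scale; note that cudaMemcpyAsync() only overlaps with pinned host memory):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *h_data, *d_data;

    // Page-locked (pinned) host memory is required for truly
    // asynchronous transfers.
    cudaMallocHost(&h_data, n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy, kernel, and copy-back are all queued on the same stream
    // and execute in order, without blocking the host.
    cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // The host must synchronize before reading h_data.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```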