I am currently running some work on multiple GPUs that I would like to synchronise directly on the GPU. As I understand it, events and streams are per GPU, so we need to device-synchronise on the host before issuing commands. Is that correct?

Please let me know if you have any best practices.

Dependencies between streams on different GPUs are no different from those between streams on the same GPU.

cudaStreamWaitEvent() will succeed even if the input stream and input event are associated with different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other, without a host-side device synchronization.
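To illustrate, here is a minimal sketch of a cross-device dependency, assuming a machine with at least two GPUs. The kernels `fill` and `add_one`, the buffer names, and the `CHECK` macro are hypothetical and only there for the example. Note that the event is recorded in a stream on its own device (recording it in a stream on another device is an error), while the wait is issued in a stream on the other device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical producer kernel: writes v into every element.
__global__ void fill(int *p, int n, int v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

// Hypothetical consumer kernel: increments every element.
__global__ void add_one(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main() {
    int ndev = 0;
    CHECK(cudaGetDeviceCount(&ndev));
    if (ndev < 2) {
        printf("This sketch needs at least two GPUs.\n");
        return 0;
    }

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Device 0: producer. Stream, event, and kernel all belong to device 0.
    CHECK(cudaSetDevice(0));
    cudaStream_t s0;
    cudaEvent_t done0;
    int *a0 = nullptr;
    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaEventCreateWithFlags(&done0, cudaEventDisableTiming));
    CHECK(cudaMalloc(&a0, bytes));
    fill<<<blocks, threads, 0, s0>>>(a0, n, 41);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(done0, s0));  // event and stream are on the same device

    // Device 1: consumer. Its stream waits on device 0's event.
    CHECK(cudaSetDevice(1));
    cudaStream_t s1;
    int *a1 = nullptr;
    CHECK(cudaStreamCreate(&s1));
    CHECK(cudaMalloc(&a1, bytes));
    CHECK(cudaStreamWaitEvent(s1, done0, 0));  // cross-device wait, no host sync
    CHECK(cudaMemcpyPeerAsync(a1, 1, a0, 0, bytes, s1));
    add_one<<<blocks, threads, 0, s1>>>(a1, n);
    CHECK(cudaGetLastError());

    // The host blocks only here, to read back one element.
    int first = 0;
    CHECK(cudaMemcpyAsync(&first, a1, sizeof(int), cudaMemcpyDeviceToHost, s1));
    CHECK(cudaStreamSynchronize(s1));
    printf("first element on device 1: %d\n", first);

    // Cleanup, with the owning device current for each resource.
    CHECK(cudaFree(a1));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(a0));
    CHECK(cudaEventDestroy(done0));
    CHECK(cudaStreamDestroy(s0));
    return 0;
}
```

The host thread never calls cudaDeviceSynchronize(); the ordering between the two GPUs is expressed entirely through the event. The event is created with cudaEventDisableTiming because it is used only for ordering, which is the usual recommendation for synchronization-only events.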

