I am currently running some multi-GPU work that I would like to synchronise directly on the GPUs. As I understand it, events and streams are per GPU, so we need to synchronise the devices on the host before issuing commands. Is that correct?
Please let me know if you have any best practices.
cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.
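As a rough illustration of that statement, here is a minimal sketch in which a stream on device 1 waits for an event recorded on device 0, with no cudaDeviceSynchronize() on the host in between. It assumes at least two GPUs, numbered 0 and 1. The kernel names (fill, addOne), the function name crossDeviceSyncDemo, the CHECK macro and the buffer size are made up for this example and are not part of the CUDA API.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: bail out of the demo with -2 on any CUDA error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return -2;                                                \
        }                                                             \
    } while (0)

// Illustrative kernels (names are assumptions, not from the original post).
__global__ void fill(int *buf, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = value;
}

__global__ void addOne(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1;
}

// Returns the number of wrong elements (0 on success),
// -1 if fewer than two devices are present, -2 on a CUDA error.
int crossDeviceSyncDemo() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) return -1;

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *buf0 = nullptr, *buf1 = nullptr;
    cudaStream_t s0, s1;
    cudaEvent_t produced;

    // Resources owned by device 0: buffer, stream, and the event.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&buf0, bytes));
    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaEventCreateWithFlags(&produced, cudaEventDisableTiming));

    // Resources owned by device 1: buffer and stream.
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&buf1, bytes));
    CHECK(cudaStreamCreate(&s1));

    // Producer: device 0 fills its buffer, then records the event in s0.
    // The event and the stream it is recorded in must be on the same device.
    CHECK(cudaSetDevice(0));
    fill<<<blocks, threads, 0, s0>>>(buf0, n, 41);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(produced, s0));

    // Consumer: the stream on device 1 waits for the device-0 event.
    // The wait is enqueued in s1; the host thread does not block here.
    CHECK(cudaSetDevice(1));
    CHECK(cudaStreamWaitEvent(s1, produced, 0));
    CHECK(cudaMemcpyPeerAsync(buf1, 1, buf0, 0, bytes, s1));
    addOne<<<blocks, threads, 0, s1>>>(buf1, n);
    CHECK(cudaGetLastError());

    // Only now does the host wait, and only to read the result back.
    std::vector<int> host(n);
    CHECK(cudaMemcpyAsync(host.data(), buf1, bytes,
                          cudaMemcpyDeviceToHost, s1));
    CHECK(cudaStreamSynchronize(s1));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 42) ++mismatches;
    }

    // Cleanup, each resource released with its owning device current.
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaFree(buf1));
    CHECK(cudaSetDevice(0));
    CHECK(cudaEventDestroy(produced));
    CHECK(cudaStreamDestroy(s0));
    CHECK(cudaFree(buf0));
    return mismatches;
}
```

Calling crossDeviceSyncDemo() from main() and checking for a return value of 0 should be enough to try it out. The points to note are that cudaEventRecord() is issued with the event and stream on the same device, while cudaStreamWaitEvent() is the call that crosses devices, and that the current device is switched with cudaSetDevice() before each kernel launch.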