There’s no call to wait on multiple events at once, but you can simply synchronize on them one after another; each call returns once its event has completed. However, keep in mind that to avoid deadlock we require that the event be recorded before you try to query or synchronize on it.
Hm, how do I ensure this condition in a multi-threaded application? The event is recorded in one thread and waited on in another. What I actually need is a synchronization mechanism between host threads on the completion of some GPU work, such as a kernel execution or an async memcpy. My idea was to have something like this:
T1:
...some CPU work
issue some GPU work in stream SX
record CUDA eventX in SX (cudaEventRecord)
...
T2:
issue some GPU work in stream SY
record CUDA eventY in SY (cudaEventRecord)
T3:
cudaEventSynchronize(eventX)
cudaEventSynchronize(eventY)
counterEventX++
counterEventY++
...some CPU or GPU work
P.S. Is there a way to capture every occurrence of an event eventX? Or is there something like a fast atomic increment of a host variable from a CUDA stream? I’m looking for a way to implement a semaphore that signals when some (asynchronously issued) GPU work has completed.