Synchronize streams on multiple GPUs without blocking the CPU


I am looking for a way to synchronize streams running on different GPUs without blocking the CPU. I am using OptiX to trace an image on multiple GPUs, but after the trace has finished I need to copy the results to an OpenGL texture for display, which has to be done on the main thread and on only one GPU. At the moment I am using cuStreamSynchronize, but that blocks my main thread. So my idea was to use events instead, but that doesn't seem to work for some reason. What I did was the following:
For each GPU/stream create an event:
cuEventCreate( &streamEvent[gpuIdx], CU_EVENT_DISABLE_TIMING);

Then, in the render loop, using one CPU thread per GPU:
for each GPU launch a thread running:
launch optixTrace on stream[gpuIdx]
cuEventRecord( streamEvent[gpuIdx], stream[gpuIdx])

on the main thread do
for each GPU
cuStreamWaitEvent( stream[0], streamEvent[gpuIdx], CU_EVENT_WAIT_DEFAULT)

do more work on stream[0]

The idea was that the main GPU stream waits for all the other GPU streams to finish before doing some more work just on the main GPU (in my case, converting and copying the data to a Vulkan/OpenGL interop texture and passing it to DLSS). But for some reason cuStreamWaitEvent doesn't seem to wait for the work on the other GPUs to complete, giving me mixed frames.

Does anyone have an idea what I might be doing wrong?

You are using multiple threads. Do you ensure that cuEventRecord has been called before another thread calls cuStreamWaitEvent?

Good point, but I would say yes. I am using a tbb::parallel_for over the number of available GPUs. cuEventRecord is called inside the parallel_for, and the parallel_for should not return before all the work has been issued (that is, written to the stream). I quickly added some prints, and for 2 GPUs the order was what I would expect:

Event recorded on device 0 for frame 584
Event recorded on device 1 for frame 584
Waited for Event for frame 584 to finish on device 0
Waited for Event for frame 584 to finish on device 1
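
For reference, here is a minimal host-side sketch of why the record-before-wait order matters. As far as I understand the CUDA documentation, cuStreamWaitEvent waits on whatever the most recent cuEventRecord captured at the time of the call, and an event that has never been recorded represents an empty set of work. The sketch makes no CUDA calls: FakeEvent, record_event, wait_event and render_frame are hypothetical stand-ins, and std::thread plus join is assumed to play the role of tbb::parallel_for returning.

```cpp
// Host-side model of the record-before-wait requirement.
// No CUDA calls are made; all names below are hypothetical stand-ins.
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for a CUevent: remembers the frame captured by the latest record.
struct FakeEvent {
    std::atomic<int> captured_frame{-1};  // -1 means "never recorded"
};

// Stand-in for cuEventRecord: overwrites the previously captured state.
void record_event(FakeEvent& ev, int frame) {
    ev.captured_frame.store(frame);
}

// Stand-in for cuStreamWaitEvent: snapshots the state captured at call time.
int wait_event(const FakeEvent& ev) {
    return ev.captured_frame.load();
}

// One frame: one worker thread per GPU records its event, then the main
// thread waits on every event. Returns true if every wait observed this
// frame's record rather than a stale one.
bool render_frame(std::vector<FakeEvent>& events, int frame) {
    std::vector<std::thread> workers;
    for (std::size_t gpu = 0; gpu < events.size(); ++gpu) {
        workers.emplace_back([&events, gpu, frame] {
            // The real code would launch the trace on stream[gpu] here.
            record_event(events[gpu], frame);
        });
    }

    // Join before waiting; this models the parallel loop having returned.
    for (auto& w : workers) {
        w.join();
    }

    bool all_current = true;
    for (const auto& ev : events) {
        all_current = all_current && (wait_event(ev) == frame);
    }
    return all_current;
}
```

Calling render_frame(events, frame) once per frame on a std::vector<FakeEvent> with one entry per GPU returns true as long as the join comes before the waits. If a wait were issued before a worker's record, it would observe the previous frame's value (or -1 on the first frame), which in the real API would mean the main stream does not wait for the current frame's work.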