I have some questions about the sync API in CUDA, mostly because the documentation is not very clear (or I did not find a clear one!). Note that I am using the driver API, but it should be the same for the runtime API.
cuEventSynchronize: the docs say this: “Waits until the completion of all work currently captured in event.” I understand that this blocks the host thread, but on what stream does it wait? The creation of the event (cuEventRecord) has a param for the stream. So, is the stream used to create the event somehow stored in the event itself, so that it ‘knows’ the stream on which to wait?
cudaEventCreate ‘creates’ events. cudaEventRecord ‘records’ an event, and recording an event is most likely what you meant.
cudaEventRecord is put into the specified stream like any other asynchronous operation. So it is the opposite: the stream is not stored in the event; rather, the recording of the event is put into the stream. As soon as the stream's execution reaches that point (not when the stream has finished or is empty), the event activates.
cudaEventSynchronize blocks the host thread, cudaStreamWaitEvent blocks a stream (or puts the block into a stream).
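To make the "record is just another item in the stream" idea concrete, here is a minimal sketch using the runtime API. The CHECK macro, the addOne kernel, and the names afterFirst and stream are made up for this illustration and are not from any CUDA sample.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: adds 1.0f to every element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *d = nullptr;
    cudaStream_t stream;
    cudaEvent_t afterFirst;

    CHECK(cudaMalloc(&d, n * sizeof(float)));
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&afterFirst));  // creating the event involves no stream
    CHECK(cudaMemsetAsync(d, 0, n * sizeof(float), stream));

    addOne<<<blocks, threads, 0, stream>>>(d, n);  // work A
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(afterFirst, stream));    // marker enqueued after A
    addOne<<<blocks, threads, 0, stream>>>(d, n);  // work B, enqueued after the marker
    CHECK(cudaGetLastError());

    // Blocks the host thread until the stream has reached the marker,
    // i.e. A is done. B may or may not have finished at this point.
    CHECK(cudaEventSynchronize(afterFirst));
    std::printf("event completed: work enqueued before the record is done\n");

    // Blocks the host thread until everything in the stream, including B, is done.
    CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CHECK(cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost));
    std::printf("d[0] = %.1f\n", first);

    CHECK(cudaEventDestroy(afterFirst));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d));
    return 0;
}
```

This should build with something like `nvcc event_sync.cu -o event_sync` (the file name is arbitrary). The point is only the ordering: cudaEventSynchronize returns once the stream has passed the record, not once the stream is empty.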
Thanks for the reply! Yes, I meant cuEventRecord! I think now I understand how it works.
Still, an extra question: the ‘cuEventQuery’ docs state that the status can be “not finished”, and in that case one can wait with cuEventSynchronize. Can it also be used with cuStreamWaitEvent? Is the behavior the same?
Here is how I would put some short docs for the sync APIs in CUDA:
…
cuEventSynchronize(CUevent hEvent)
blocks the host thread until ‘hEvent’ is activated.
the stream where the event was recorded is not relevant to this call, because the API does not reference it.
cuStreamWaitEvent(CUstream hStream, CUevent hEvent)
blocks all new work on ‘hStream’ until ‘hEvent’ is activated.
the stream where the event was recorded is not relevant to this call, because ‘hStream’ potentially refers to a different stream, not the stream which recorded ‘hEvent’.
cuStreamSynchronize(CUstream hStream)
blocks the host thread until the tasks already submitted to the stream are completed.
most likely the events recorded on this stream are also activated by then, but they are not relevant to this call.
…
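For what it is worth, the three calls summarized above, plus cuEventQuery, can be tried together in a small driver API program. This is only a sketch under my own assumptions: the names producer, consumer, filled, and the CHECK_CU macro are invented here, and a memset stands in for real work so that no kernel module has to be loaded.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda.h>

// Hypothetical helper for this sketch: abort on any driver API error.
#define CHECK_CU(call)                                                \
    do {                                                              \
        CUresult res_ = (call);                                       \
        if (res_ != CUDA_SUCCESS) {                                   \
            const char *name_ = nullptr;                              \
            cuGetErrorName(res_, &name_);                             \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         name_ ? name_ : "unknown error");            \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    CHECK_CU(cuInit(0));
    CUdevice dev;
    CHECK_CU(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CHECK_CU(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK_CU(cuCtxSetCurrent(ctx));

    CUstream producer, consumer;
    CHECK_CU(cuStreamCreate(&producer, CU_STREAM_NON_BLOCKING));
    CHECK_CU(cuStreamCreate(&consumer, CU_STREAM_NON_BLOCKING));
    CUevent filled;
    CHECK_CU(cuEventCreate(&filled, CU_EVENT_DISABLE_TIMING));

    const size_t n = 1 << 20;
    CUdeviceptr dbuf;
    CHECK_CU(cuMemAlloc(&dbuf, n * sizeof(unsigned int)));
    std::vector<unsigned int> host(n, 0u);

    // Producer stream: fill the buffer, then enqueue the record after the fill.
    CHECK_CU(cuMemsetD32Async(dbuf, 42u, n, producer));
    CHECK_CU(cuEventRecord(filled, producer));

    // Non-blocking poll: the fill may or may not have finished yet.
    CUresult status = cuEventQuery(filled);
    if (status == CUDA_ERROR_NOT_READY) {
        std::printf("cuEventQuery: not finished yet\n");
    } else {
        CHECK_CU(status);
        std::printf("cuEventQuery: already finished\n");
    }

    // Whatever the poll said, the consumer stream can be told to wait on the
    // event. The host does not block here; only work enqueued on 'consumer'
    // after this call is held back until the fill is done.
    CHECK_CU(cuStreamWaitEvent(consumer, filled, 0));
    CHECK_CU(cuMemcpyDtoHAsync(host.data(), dbuf, n * sizeof(unsigned int), consumer));

    CHECK_CU(cuEventSynchronize(filled));     // host waits for the fill only
    CHECK_CU(cuStreamSynchronize(consumer));  // host waits for the copy as well
    std::printf("host[0] = %u\n", host[0]);

    CHECK_CU(cuMemFree(dbuf));
    CHECK_CU(cuEventDestroy(filled));
    CHECK_CU(cuStreamDestroy(producer));
    CHECK_CU(cuStreamDestroy(consumer));
    CHECK_CU(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

This needs to be linked against the driver library, for example `nvcc sync_driver.cu -o sync_driver -lcuda` (file name arbitrary). Neither cuEventSynchronize nor cuStreamWaitEvent takes the recording stream as an argument; the only place the producer stream appears is in cuEventRecord.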
The host thread (in this case) waits until the stream in which this specific event was recorded reaches the record operation, whichever stream that is. So the correct stream has to record the event at the correct moment within the stream.
Do you know binary semaphores? Those operations implement a binary semaphore logic.
cudaEventRecord is the signal operation, often called V.
cudaStreamWaitEvent is the wait operation, often called P.
OK, I think I am now close to understanding how CUDA sync works.
That is what I was thinking when I said that the stream does not matter in that case, because there is no stream param to the sync function. The API will check all streams for an event and then block on that.
Regarding cudaStreamWaitEvent: this API takes a stream and an event as params but, most of the time when it is called, they are NOT related. Here is an example from the CUDA samples (streamOrder…p2p.cu):
First they activate the event like this:
checkCudaErrors(cudaEventRecord(waitOnStream1, stream1));
Note that the event is recorded on stream1.
Then later they wait for the event ‘waitOnStream1’, NOT on stream1 where it was recorded, but on stream2, with a cudaStreamWaitEvent call that takes stream2 and waitOnStream1.
This is how we use this API most of the time: record the event on stream1, but then stop any work on stream2 until the event is ‘seen’. It makes sense because, for example, if the event is tied to a copy and stream2 runs a kernel, you want to make sure that the data is ready before the kernel is started.
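The copy-then-kernel case can be sketched as follows. The names stream1, stream2, and waitOnStream1 are borrowed from the sample quoted above; everything else (the CHECK macro, the doubleAll kernel, the buffer sizes) is an assumption made for this illustration and is not the sample's actual code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles every element.
__global__ void doubleAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    cudaStream_t stream1, stream2;
    cudaEvent_t waitOnStream1;

    CHECK(cudaMallocHost(&h, bytes));  // pinned memory, so the copies can be asynchronous
    CHECK(cudaMalloc(&d, bytes));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    CHECK(cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking));
    CHECK(cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking));
    CHECK(cudaEventCreateWithFlags(&waitOnStream1, cudaEventDisableTiming));

    // stream1: upload the input, then mark the point where the upload is done.
    CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream1));
    CHECK(cudaEventRecord(waitOnStream1, stream1));

    // stream2: do not start anything enqueued after this call until the
    // upload on stream1 has completed. The host thread does not block here.
    CHECK(cudaStreamWaitEvent(stream2, waitOnStream1, 0));
    doubleAll<<<(n + 255) / 256, 256, 0, stream2>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream2));

    // The host blocks only here, until stream2 (kernel and download) is done.
    CHECK(cudaStreamSynchronize(stream2));
    std::printf("h[0] = %.1f\n", h[0]);

    CHECK(cudaEventDestroy(waitOnStream1));
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return 0;
}
```

Without the cudaStreamWaitEvent line the two non-blocking streams would be unordered with respect to each other, so the kernel could start before the upload has finished.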
For me this was a great conversation, thank you.
BTW, I think NVIDIA should step in and clarify all this with 1/2 page of documentation. It would help a lot!
If the event were recorded on the same stream that waits on it, the stream would wait on itself, which either is always satisfied (a no-op) or would block the stream indefinitely.
Without knowing how it works or seeing examples, the documentation is not very clear, I agree.
There are some fine details, such as what happens when an event is first recorded and then waited for, or under what conditions the event is rearmed/reused.
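One way to explore those details is to poll the event at different moments and print what comes back. The sketch below does that with the runtime API; the report helper, the CHECK macro, and the buffer size are made up for this illustration. As far as I understand the documentation, a query on an event that was never recorded reports success (there is no captured work to wait for), and a second cudaEventRecord replaces what the event previously captured, but treat both as something to verify on your own setup rather than as a guarantee.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical helper: print the status returned by cudaEventQuery.
static void report(const char *when, cudaError_t status) {
    std::printf("%-28s -> %s\n", when, cudaGetErrorName(status));
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, so the memsets take some time
    void *d = nullptr;
    cudaStream_t stream;
    cudaEvent_t ev;

    CHECK(cudaMalloc(&d, bytes));
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&ev));

    // The event exists but has never been recorded.
    report("never recorded", cudaEventQuery(ev));

    // First use: record after some work, poll, then block the host on it.
    CHECK(cudaMemsetAsync(d, 0, bytes, stream));
    CHECK(cudaEventRecord(ev, stream));
    report("right after first record", cudaEventQuery(ev));
    CHECK(cudaEventSynchronize(ev));
    report("after cudaEventSynchronize", cudaEventQuery(ev));

    // Reuse: record the same event again behind new work.
    CHECK(cudaMemsetAsync(d, 1, bytes, stream));
    CHECK(cudaEventRecord(ev, stream));
    report("right after second record", cudaEventQuery(ev));
    CHECK(cudaStreamSynchronize(stream));
    report("after cudaStreamSynchronize", cudaEventQuery(ev));

    CHECK(cudaEventDestroy(ev));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d));
    return 0;
}
```

The two "right after ... record" lines depend on timing: they can show either cudaErrorNotReady or cudaSuccess, depending on how quickly the device finishes the memset.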