Threads sharing CUDA events

I am trying to get two pthreads to share CUDA events between them.
One is a producer thread that records a CUDA event indicating that data is ready to read.
The other is a consumer thread that uses CUDA streams.
Ideally, the producer thread should record an event which the consumer thread waits for using cudaStreamWaitEvent.
Right now the consumer thread never waits for the event before it starts to process the data buffer.
Is there something fundamentally flawed in this approach, or am I missing something?
Thanks for your help!

cudaStreamWaitEvent won’t cause a CPU thread to wait on anything.

If you want a CPU thread to wait on the completion of an event, you should use cudaEventSynchronize()
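
The difference matters here. A minimal sketch of the contrast (stream and event names are placeholders, not from your code):

cudaEventRecord(evt, producerStream);     // capture the work queued so far in producerStream

// Device-side ordering only: work issued into consumerStream after this call
// will not run until evt completes, but the calling CPU thread returns immediately.
cudaStreamWaitEvent(consumerStream, evt, 0);

// Host-side wait: this call blocks the calling CPU thread until evt completes.
cudaEventSynchronize(evt);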

So I tried cudaEventSynchronize() and it behaved the same way.
The event gets recorded some time after the scheduler has already queued up the stream work in the consumer thread, so I think what is happening is what this note from the runtime API header describes:

/* If cudaEventRecord() has not been called on event, cudaSuccess is
   returned immediately. */

Either that or the threads are not sharing the same context. How do I check that?
Thanks!

You can’t record an event after you check for it. That won’t work. The event has to be created and recorded before you check for it or wait on it.

The threads are sharing the same context. If they weren’t you’d get all sorts of errors.
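
If you want to verify it anyway, the driver API’s cuCtxGetCurrent() returns the context current to the calling thread. A minimal sketch (the helper function name is mine, purely illustrative); call it from each pthread and compare the printed handles:

#include <cstdio>
#include <cuda.h>            // driver API; link with -lcuda
#include <cuda_runtime.h>

void printCurrentContext(const char *who)
{
  cudaFree(0);               // force runtime initialization in this thread
  CUcontext ctx = NULL;
  cuCtxGetCurrent(&ctx);     // context current to this CPU thread
  printf("%s: context = %p\n", who, (void *)ctx);
}

Threads using the runtime API on the same device should print the same handle, since they share the device’s primary context.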

The event is created before I create the producer pthread.
I have also delayed the consumer thread with a nanosleep so that the event has been recorded before I first synchronize on it, and I still see the same behavior.
Perhaps you can help with my misunderstanding. This is the way I want it to work:

Main thread creates event0.
Producer thread is created.
Main thread sleeps.
Producer thread records event0.
Main thread loops over the following:
Main thread wakes up and calls cudaEventSynchronize(event0). 
Because the event has been recorded, event0 is cleared and the asynchronous calls in the Main thread start.
End loop.
During this time the producer thread is recording event0 periodically. After each record, I would like the main thread to wake up and do its thing, then go back to sleep.

What I see in the profiler timeline is that the scheduler queues up several cycles of the main thread with no pause in between.

I don’t think that’s enough information. In any event, the most likely possibility IMO is that you have an asynchrony between the producer and main threads that allows the main thread to hit a cudaEventSynchronize(event0) before the producer thread has recorded it.

Of course, it’s unclear what your producer thread is doing. If it is recording event0 and there is no previous CUDA activity in the stream that event0 was recorded into, then the event will complete immediately; this would probably also be indistinguishable from your observation.
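
To make that concrete, a self-contained sketch: with no prior work in the stream, cudaEventQuery() should report cudaSuccess essentially right away.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaStream_t stream;
  cudaEvent_t ev;
  cudaStreamCreate(&stream);
  cudaEventCreate(&ev);

  cudaEventRecord(ev, stream);          // stream is idle, so the event has nothing to wait for
  cudaError_t st = cudaEventQuery(ev);  // expected: cudaSuccess almost immediately
  printf("cudaEventQuery: %s\n", cudaGetErrorString(st));

  cudaEventDestroy(ev);
  cudaStreamDestroy(stream);
  return 0;
}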

Right now I am simulating what the producer thread will do, which is receive data from a NIC and buffer it into host memory. The producer thread does this:

void *producerFcn(void *)
{
  struct timespec period;
  struct timespec remain;
  int retCode = 0;
  cudaSetDevice(0);                      // make device 0 current in this thread
  for (int cnt = 0; cnt < NUM_CYCLES; cnt++)
  {
    period.tv_sec = 0;
    period.tv_nsec = PRODUCER_PERIOD;
    retCode = nanosleep(&period, &remain);

    // check return code here and sleep the remaining time if needed ....

    // signal "data ready" by recording the event into streams[2]
    cudaEventRecord(inputReady, streams[2]);
  }
  return NULL;                           // pthread start routines must return a value
}

Here “inputReady” is a global cudaEvent_t created before the producer thread exists, and streams[2] is a global cudaStream_t that was also created during the main thread’s initialization.

The main thread does this in a loop:

for (cycleIdx = 0; cycleIdx < NUM_CYCLES; cycleIdx++)
{
  cudaEventSynchronize(inputReady);                // wait for the producer's "data ready" signal
  kernel1<<<..., streams[0]>>>(args);              // pipeline of data has already been filled for this cycle
  kernel2<<<..., streams[0]>>>(args);
  cudaMemcpyAsync(deviceInputPtr, hostInputPtr,    // fill data for next cycle
                  sizeof(float) * dataLength,
                  cudaMemcpyHostToDevice, streams[2]);
  cudaMemcpyAsync(hostOutputPtr, deviceOutputPtr,  // copy results back to host
                  sizeof(float) * dataLength,
                  cudaMemcpyDeviceToHost, streams[1]);
}

There are multiple buffers on the input and output so that there are no race conditions, and there are events within the main thread and cudaStreamWaitEvent() calls so that the memcpys do not happen before they are supposed to (see the sketch below).
I know there is an asynchrony between the threads; that is exactly why I would like the main thread to wait for the producer event (inputReady) to occur.
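
For example, the kernel-to-copy dependency looks roughly like this (the event name here is illustrative):

cudaEvent_t kernelsDone;                          // created once during initialization
kernel2<<<..., streams[0]>>>(args);
cudaEventRecord(kernelsDone, streams[0]);         // this record has real prior work in its stream
cudaStreamWaitEvent(streams[1], kernelsDone, 0);  // device-side: hold the D2H copy until the kernels finish
cudaMemcpyAsync(hostOutputPtr, deviceOutputPtr,
                sizeof(float) * dataLength,
                cudaMemcpyDeviceToHost, streams[1]);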

It certainly looks to me like the cudaEventSynchronize can happen before the cudaEventRecord.

I would point out again that trying to use a cudaEventRecord for thread synchronization makes no sense when there is no previous CUDA activity issued into the stream that the cudaEventRecord is issued into (streams[2]).

Prior to the cudaEventRecord in the producer thread, if the main thread does a cudaEventSynchronize on inputReady, it will simply continue.

At the moment the cudaEventRecord is issued into streams[2] by the producer thread, that CUDA event will complete immediately, because there is no prior CUDA activity issued into that stream.

The net of this is that any time the main thread encounters cudaEventSynchronize on inputReady, it will simply proceed. It will never stop there. Which seems to be your observation. It doesn’t surprise me.

The mechanism looks broken to me. Since you’re not issuing any CUDA activity into streams[2] before the record, my suggestion would be to use ordinary CPU-side thread synchronization to make the main thread wait on the producer thread.
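
For example, a minimal sketch of that kind of CPU-side handshake using a POSIX semaphore (identifiers are illustrative, not from your code):

#include <semaphore.h>

sem_t inputReadySem;             // sem_init(&inputReadySem, 0, 0) during startup

// producer thread, once per period, after the host buffer has been filled:
sem_post(&inputReadySem);        // wake the main thread for one cycle

// main thread, at the top of each cycle, in place of cudaEventSynchronize(inputReady):
sem_wait(&inputReadySem);        // blocks until the producer posts
// ... then issue this cycle's kernels and cudaMemcpyAsync calls as before.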

Thanks, txbob. From the profiler, using the extensions, it seemed that these event records were happening right after the timer went off. The consumer’s CUDA calls were scheduled several milliseconds before the events occurred, so the CUDA calls issued without waiting. This looked to me like a bug in the event synchronization between pthreads.
Yesterday I switched to semaphores instead of the CUDA event mechanism, and it works much better that way; i.e., the scheduler is held off by the semaphore so that only one cycle’s worth of CUDA calls is issued at a time.
Anyway, case closed.