Threads sharing CUDA events

I am trying to get two pthreads to share CUDA events between them.
One is a producer thread that records a CUDA event indicating that data is ready to read.
The other is a consumer thread that uses CUDA streams.
Ideally, the producer thread should record an event which the consumer thread waits for using cudaStreamWaitEvent.
Right now the consumer thread never waits for the event before it starts to process the data buffer.
Is there something fundamentally flawed in this approach, or am I missing something?
Thanks for your help!

cudaStreamWaitEvent won’t cause a CPU thread to wait on anything.

If you want a CPU thread to wait on the completion of an event, you should use cudaEventSynchronize()
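
The difference matters here. A minimal sketch of the contrast (stream and event names are placeholders, not from your code):

cudaEventRecord(evt, producerStream);     // capture the work queued so far in producerStream

// Device-side ordering only: work issued into consumerStream after this call
// will not run until evt completes, but the calling CPU thread returns immediately.
cudaStreamWaitEvent(consumerStream, evt, 0);

// Host-side wait: this call blocks the calling CPU thread until evt completes.
cudaEventSynchronize(evt);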

So I tried cudaEventSynchronize() and it behaved the same way.
The event gets recorded some time after the scheduler has already queued up the stream work in the consumer thread, so I think what is happening is what this note from the runtime API header describes:

/* If cudaEventRecord() has not been called on event, cudaSuccess is
   returned immediately. */

Either that or the threads are not sharing the same context. How do I check that?
Thanks!

You can’t record an event after you check for it. That won’t work. The event has to be created and recorded before you check for it or wait on it.

The threads are sharing the same context. If they weren’t you’d get all sorts of errors.
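
If you want to verify it anyway, the driver API’s cuCtxGetCurrent() returns the context current to the calling thread. A minimal sketch (the helper function name is mine, purely illustrative); call it from each pthread and compare the printed handles:

#include <cstdio>
#include <cuda.h>            // driver API; link with -lcuda
#include <cuda_runtime.h>

void printCurrentContext(const char *who)
{
  cudaFree(0);               // force runtime initialization in this thread
  CUcontext ctx = NULL;
  cuCtxGetCurrent(&ctx);     // context current to this CPU thread
  printf("%s: context = %p\n", who, (void *)ctx);
}

Threads using the runtime API on the same device should print the same handle, since they share the device’s primary context.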

The event is created before I create the producer pthread.
I have also delayed the consumer thread with a nanosleep so that the event has been recorded before I first synchronize on it, and I still see the same behavior.
Perhaps you can help with my misunderstanding. This is the way I want it to work:

Main thread creates event0.
Producer thread is created.
Main thread sleeps.
Producer thread records event0.
Main thread loops over the following:
Main thread wakes up and calls cudaEventSynchronize(event0). 
Because the event has been recorded, event0 is cleared and the asynchronous calls in the Main thread start.
End loop.
During this time the producer thread is recording event0 periodically. After each record, I would like the main thread to wake up and do its thing, then go back to sleep.

What I see in the profiler timeline is that the scheduler queues up several cycles of the main thread with no pause in between.

I don’t think that’s enough information. In any event, the most likely possibility IMO is that you have an asynchrony between the producer and main threads that allows the main thread to hit a cudaEventSynchronize(event0) before the producer thread has recorded it.

Of course, it’s unclear what your producer thread is doing. If it is recording event0 and there is no previous CUDA activity in the stream that event0 was recorded into, then the event will complete immediately; this would probably also be indistinguishable from your observation.
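
To make that concrete, a self-contained sketch: with no prior work in the stream, cudaEventQuery() should report cudaSuccess essentially right away.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaStream_t stream;
  cudaEvent_t ev;
  cudaStreamCreate(&stream);
  cudaEventCreate(&ev);

  cudaEventRecord(ev, stream);          // stream is idle, so the event has nothing to wait for
  cudaError_t st = cudaEventQuery(ev);  // expected: cudaSuccess almost immediately
  printf("cudaEventQuery: %s\n", cudaGetErrorString(st));

  cudaEventDestroy(ev);
  cudaStreamDestroy(stream);
  return 0;
}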

Right now I am simulating what the producer thread will do, which is receive data from a NIC and buffer it into host memory. The producer thread does this:

void *producerFcn(void *)
{
  struct timespec period;
  struct timespec remain;
  int retCode = 0;
  cudaSetDevice(0);                      // make device 0 current in this thread
  for (int cnt = 0; cnt < NUM_CYCLES; cnt++)
  {
    period.tv_sec = 0;
    period.tv_nsec = PRODUCER_PERIOD;
    retCode = nanosleep(&period, &remain);

    // check return code here and sleep the remaining time if needed ....

    // signal "data ready" by recording the event into streams[2]
    cudaEventRecord(inputReady, streams[2]);
  }
  return NULL;                           // pthread start routines must return a value
}

Here “inputReady” is a global cudaEvent_t created before the producer thread exists, and streams[2] is a global cudaStream_t that was also created during the main thread’s initialization.

The main thread does this in a loop:

for (cycleIdx = 0; cycleIdx < NUM_CYCLES; cycleIdx++)
{
  cudaEventSynchronize(inputReady);                // wait for the producer's "data ready" signal
  kernel1<<<..., streams[0]>>>(args);              // pipeline of data has already been filled for this cycle
  kernel2<<<..., streams[0]>>>(args);
  cudaMemcpyAsync(deviceInputPtr, hostInputPtr,    // fill data for next cycle
                  sizeof(float) * dataLength,
                  cudaMemcpyHostToDevice, streams[2]);
  cudaMemcpyAsync(hostOutputPtr, deviceOutputPtr,  // copy results back to host
                  sizeof(float) * dataLength,
                  cudaMemcpyDeviceToHost, streams[1]);
}

There are multiple buffers on the input and output so that there are no race conditions, and there are events within the main thread and cudaStreamWaitEvent() calls so that the memcpys do not happen before they are supposed to (see the sketch below).
I know there is an asynchrony between the threads; that is exactly why I would like the main thread to wait for the producer event (inputReady) to occur.
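
For example, the kernel-to-copy dependency looks roughly like this (the event name here is illustrative):

cudaEvent_t kernelsDone;                          // created once during initialization
kernel2<<<..., streams[0]>>>(args);
cudaEventRecord(kernelsDone, streams[0]);         // this record has real prior work in its stream
cudaStreamWaitEvent(streams[1], kernelsDone, 0);  // device-side: hold the D2H copy until the kernels finish
cudaMemcpyAsync(hostOutputPtr, deviceOutputPtr,
                sizeof(float) * dataLength,
                cudaMemcpyDeviceToHost, streams[1]);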

It certainly looks to me like the cudaEventSynchronize can happen before the cudaEventRecord.

I would point out again that trying to use a cudaEventRecord for thread synchronization makes no sense when there is no previous CUDA activity issued into the stream that the cudaEventRecord is issued into (streams[2]).

Prior to the cudaEventRecord in the producer thread, if the main thread does a cudaEventSynchronize on inputReady, it will simply continue.

At the moment the cudaEventRecord is issued into streams[2] by the producer thread, that CUDA event will complete immediately, because there is no prior CUDA activity issued into that stream.

The net of this is that any time the main thread encounters cudaEventSynchronize on inputReady, it will simply proceed. It will never stop there. Which seems to be your observation. It doesn’t surprise me.

The mechanism looks broken to me. Since you’re not issuing any CUDA activity into streams[2] before the record, my suggestion would be to use ordinary CPU-side thread synchronization to make the main thread wait on the producer thread.
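
For example, a minimal sketch of that kind of CPU-side handshake using a POSIX semaphore (identifiers are illustrative, not from your code):

#include <semaphore.h>

sem_t inputReadySem;             // sem_init(&inputReadySem, 0, 0) during startup

// producer thread, once per period, after the host buffer has been filled:
sem_post(&inputReadySem);        // wake the main thread for one cycle

// main thread, at the top of each cycle, in place of cudaEventSynchronize(inputReady):
sem_wait(&inputReadySem);        // blocks until the producer posts
// ... then issue this cycle's kernels and cudaMemcpyAsync calls as before.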

Thanks, txbob. From the profiler, using the extensions, it seemed that these event records were happening right after the timer went off. The consumer’s CUDA calls were scheduled several milliseconds before the events occurred, so the CUDA calls issued without waiting. This looked to me like a bug in the event synchronization between pthreads.
Yesterday I switched to semaphores instead of the CUDA event mechanism, and it works much better that way; i.e., the scheduler is held off by the semaphore so that only one cycle’s worth of CUDA calls is issued at a time.
Anyway, case closed.