How to detect an async event without polling

Hi. I’m trying to find a way to detect an async event without host-CPU polling. The NVIDIA CUDA GPU Computing SDK includes an AsyncAPI sample project (please see below).

As you can see, the last part is CPU polling to detect the recording of the event. Is there a more efficient way, for example associating the async event with an event handler or callback function?
Ultimately, we’d like to detect the completion of individual GPU kernels (in this example, increment_kernel) without polling. We thought of using cudaThreadSynchronize(), but when we asynchronously run multiple GPU kernels, cudaThreadSynchronize() waits until all kernels are complete, rather than an individual one.

I appreciate your comment.

// create cuda event handles
cudaEvent_t start, stop;
cutilSafeCall( cudaEventCreate(&start) );
cutilSafeCall( cudaEventCreate(&stop) );

unsigned int timer;
cutilCheckError(  cutCreateTimer(&timer)  );
cutilCheckError(  cutResetTimer(timer)    );
cutilSafeCall( cudaThreadSynchronize() );
float gpu_time = 0.0f;

// asynchronously issue work to the GPU (all to stream 0)
cutilCheckError( cutStartTimer(timer) );
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);
cutilCheckError( cutStopTimer(timer) );

// have CPU do some work while waiting for stage 1 to finish
unsigned long int counter=0;
while( cudaEventQuery(stop) == cudaErrorNotReady )
{
    counter++;
}

There is a cudaEventSynchronize() function which blocks until a particular event has completed. I don’t think there is any functionality like select() to wait on multiple CUDA events simultaneously and return when any one of them finishes.
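
For a single event, the pattern would look something like this (a rough sketch with placeholder names; the kernel and stream are whatever you already use):

    cudaEvent_t done;
    cudaEventCreate(&done);

    my_kernel<<<blocks, threads, 0, stream>>>(d_a, value); // placeholder kernel launch
    cudaEventRecord(done, stream);      // enqueue the event behind the kernel in its stream

    cudaEventSynchronize(done);         // blocks this host thread on just this event
    cudaEventDestroy(done);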

Thank you, Seibert.
Unfortunately, cudaEventSynchronize() waits and blocks on one particular event. If we want to detect events individually, I’m afraid I have to spawn a thread per event and call cudaEventSynchronize() inside each thread. That adds overhead, but do you see any other approach?

Our goal is to detect the completion of multiple GPU kernels, so another approach I can think of is to launch each GPU kernel in a different stream and spawn a thread per kernel. In each thread, we would call the blocking cudaStreamSynchronize() instead of waiting on an event.

Is there any significant resource difference between the event and stream approaches? Either way, it seems that I need to create a thread per GPU kernel or event (we’re using Windows, and this approach sounds too complex). Since there is no select() or any sort of function that returns a single event completion from multiple events, that’s the only approach I can think of…

If you have a simpler approach, I’d appreciate your feedback.

I’m not even sure this works, given the association of a CUDA context with a specific host thread. If you submit a bunch of kernels to the device on one host thread, then spawn several more host threads to block on different events, I don’t think the CUDA functions will work correctly in those additional threads. I think polling may be your only option to detect completion of many events independently.

If you have an NVIDIA developer account, you should file a feature request (that’s possible, right?) explaining your need for a cudaEventSelect() function. It does not appear that there is any way for us to create our own version of such a function.

Hi Seibert,

I found an interesting sample project, called “CUDA Context Thread Management”, in the NVIDIA GPU Computing SDK. In threadMigration.cpp, I can see that we can create a Windows thread per CUDA kernel. Pretty neat.

It looks like we can create Windows threads within a CUDA project, so we can use such a thread to detect the completion of an event (via cudaEventSynchronize()).

Here is what I’m thinking of implementing:

  1. Start the GPU kernel (asynchronous, non-blocking) and record an event

    gpu_kernel_i<<<blocks, threads, size, stream_i>>>(parameters); // non-blocking
    cudaEventRecord(kernel_i_done_event, stream_i);                // non-blocking

  2. Create a Windows thread and call the blocking cudaEventSynchronize() inside it

    void ThreadProc()
    {
        cudaEventSynchronize(kernel_i_done_event); // blocks until the event completes

        // send a Windows event notification to the upper level of the API or GUI
    } // exit thread

Do you think this is feasible? Another possibility is:

  1. Start the GPU kernel (asynchronous, non-blocking) in its own stream

    gpu_kernel_i<<<blocks, threads, size, stream_i>>>(parameters); // non-blocking

  2. Create a Windows thread and call the blocking cudaStreamSynchronize() inside it

    void ThreadProc()
    {
        cudaStreamSynchronize(stream_i); // blocks until all operations in the stream complete

        // send a Windows event notification to the upper level of the API or GUI
    } // exit thread

I’m not sure which is the more efficient approach: event or stream partitioning. A fleshed-out sketch of the first option follows below.
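
To make the first option concrete, here is a rough, untested sketch of what I have in mind (the struct and all names are mine; error checking is omitted, and it assumes the per-thread context issue Seibert raised can be handled):

    #include <windows.h>
    #include <process.h>
    #include <cuda_runtime.h>

    // Hypothetical per-kernel context handed to each waiter thread.
    struct WaitCtx {
        cudaEvent_t done;    // CUDA event recorded after the kernel in its stream
        HANDLE      notify;  // Win32 event signaled toward the upper API/GUI layer
    };

    unsigned __stdcall WaitThreadProc(void* arg)
    {
        WaitCtx* ctx = (WaitCtx*)arg;
        cudaEventSynchronize(ctx->done); // blocks until the CUDA event completes
        SetEvent(ctx->notify);           // notify the upper level of the API or GUI
        return 0;                        // exit thread
    }

    // Launch side, per kernel i:
    //   gpu_kernel_i<<<blocks, threads, size, stream_i>>>(parameters);
    //   cudaEventRecord(ctx->done, stream_i);
    //   _beginthreadex(NULL, 0, WaitThreadProc, ctx, 0, NULL);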

Why not try your own polling loop? I know you said “I don’t want to use polling” but it’s still the most obvious solution.

Polling gives you flexibility in deciding what you want to wait for (including multiple streams, multiple GPUs, timeouts, even host-side events), plus the nice side benefit of (much!) lower host-side CPU use if your kernels aren’t tiny. The disadvantage of manual polling is extra host-side latency if your kernels are fast, say 2 milliseconds or less. But since you’re launching many streams, even that latency may be hidden: you have multiple queues feeding the GPU anyway, so the device is not going to idle.
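
For example, the select()-like behavior you wanted is just a loop. An untested sketch of my own (not a CUDA API); Sleep() is the Win32 call, use usleep()/nanosleep() elsewhere:

    // Returns the index of the first event found complete, scanning all n
    // events and sleeping between scans to keep host CPU use low.
    int pollAnyEvent(cudaEvent_t* events, int n, DWORD sleepMs)
    {
        for (;;) {
            for (int i = 0; i < n; ++i) {
                if (cudaEventQuery(events[i]) == cudaSuccess)
                    return i; // event i has completed
            }
            Sleep(sleepMs);
        }
    }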

Hello SPWorley,

Unfortunately, we plan to have many GPU kernels running simultaneously, and we need our host CPU to focus on more important decision-making tasks than polling (this is a continuous video-processing application).

Looking at the threadMigration project (GPU Computing SDK), it looks like the right approach is the CUDA context management API. Using this API, we can associate a kernel’s context with a Windows or Linux thread object, so we do not really need CUDA events at all. We can use WaitForMultipleObjects() to wait on the thread signals. All the answers were there…
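
On Windows, the waiting side would look roughly like this (a sketch under my own assumptions; each worker thread calls cudaStreamSynchronize(stream_i) and then SetEvent(doneEvents[i]), and NUM_KERNELS is hypothetical):

    HANDLE doneEvents[NUM_KERNELS]; // one Win32 event per kernel, created with CreateEvent()

    // Main thread: wakes as soon as ANY kernel finishes -- the select()-like wait.
    DWORD r = WaitForMultipleObjects(NUM_KERNELS, doneEvents,
                                     FALSE /* wait for any, not all */, INFINITE);
    int finished = (int)(r - WAIT_OBJECT_0); // index of the kernel that completed

Note that WaitForMultipleObjects() is limited to MAXIMUM_WAIT_OBJECTS (64) handles per call.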

Manual polling uses less CPU than the CUDA synchronize functions, which tightly spin a CPU thread as fast as possible to minimize latency in reacting to kernel completion.

SPWorley,
Thank you for your feedback. This indeed changes the whole story. Can you direct me to a reference doc, paper, or other active threads that show the CPU overhead of the synchronize functions? I am new to the GPU world.

The programming guide and CUDA reference just say the functions “block until completion” and don’t explicitly discuss the 100% CPU use of the synchronize() functions, likely because it’s something that could change in the future.

A simple but effective manual poll uses events and cudaEventQuery() to test, something similar to:

    myKernel<<<a, b>>>(myArgs);
    cudaEventRecord(event, 0);

    do {
        msleep(50); /* some number of milliseconds significantly shorter than your expected kernel run time */
    } while (cudaErrorNotReady == cudaEventQuery(event));

The decision in manual polling here is the sleep time… if you have very short kernels (a millisecond or less), it’s tough to sleep a thread for that short a time efficiently.

A longer sleep uses less CPU (though CPU use is effectively 0% for anything more than roughly 10 ms), but the tradeoff is latency in reacting to kernel completion, on the order of the sleep duration, which may or may not matter to your app.

Isn’t this behavior controlled by the cudaSetDeviceFlags() function? It looks like you have a choice between spin, yield (presumably inside a polling loop), and blocking on a CPU synchronization primitive. I haven’t seen anyone benchmark the latency of these, so I’m not sure how the last two compare.
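
If so, something like this might give blocking waits without the spin (an untested sketch; the flags must be set before the context is created, i.e., before any other runtime call on the thread):

    cudaSetDeviceFlags(cudaDeviceBlockingSync); // alternatives: cudaDeviceScheduleSpin, cudaDeviceScheduleYield
    cudaSetDevice(0);

    // ... launch kernels, cudaEventRecord(stop, 0) ...

    cudaEventSynchronize(stop); // with blocking sync, should sleep on an OS primitive instead of spinning

There is also cudaEventCreateWithFlags() with cudaEventBlockingSync, which would give the same blocking behavior on a per-event basis.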