Event Handling Between a GPU Kernel Thread and OpenACC Offloaded Functions (Kernels)

Hello, I am new to OpenACC and GPU programming on the NVIDIA Jetson AGX Xavier. I am developing an application that includes a GPU kernel thread (using CUDA) and a function offloaded to the GPU using the OpenACC kernels pragma. I need to synchronize both functions using events. Is there an event handling mechanism that can be used between them?

cudaEventRecord should work correctly after an OpenACC parallel or kernels region. Once you have that event recorded, you should be able to use e.g. cudaEventSynchronize() or cudaStreamWaitEvent() to synchronize as desired.
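For a concrete picture, here is a minimal sketch (untested) of that pattern, assuming the NVHPC OpenACC runtime on CUDA; `acc_get_cuda_stream()` and the async queue number are the interop pieces, and `process()` and the loop body are hypothetical placeholders:

```cpp
// Minimal sketch (untested), assuming the NVHPC OpenACC runtime on CUDA.
// acc_get_cuda_stream() returns the CUDA stream backing an async queue,
// so an event recorded on that stream marks completion of the region.
#include <openacc.h>
#include <cuda_runtime.h>

void process(float *a, int n)   // hypothetical example function
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    // Launch the region asynchronously on OpenACC async queue 1.
    #pragma acc kernels async(1) copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0f;           // placeholder for the real work

    // Record an event on the CUDA stream that backs async queue 1.
    cudaStream_t s = (cudaStream_t)acc_get_cuda_stream(1);
    cudaEventRecord(done, s);

    // ... unrelated host work can proceed here ...

    cudaEventSynchronize(done); // blocks until the region has finished
    cudaEventDestroy(done);
}
```

Note that without the async clause the region is synchronous, so an event recorded afterwards is already complete by the time you wait on it; the async queue is what makes the event meaningful.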


Thank you, @Robert_Crovella, for your response.

I am working on an application for the Jetson AGX Xavier, where I aim to utilize the GPU for continuous data processing using OpenACC. I have two functions that I plan to offload to the GPU using OpenACC. These functions will continuously wait for events to process data. Additionally, I have a GPU thread dedicated to managing buffer sharing between the CPU and GPU.

  • GPU Thread: Continuously waits for an event from the CPU. When the CPU fills the buffer and sends an event, the GPU Thread copies the data from the CPU buffer to another buffer on the GPU and triggers an event for Function 1.
  • Function 1: Waits for a data buffer from the GPU Thread via an event. Once triggered, it copies the data from the buffer, processes it, and sends an event to Function 2 on the GPU.
  • Function 2: Waits for the event from Function 1. Upon receiving the event, it processes the data and, after completing the processing, sends the buffer back to the CPU using an event.

My questions are:

  1. Is it possible to handle events using the CUDA events API between the host (CPU) and the GPU Thread, as well as between Function 2 and the CPU?
  2. Can event handling between the GPU Thread and Function 1, as well as between Function 1 and Function 2 (both offloaded using OpenACC kernels pragmas), be implemented effectively?
  3. Furthermore, is it feasible for all processing, including event handling between the GPU Thread, Function 1, and Function 2, to remain entirely on the GPU without involving the CPU once the data has been transferred?

Does this approach seem correct?

A CUDA thread running on the GPU has no knowledge of events (which are created and recorded via host calls to cudaEventCreate/cudaEventRecord), nor any access to them.


@Robert_Crovella Thank you again.

What if we have two functions, Function 1 and Function 2, offloaded to the GPU using #pragma acc kernels? Can we use cudaEventRecord in Function 1 to record an event and cudaEventSynchronize in Function 2 for synchronization?

In my understanding, the code within #pragma acc kernels is converted for GPU execution. Is this approach feasible with OpenACC and CUDA?

Yes, that should be possible.

When using events, it's necessary that an event actually be recorded before you begin waiting on it. Sometimes this is confusing.

If you have done a cudaEventCreate() on an event, but have not yet done a cudaEventRecord() on that event, and you call e.g. cudaEventSynchronize() or cudaStreamWaitEvent() in the same or another thread, there won’t be the expected synchronization. Either mechanism will “complete” immediately.
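To make the point concrete, here is a tiny self-contained illustration of that pitfall using only the standard CUDA runtime API (no OpenACC involved):

```cpp
#include <cuda_runtime.h>

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t ev;
    cudaEventCreate(&ev);

    // No cudaEventRecord(ev, ...) has happened yet, so both of the
    // following complete immediately -- they do not wait for a future record:
    cudaEventSynchronize(ev);           // returns at once
    cudaStreamWaitEvent(stream, ev, 0); // inserts no dependency

    // Only after the event is recorded does waiting become meaningful:
    cudaEventRecord(ev, stream);
    cudaEventSynchronize(ev); // waits for work queued on stream before the record

    cudaEventDestroy(ev);
    cudaStreamDestroy(stream);
    return 0;
}
```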

Typically in CUDA we use streams to organize/express basic dependencies. If you want kernel (e.g. #pragma acc kernels) functions to run in order, the usual approach would be to launch them into the same stream.
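As a sketch of that same-stream approach in OpenACC terms (untested, assuming NVHPC, where regions placed on the same async queue share one underlying CUDA stream; the loop bodies are placeholders for the real work):

```cpp
// Sketch (untested), assuming NVHPC OpenACC: regions on the same async
// queue map to the same underlying CUDA stream, so the second region
// cannot start before the first completes.
void pipeline(float *buf, int n)   // hypothetical example function
{
    #pragma acc data copy(buf[0:n])
    {
        #pragma acc kernels async(1)
        for (int i = 0; i < n; ++i)
            buf[i] *= 2.0f;        // "Function 1": first processing stage

        #pragma acc kernels async(1)   // same queue => runs after Function 1
        for (int i = 0; i < n; ++i)
            buf[i] += 1.0f;        // "Function 2": second processing stage

        #pragma acc wait(1)        // host blocks until both regions finish
    }
}
```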

Before we get too far into this, I also want to mention that I usually suggest folks post OpenACC questions on the compiler forum for OpenACC. You’re welcome to post here, but Mat, who spends time in that forum, is much more knowledgeable about OpenACC than I am. In this case you may be straddling topics, so do whatever makes sense to you.


Thank you, @Robert_Crovella , for the clarification.

I will also try posting on the compiler forum to gather additional insights.

In my case, Function 1 and Function 2 need to run in parallel such that Function 1 starts processing a new buffer immediately after handing the current buffer to Function 2. This way, both functions operate simultaneously, but each works on a separate buffer. I believe the same-stream mechanism and event handling may not work effectively in this scenario.

CUDA does not guarantee simultaneity, although it does establish requirements that must be met to achieve it. One of those requirements is that Function 1 and Function 2 be launched into separate streams. At the same time, it seems that Function 2 probably should not start until its buffer has been handed off by Function 1. It’s not clear why such an arrangement would require simultaneity, but putting that aside, kernel-to-kernel communication is generally frowned upon and might not be a great design for beginners to tackle.
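For reference, one possible host-orchestrated sketch of that separate-stream arrangement (untested, assuming NVHPC OpenACC), which avoids kernel-to-kernel communication entirely: the dependency between the two stages is enqueued on the queues themselves rather than handled inside device code. `NBUF`, `bufs`, and `n` are hypothetical, and the buffers are assumed to already be present on the device (e.g. via an enclosing acc data region):

```cpp
// Host-orchestrated sketch (untested, assuming NVHPC OpenACC): Function 1
// runs on async queue 1 and Function 2 on queue 2, with "wait(1) async(2)"
// enqueuing the cross-queue dependency instead of blocking the host.
for (int k = 0; k < NBUF; ++k) {
    float *buf = bufs[k];   // hypothetical array of device-resident buffers

    #pragma acc kernels async(1) present(buf[0:n])   // Function 1, queue 1
    for (int i = 0; i < n; ++i)
        buf[i] *= 2.0f;

    // Queue 2 must not run Function 2 on this buffer until Function 1
    // has finished with it; enqueue that dependency without a host stall.
    #pragma acc wait(1) async(2)

    #pragma acc kernels async(2) present(buf[0:n])   // Function 2, queue 2
    for (int i = 0; i < n; ++i)
        buf[i] += 1.0f;

    // The host loop continues immediately, so Function 1 on buffer k+1
    // can overlap with Function 2 on buffer k.
}
#pragma acc wait   // drain both queues before using the results
```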