OpenACC + CUDA implementation

I am working on an application for the Jetson AGX Xavier, where I aim to utilize the GPU for continuous data processing using OpenACC. I have two functions that I plan to offload to the GPU using OpenACC. These functions will continuously wait for events to process data. Additionally, I have a GPU thread dedicated to managing buffer sharing between the CPU and GPU.

  • GPU Thread: Continuously waits for an event from the CPU. When the CPU fills the buffer and sends an event, the GPU Thread copies the data from the CPU buffer to another buffer on the GPU and triggers an event for Function 1.
  • Function 1: Waits for a data buffer from the GPU Thread via an event. Once triggered, it copies the data from the buffer, processes it, and sends an event to Function 2 on the GPU.
  • Function 2: Waits for the event from Function 1. Upon receiving the event, it processes the data and, after completing the processing, sends the buffer back to the CPU using an event.
  • (Both functions and the GPU thread need to run in parallel, so that after handing a buffer to the next stage, each function can immediately start working on the next buffer.)

My questions are:

  1. Is it possible to handle events using the CUDA events API between the host (CPU) and the GPU Thread, as well as between Function 2 and the CPU?
  2. Can event handling between the GPU Thread and Function 1, as well as between Function 1 and Function 2 (both offloaded using OpenACC kernels pragmas), be implemented effectively?
  3. Furthermore, is it feasible for all processing, including event handling between the GPU Thread, Function 1, and Function 2, to remain entirely on the GPU without involving the CPU once the data has been transferred?
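For reference, this is roughly the host-side pattern I have in mind for question 1; it is only a sketch, and processA/processB and the buffer are placeholders, not part of my real code:

```cuda
// Sketch: ordering work between two streams with the CUDA events API.
// processA/processB are placeholder kernels, not part of my real code.
void pipeline_step(unsigned char *buf, cudaStream_t stream1, cudaStream_t stream2)
{
    cudaEvent_t evt;
    cudaEventCreateWithFlags(&evt, cudaEventDisableTiming);

    processA<<<1, 1, 0, stream1>>>(buf);    // producer work on stream1
    cudaEventRecord(evt, stream1);          // record completion point on stream1
    cudaStreamWaitEvent(stream2, evt, 0);   // stream2 waits for that point
    processB<<<1, 1, 0, stream2>>>(buf);    // consumer work on stream2

    cudaEventDestroy(evt);
}
```

My understanding is that events are recorded and waited on from the host side, so I am unsure whether a persistently running kernel can itself wait on an event, which is really the crux of questions 2 and 3.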

Does this approach seem correct?

Hi muhammad.ahmed3,

Unfortunately I don’t have experience in this area to give you advice on the algorithmic details. You might consider posting your questions (or moving this post) to the CUDA programming forum.

Though I will say that OpenACC is interoperable with CUDA and nvc++ does have limited support for compiling CUDA directly.
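As a minimal illustration of that interoperability (the names here are invented for the example, not from your code), an OpenACC-managed array can be handed to a CUDA kernel through a host_data region:

```cuda
// Invented example: pass an OpenACC-managed device array to a CUDA kernel.
__global__ void scale(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

void run(float *a, int n)
{
    #pragma acc data copy(a[0:n])
    {
        #pragma acc host_data use_device(a)
        scale<<<(n + 255) / 256, 256>>>(a, n);  // 'a' is the device pointer here
        cudaDeviceSynchronize();                // wait before data copies back
    }
}
```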

-Mat


Hi @MatColgrove Thank you for your response.

I have built my C++ code with CUDA kernels plus functions offloaded to the GPU using OpenACC. I am now seeing a warning related to the limited CUDA support in nvc++ that you described in your previous comment.

Is there any other compiler available that can be used to compile this type of implementation with CUDA threads and OpenACC?

I believe nvc++ is the only compiler that can do both CUDA and OpenACC.

Do you have a reproducing example program you can share, as well as more details on the warning? While I’m not able to help with your algorithm, if I have an example I might be able to help you work through the issue, or at least know who to ask.

Hi @MatColgrove I have created two GPU kernels, gpu_kernel1 and gpu_kernel2, and I have offloaded a function stream_demux1 to the GPU asynchronously using OpenACC.
I am getting the following warnings when compiling my code with nvc++ (nvc++ -acc -cuda cuda_openacc.cu):

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////
“cuda_openacc.cu”, line 262: warning: CUDA C++ compilation is unsupported in nvc++; nvcc is the recommended CUDA compiler [cuda_compile]
gpu_kernel1<<<1, 1, 0, stream1>>>(gpu_input_buffer_1, gpu_input_buffer_2, stream_demux_input_buffer);
^

Remark: individual warnings can be suppressed with "--diag_suppress <warning-name>"

“cuda_openacc.cu”, line 263: warning: CUDA C++ compilation is unsupported in nvc++; nvcc is the recommended CUDA compiler [cuda_compile]
gpu_kernel2<<<1, 1, 0, stream2>>>(gpu_output_buffer_1, gpu_output_buffer_2, stream_demux_output_buffer);
^

“cuda_openacc.cu”, line 279: warning: statement is unreachable [code_is_unreachable]
DeallocateGPUBuffers();
^

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Here is the example program in cuda_openacc.cu file
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include <stdio.h>
#include <unistd.h>
#include <zmq.h>
#include <cmath>
#include <string>
#include <cstring>
#include <iomanip>
#include <sys/mman.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <openacc.h>
#include <device_launch_parameters.h>

using namespace std;

/****************MACROS DEFINITIONS************************************/

#define DEBUG 					0
#define BUFFER_SIZE 			64*1024 // 64kBytes
#define UWord8  				unsigned char
#define UWord16 				unsigned short
#define UWord32					unsigned int
#define LOCALHOST_IP 			string("tcp://127.0.0.1:")

// Fixed parameters for Stream Demux Block
#define LENGTH1 				24 // Length of the first output
#define LENGTH2 				4  // Length of the second output
#define TOTAL_LENGTH 			(LENGTH1 + LENGTH2) // Total length per cycle
#define VECTOR_LENGTH 			1 // Single items, no vectors
#define BUFFER_FILLED 			1
#define BUFFER_EMPTY  			2
#define GPU_CLOSE  				0
/**********************************************************************/

/*****************GLOBAL VARIABLES*************************************/
UWord8* gpu_input_buffer_1;
UWord8* gpu_input_buffer_2;
UWord8* stream_demux_input_buffer;
UWord8* stream_demux_output_buffer;
UWord8* gpu_output_buffer_1;
UWord8* gpu_output_buffer_2;

//Unified Memory Shared Between CPU & GPU
__device__ __managed__ volatile unsigned int gpu_input_buffer_1_flag = 0;
__device__ __managed__ volatile unsigned int gpu_input_buffer_2_flag = 0;
__device__ __managed__ volatile unsigned int gpu_output_buffer_1_flag = 0;
__device__ __managed__ volatile unsigned int gpu_output_buffer_2_flag = 0;
__device__ __managed__ volatile unsigned int gpu_input_buffer_1_size = 0;
__device__ __managed__ volatile unsigned int gpu_input_buffer_2_size = 0;
__device__ __managed__ volatile unsigned int gpu_output_buffer_1_size = 0;
__device__ __managed__ volatile unsigned int gpu_output_buffer_2_size = 0;

//GPU Memory
__device__ volatile unsigned int strm_dmx_input_buffer_flag = 0;
__device__ volatile unsigned int strm_dmx_input_buffer_size = 0;
__device__ volatile unsigned int strm_dmx_output_buffer_flag = 0;
__device__ volatile unsigned int strm_dmx_output_buffer_size = 0;

/**********************************************************************/
__device__ void gpuSleep(unsigned int cycles)
{
    unsigned int start = clock();  // Get current clock cycle //675 Mhz
    while (clock() - start < cycles) {
        // Busy-wait
    }
}

void stream_demux1(UWord8* stream_demux_input_buffer, UWord8* stream_demux_output_buffer)
{
	size_t num_cycles = 0;
	UWord32 done = 0;

	printf("OpenACC Stream Demux Function \r\n");

	#pragma acc data present(strm_dmx_input_buffer_flag, strm_dmx_output_buffer_flag)
	#pragma acc data present(strm_dmx_input_buffer_size, strm_dmx_output_buffer_size)

	while(1)
	{
		if(strm_dmx_input_buffer_flag == BUFFER_FILLED)
		{
			do
			{
//				usleep(10);
			} while(strm_dmx_output_buffer_flag == BUFFER_FILLED);

			if(strm_dmx_output_buffer_flag == BUFFER_EMPTY)
			{
				num_cycles = strm_dmx_input_buffer_size / TOTAL_LENGTH;

				for (size_t i = 0; i < num_cycles; i++)
				{
					memcpy(stream_demux_output_buffer + i * LENGTH1, stream_demux_input_buffer + i * TOTAL_LENGTH, LENGTH1);
//					memcpy(output2 + i * LENGTH2, input + i * TOTAL_LENGTH + LENGTH1, LENGTH2);
				}

				strm_dmx_output_buffer_size = num_cycles * LENGTH1;
				strm_dmx_output_buffer_flag = BUFFER_FILLED;
				strm_dmx_input_buffer_flag = BUFFER_EMPTY;
			}
		}
		else if (strm_dmx_input_buffer_flag == GPU_CLOSE)
		{
			strm_dmx_output_buffer_flag = GPU_CLOSE;
			printf("Closing Stream Demux Function \r\n");
			break;
		}
		else
		{
//			usleep(10);
		}
	}
}



// CUDA kernel declaration (without returning anything)
__global__ void gpu_kernel1(UWord8* gpu_input_buffer_1, UWord8* gpu_input_buffer_2, UWord8* stream_demux_input_buffer)
{

	printf("GPU Kernel 1:  Waiting for Buffer \r\n");

    while (true)
    {
    	if(gpu_input_buffer_1_flag == BUFFER_FILLED)
    	{
    		do
    		{
    			gpuSleep(675);
    		}while(strm_dmx_input_buffer_flag != BUFFER_EMPTY);

			memcpy(stream_demux_input_buffer, gpu_input_buffer_1, gpu_input_buffer_1_size);

    		strm_dmx_input_buffer_size = gpu_input_buffer_1_size;

    		strm_dmx_input_buffer_flag = BUFFER_FILLED;

    	    gpu_input_buffer_1_flag = BUFFER_EMPTY;
    	}
    	else if(gpu_input_buffer_2_flag == BUFFER_FILLED)
		{
    		do
    		{
    			gpuSleep(675);
    		}while(strm_dmx_input_buffer_flag != BUFFER_EMPTY);

			memcpy(stream_demux_input_buffer, gpu_input_buffer_2, gpu_input_buffer_2_size);

    		strm_dmx_input_buffer_size = gpu_input_buffer_2_size;

    		strm_dmx_input_buffer_flag = BUFFER_FILLED;

    	    gpu_input_buffer_2_flag = BUFFER_EMPTY;
		}
    	else if(gpu_input_buffer_1_flag == GPU_CLOSE || gpu_input_buffer_2_flag == GPU_CLOSE)
    	{
    		printf("Closing GPU Kernel 1 Thread \r\n");
    		strm_dmx_input_buffer_flag = GPU_CLOSE;
    		break;
    	}
    	else
    	{
    		gpuSleep(337837); // Sleep for 500usec // GPU clock Freq = 675Mhz
    	}

    }
}

// CUDA kernel declaration (without returning anything)
__global__ void gpu_kernel2(UWord8* gpu_output_buffer_1, UWord8* gpu_output_buffer_2, UWord8* stream_demux_output_buffer)
{

	printf("GPU Kernel 2: Waiting for Buffer \r\n");

    while (true)
    {
    	if(strm_dmx_output_buffer_flag == BUFFER_FILLED)
    	{
    		if(gpu_output_buffer_1_flag == BUFFER_EMPTY)
    		{
    			memcpy(gpu_output_buffer_1, stream_demux_output_buffer, strm_dmx_output_buffer_size);

    			gpu_output_buffer_1_size = strm_dmx_output_buffer_size;
    			gpu_output_buffer_1_flag = BUFFER_FILLED;
    			strm_dmx_output_buffer_flag = BUFFER_EMPTY;
    		}
    		else if(gpu_output_buffer_2_flag == BUFFER_EMPTY)
    		{
    			memcpy(gpu_output_buffer_2, stream_demux_output_buffer, strm_dmx_output_buffer_size);

    			gpu_output_buffer_2_size = strm_dmx_output_buffer_size;
    			gpu_output_buffer_2_flag = BUFFER_FILLED;
    			strm_dmx_output_buffer_flag = BUFFER_EMPTY;
    		}
        	else
        	{
        		gpuSleep(675);
        	}
    	}
    	else if(strm_dmx_output_buffer_flag == GPU_CLOSE)
    	{
    		gpu_output_buffer_1_flag = GPU_CLOSE;
    		gpu_output_buffer_2_flag = GPU_CLOSE;

    		printf("Closing GPU Kernel 2 Thread \r\n");

    		break;
    	}
    	else
    	{
    		gpuSleep(675);
    	}
    }
}
void allocateGPUBuffers()
{
    cudaMalloc((void **)&gpu_input_buffer_1, BUFFER_SIZE * sizeof(char));
    cudaMalloc((void **)&gpu_input_buffer_2, BUFFER_SIZE * sizeof(char));
    cudaMalloc((void **)&stream_demux_input_buffer, BUFFER_SIZE * sizeof(char));
    cudaMalloc((void **)&stream_demux_output_buffer, BUFFER_SIZE * sizeof(char));
    cudaMalloc((void **)&gpu_output_buffer_1, BUFFER_SIZE * sizeof(char));
    cudaMalloc((void **)&gpu_output_buffer_2, BUFFER_SIZE * sizeof(char));
}

void DeallocateGPUBuffers()
{
	cudaFree(gpu_input_buffer_1);
	cudaFree(gpu_input_buffer_2);
	cudaFree(stream_demux_input_buffer);
	cudaFree(stream_demux_output_buffer);
	cudaFree(gpu_output_buffer_1);
	cudaFree(gpu_output_buffer_2);
}

void gpu_input_process()
{
	cudaStream_t stream1, stream2;

    // Allocate GPU Buffers
	allocateGPUBuffers();

	cudaStreamCreate(&stream1);
	cudaStreamCreate(&stream2);
	
	cout<<"Host Side: Before GPU Kernels " << endl;
	
	gpu_kernel1<<<1, 1, 0, stream1>>>(gpu_input_buffer_1, gpu_input_buffer_2, stream_demux_input_buffer);
	gpu_kernel2<<<1, 1, 0, stream2>>>(gpu_output_buffer_1, gpu_output_buffer_2, stream_demux_output_buffer);

	cout<<"Host Side: Before Stream Demux OpenACC Async" << endl;
	#pragma acc data deviceptr(stream_demux_input_buffer, stream_demux_output_buffer)
	#pragma acc kernels async(1)
	{
		stream_demux1(stream_demux_input_buffer, stream_demux_output_buffer);
	}
	cout<<"Host Side: After Stream Demux OpenACC Async" << endl;
	while(1)
	{
		cout<<"CPU Function Running" << endl;
		sleep(1);
		// Logic FOR CPU
	}

	DeallocateGPUBuffers();

	cudaStreamDestroy(stream1);
	cudaStreamDestroy(stream2);

}

int main(int argc, char **argv)
{

	gpu_input_process();

	return 0;
}

Ah, that warning. nvc++ is not meant as a replacement for nvcc for CUDA compilation. We needed to add some CUDA for our STDPAR implementation which uses Thrust under the hood.

The warning is just there to say that nvc++ doesn’t fully support CUDA, so if there is a problem with the CUDA code, it’s not something we’d necessarily fix. If it works, great! Ignore the warning and continue on. If it doesn’t, let me know and we can see about finding a work-around. I just wouldn’t submit a bug report for the CUDA code.

If this isn’t a good fit for you, what you’d want to do instead is split the CUDA and OpenACC sections into separate files. Compile the CUDA code with nvcc and the OpenACC code with nvc++, then link with nvc++.

Alternately, you can compile with nvcc and set the host compiler to be nvc++. The problem there is that nvcc doesn’t know to register the OpenACC kernels during link, so you want to link with nvc++ and likely need to set the host flags to use "-gpu=nordc", since nvcc disables RDC by default.
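As a sketch, with invented file names (and flags that may need adjusting for your setup), the two build routes might look like:

```shell
# Route 1: CUDA parts with nvcc, OpenACC parts with nvc++, link with nvc++
nvcc  -c kernels.cu        -o kernels.o
nvc++ -acc -c acc_part.cpp -o acc_part.o
nvc++ -acc -cuda kernels.o acc_part.o -o app

# Route 2: nvcc drives the build with nvc++ as the host compiler,
# disabling RDC on the nvc++ side to match nvcc's default
nvcc -ccbin nvc++ -Xcompiler "-acc -gpu=nordc" -c mixed.cu -o mixed.o
nvc++ -acc -gpu=nordc mixed.o -o app
```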

Hi @MatColgrove Thank you for this clarification.

For now, I have started running my code after compiling it with the nvc++ compiler, ignoring the warnings related to kernel launches.

This is the test code I am running on the Jetson AGX Xavier. It contains three kernels, each executing with a single block and a single thread:

  • Kernel 1 and Kernel 2 communicate via a flag in device memory. Kernel 1 sets the flag, and Kernel 2 reads it. This process runs continuously inside a while(1) loop.
  • Kernel 3 waits for a flag from the host (CPU). The CPU sets the flag, and Kernel 3 reads it and prints the value.

I compiled the code using: nvc++ -cuda test_app.cu

However, when I run the code, I do not see any prints from the kernel threads. Additionally, my host side gets stuck at cudaMemcpy when setting the flag for Kernel 3.

Checking the GPU status with tegrastats, I see: GR3D_Freq 99%@675

This indicates that all GPU cores are being utilized at full capacity.

Given that my code only involves a few threads (2-3 at most), I would expect it to use only a small fraction of the 512-core GPU. Why is it consuming all the GPU cores?

Could you please review my code and help me understand what mistake I might be making? Also, how can I optimize my implementation to align with the intended logic?

Thanks!

Following is the test code I am using:

#include <iostream>
#include <cuda_runtime.h>
#include <unistd.h>

__global__ void kernel1(volatile unsigned int* flag1) {
    while (true) {
        printf("Kernel1: Setting flag1.\n");
        atomicExch((unsigned int*)flag1, 1);
        __threadfence(); // Ensure visibility to other threads
        __nanosleep(10000000); // Sleep for 10ms
    }
}

__global__ void kernel2(volatile unsigned int* flag1) {
    while (true) {
        if (atomicCAS((unsigned int*)flag1, 1, 0) == 1) { // Only clear if set
            printf("Kernel2: Received flag1. Clearing flag1.\n");
            __threadfence();
        }
        __nanosleep(10000000); // Sleep for 10ms
    }
}

__global__ void kernel3(volatile unsigned int* flag2) {
    while (true) {
        if (atomicCAS((unsigned int*)flag2, 1, 0) == 1) { // Check and clear flag
            printf("Kernel3: Received flag2 from host. Clearing flag2.\n");
            __threadfence();
        }
        __nanosleep(10000000); // Sleep for 10ms
    }
}

int main() {
    unsigned int h_flag2 = 0; // Host flag
    unsigned int *d_flag1, *d_flag2;

    cudaMalloc((void**)&d_flag1, sizeof(unsigned int));
    cudaMalloc((void**)&d_flag2, sizeof(unsigned int));

    cudaMemset(d_flag1, 0, sizeof(unsigned int));
    cudaMemset(d_flag2, 0, sizeof(unsigned int));

    // Launch single-threaded kernels
    kernel1<<<1,1>>>(d_flag1);
    kernel2<<<1,1>>>(d_flag1);
    kernel3<<<1,1>>>(d_flag2);

    sleep(2); // Allow kernels to start

    printf("Host: Setting flag2 for Kernel3.\n");
    h_flag2 = 1;
    cudaMemcpy(d_flag2, &h_flag2, sizeof(unsigned int), cudaMemcpyHostToDevice);

    sleep(1);

    cudaFree(d_flag1);
    cudaFree(d_flag2);

    return 0;
}

I don’t know Tegra systems myself nor tegrastats, but it looks like this is the percent utilization relative to the frequency being used. If there’s at least one thread running on a multiprocessor (I think there are 8 SMs), then most likely the whole multiprocessor needs to be powered up. So I’d interpret this number to mean the percentage of cores powered at this frequency, not the number of cores actively being used. Also, I think the max frequency on this system is ~1.4GHz with a nominal frequency of 854MHz, so at 675MHz it’s not running at full frequency.

Granted, I’m just guessing at how to interpret “GR3D_Freq”, but if your concern is that more than one thread is being launched, then this isn’t the case. There will always be at least one warp (32 threads) resident, though only one thread will actually be doing anything.

As for the algorithm, you might be better off asking the question over on the CUDA forum, as I’m not an expert in CUDA (I advise on using the NVHPC SDK and directive-based models).

Though I don’t think this is going to work. Essentially you’re trying to create a mutex that’s visible across kernels. Atomic operations work by flushing any pending reads and evicting the variable from cache, which forces other threads in the same kernel to reread it. But the other kernel is likely still reading from its own cache, not knowing the global variable has been updated.

Maybe there’s a way to do this, and perhaps someone on the CUDA forums can give you ideas?
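For what it’s worth, one pattern sometimes suggested for cross-kernel flags (untested here; it needs a recent CUDA toolkit with libcu++ and compute capability 7.0+, which the Xavier’s Volta GPU meets) is a system-scoped atomic from <cuda/atomic>, which is defined to be visible across concurrently running kernels and the host:

```cuda
#include <cuda/atomic>

// Hypothetical sketch: spin on a flag with system-scope semantics so the
// update is guaranteed visible across concurrently running kernels.
// raw_flag should live in managed (or otherwise system-visible) memory.
__device__ void wait_and_consume(unsigned int *raw_flag)
{
    cuda::atomic_ref<unsigned int, cuda::thread_scope_system> flag(*raw_flag);
    while (flag.load(cuda::memory_order_acquire) != 1)
        __nanosleep(1000);                      // back off between polls
    flag.store(0, cuda::memory_order_release);  // consume the flag
}
```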

One idea (and I could be wrong about this approach) is to have a main kernel with one thread that handles the control flow. Instead of launching multiple kernels from main, you launch the worker kernels from within this main kernel (i.e., dynamic/nested parallelism).
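A bare-bones sketch of that idea (kernel names invented for illustration; dynamic parallelism needs nvcc with -rdc=true, and note that device-side cudaDeviceSynchronize() was removed in CUDA 12, though the Xavier’s JetPack toolkits predate that):

```cuda
// Invented sketch: one control thread launches worker kernels on demand.
__global__ void worker(unsigned char *buf, unsigned int size)
{
    // many-threaded processing of one buffer
}

__global__ void controller(unsigned char *buf,
                           volatile unsigned int *flag,
                           volatile unsigned int *size)
{
    while (true) {
        if (*flag == BUFFER_FILLED) {
            worker<<<8, 256>>>(buf, *size);  // nested launch from the device
            cudaDeviceSynchronize();         // pre-CUDA-12 device-side sync
            *flag = BUFFER_EMPTY;
        } else if (*flag == GPU_CLOSE) {
            break;
        }
    }
}
```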

Now if you go this route, you’ll need to stick with just CUDA as we don’t support nested parallelism in OpenACC. Also, if you’re using pure CUDA, then you should use nvcc in case you encounter a construct nvc++ doesn’t support.