Using an OpenMP thread for GPU traffic

I have an embarrassingly parallel problem that I have parallelized on GPUs with OpenACC, but I have recently been rethinking how it is set up. Right now the CPU just sends whatever data the GPU needs for the calculation and then waits to receive the results back once the GPU is done. However, I would like the CPU to take on a share of the work that the GPU currently does all of. For example, I want to do something like this:

#pragma omp parallel
{

#pragma omp single nowait
{
// OpenACC calls here, where the GPU would do a large ratio
// of the total number of iterations to be done
}

#pragma omp single
{
cpu_for_loop();
}
}

Where cpu_for_loop() is something like

void cpu_for_loop()
{
#pragma omp parallel for
for (remaining number of iterations) {
// Calculations
}
return;
}

However, in my attempts at this I cannot get it to work. I have set omp_set_nested(1) and omp_set_max_active_levels(2), and tried many combinations of the two. I also started out with the cpu_for_loop() code written directly inside the parallel region, but I read that nested parallelism with the PGI compiler is only supported when it is wrapped in a function call.
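For reference, here is roughly how I have been placing those runtime calls around the region above (just a sketch; gpu_openacc_work() is only a placeholder name for the OpenACC section):

#include <omp.h>

extern void cpu_for_loop(void);      // the helper shown above
extern void gpu_openacc_work(void);  // placeholder wrapper around the OpenACC calls

void hybrid_attempt(void)
{
    omp_set_nested(1);               // older API for enabling nested parallelism
    omp_set_max_active_levels(2);    // allow two active levels of parallelism

    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            gpu_openacc_work();      // GPU side: the OpenACC region
        }

        #pragma omp single
        {
            cpu_for_loop();          // CPU side: contains its own "#pragma omp parallel for"
        }
    }
}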

Thanks for any help with this.

-Anthony

Hi Anthony,

You probably want to do something more like:

// Add a data region here
#pragma acc data copy(.. vars ...)
{

// OpenACC calls here, where the GPU would do a large ratio 
// of the total number of iterations to be done 
#pragma acc parallel loop default(present) async
for (...)  {

}

// Using the async clause will have the host code continue
// executing after the end of the OpenACC parallel region 

// Next start the CPU parallel loops
#pragma omp parallel for 
for (remaining number of iterations) { 
// Calculations 
} 

// have the CPU wait for the GPU computation to finish
#pragma acc wait

}  // end the OpenACC data region and copy back the data

Of course, this will only work if there are no data dependencies between the iterations. Load balancing between the two (i.e., how many iterations to schedule on each) can be tricky as well.
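To make that concrete, here is a minimal self-contained sketch of the pattern, assuming a simple element-wise kernel. compute_split, the 2.0*a[i] calculation, and the 90/10 split are all just placeholders that you would tune for your own code:

#include <stdlib.h>

// Hypothetical element-wise kernel split between GPU and CPU.
void compute_split(const double *a, double *b, int n)
{
    int split = (int)(0.9 * n);   // GPU handles [0, split), CPU handles [split, n)

    // Only the GPU's slice of the arrays needs to live on the device
    #pragma acc data copyin(a[0:split]) copyout(b[0:split])
    {
        // Launch the GPU portion asynchronously so the host keeps going
        #pragma acc parallel loop default(present) async
        for (int i = 0; i < split; ++i)
            b[i] = 2.0 * a[i];

        // Meanwhile the CPU threads work on the remaining iterations
        #pragma omp parallel for
        for (int i = split; i < n; ++i)
            b[i] = 2.0 * a[i];

        // Wait for the GPU before the data region copies b[0:split] back
        #pragma acc wait
    }
}

int main(void)
{
    int n = 1000000;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) a[i] = (double)i;

    compute_split(a, b, n);

    free(a);
    free(b);
    return 0;
}

With the PGI compilers you would build something like this with both -acc and -mp so that the OpenACC and OpenMP directives are enabled together.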

Hope this helps,
Mat

Mat,

This seems to solve the problem and the work is being split up correctly. Thank you for your help, you have saved me a lot of time.

Anthony