Using an OpenMP thread for GPU traffic

I have an embarrassingly parallel problem that I have parallelized on GPUs with OpenACC, but I have recently been rethinking how it is set up. Right now the CPU just sends whatever data the GPU needs for the calculation and then waits to receive the results back once the GPU is done. However, I would like the CPU to take on a share of the work that the GPU currently does all of. For example, I want to do something like this:

#pragma omp parallel
{

#pragma omp single nowait
{
// OpenACC calls here, where the GPU would do a large ratio
// of the total number of iterations to be done
}

#pragma omp single
{
cpu_for_loop();
}
}

Where cpu_for_loop() is something like

void cpu_for_loop()
{
#pragma omp parallel for
for (remaining number of iterations) {
// Calculations
}
return;
}

However, in my attempts at this I cannot get it to work. I have set omp_set_nested(1) and omp_set_max_active_levels(2), and tried many combinations of the two. I also started out with the cpu_for_loop() code written directly inside the parallel region, but I read that nested parallelism with the PGI compiler is only supported when it is wrapped in a function call.
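For reference, here is roughly how I have been placing those runtime calls around the region above (just a sketch; gpu_openacc_work() is only a placeholder name for the OpenACC section):

#include <omp.h>

extern void cpu_for_loop(void);      // the helper shown above
extern void gpu_openacc_work(void);  // placeholder wrapper around the OpenACC calls

void hybrid_attempt(void)
{
    omp_set_nested(1);               // older API for enabling nested parallelism
    omp_set_max_active_levels(2);    // allow two active levels of parallelism

    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            gpu_openacc_work();      // GPU side: the OpenACC region
        }

        #pragma omp single
        {
            cpu_for_loop();          // CPU side: contains its own "#pragma omp parallel for"
        }
    }
}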

Thanks for any help with this.

-Anthony

Hi Anthony,

You probably want to do something more like:

// Add a data region here
#pragma acc data copy(.. vars ...)
{

// OpenACC calls here, where the GPU would do a large ratio 
// of the total number of iterations to be done 
#pragma acc parallel loop default(present) async
for (...)  {

}

// Using the async clause will have the host code continue
// executing after the end of the OpenACC parallel region 

// Next start the CPU parallel loops
#pragma omp parallel for 
for (remaining number of iterations) { 
// Calculations 
} 

// have the CPU wait for the GPU computation to finish
#pragma acc wait

}  // end the OpenACC data region and copy back the data

Of course, this will only work if there are no data dependencies between the iterations. Load balancing between the two (i.e., how many iterations to schedule on each) can be tricky as well.
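To make that concrete, here is a minimal self-contained sketch of the pattern, assuming a simple element-wise kernel. compute_split, the 2.0*a[i] calculation, and the 90/10 split are all just placeholders that you would tune for your own code:

#include <stdlib.h>

// Hypothetical element-wise kernel split between GPU and CPU.
void compute_split(const double *a, double *b, int n)
{
    int split = (int)(0.9 * n);   // GPU handles [0, split), CPU handles [split, n)

    // Only the GPU's slice of the arrays needs to live on the device
    #pragma acc data copyin(a[0:split]) copyout(b[0:split])
    {
        // Launch the GPU portion asynchronously so the host keeps going
        #pragma acc parallel loop default(present) async
        for (int i = 0; i < split; ++i)
            b[i] = 2.0 * a[i];

        // Meanwhile the CPU threads work on the remaining iterations
        #pragma omp parallel for
        for (int i = split; i < n; ++i)
            b[i] = 2.0 * a[i];

        // Wait for the GPU before the data region copies b[0:split] back
        #pragma acc wait
    }
}

int main(void)
{
    int n = 1000000;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) a[i] = (double)i;

    compute_split(a, b, n);

    free(a);
    free(b);
    return 0;
}

With the PGI compilers you would build something like this with both -acc and -mp so that the OpenACC and OpenMP directives are enabled together.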

Hope this helps,
Mat

Mat,

This seems to solve the problem and the work is being split up correctly. Thank you for your help, you have saved me a lot of time.

Anthony