I have an embarrassingly parallel problem that I have parallelized on GPUs with OpenACC, but I have recently been rethinking how the work is divided. Right now, the CPU sends whatever data the GPU needs for the calculation, and then it waits to receive the results back from the GPU once it is done. Instead, I want the CPU to take over part of the work that the GPU currently does in its entirety. For example, I want to do something like this:
#pragma omp parallel
{
    #pragma omp single nowait
    {
        // OpenACC calls here, where the GPU would do a large
        // fraction of the total number of iterations
    }
    #pragma omp single
    {
        cpu_for_loop();
    }
}
Where cpu_for_loop() is something like
void cpu_for_loop()
{
    // This is the nested (second-level) parallel region; n_gpu and
    // n_total stand in for my actual split point and total iteration count
    #pragma omp parallel for
    for (int i = n_gpu; i < n_total; i++) {
        // Calculations on the iterations the GPU does not cover
    }
}
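For context, the OpenACC call inside the first single block would look roughly like this (a simplified sketch; gpu_for_loop, data, and n_gpu are stand-ins for my actual kernel and arrays, and the real data movement is more involved):

void gpu_for_loop(double *data, int n_gpu)
{
    // GPU takes the first n_gpu iterations of the total work
    #pragma acc parallel loop copy(data[0:n_gpu])
    for (int i = 0; i < n_gpu; i++) {
        // Calculations
    }
}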
However, in my attempts at this I cannot get it to work. I have tried omp_set_nested(1) and omp_set_max_active_levels(2), in many combinations. I also started out with the body of cpu_for_loop() written directly into the parallel region, but I read that nested parallelism with the PGI compiler is only supported when the inner region is wrapped in a function call.
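For reference, here is roughly how I am setting things up before the outer parallel region (a sketch of one combination I tried; cpu_for_loop is defined as above):

#include <omp.h>

void cpu_for_loop(void);  // as above

int main(void)
{
    omp_set_nested(1);             // enable nested parallelism (older API)
    omp_set_max_active_levels(2);  // outer region + the nested region in cpu_for_loop

    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            // OpenACC calls for the GPU's share
        }
        #pragma omp single
        {
            cpu_for_loop();        // should spawn a nested team for the CPU's share
        }
    }
    return 0;
}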
Thanks for any help with this.
-Anthony