For most parts of my program, OpenMP directives are replaced with corresponding OpenACC directives. However, some parts (pre- and post-processing) cannot easily be accelerated on the GPU and still rely on multithreading.
I typically use ifdefs where I need to replace OpenMP directives with OpenACC directives.
What’s strange is that when my most time-consuming kernel runs on the GPU, even though it is launched from a section of the code that is not inside an OpenMP parallel region, I still see all threads busy in top and a significant negative impact on performance. OMP_WAIT_POLICY is set to passive.
If I make sure that omp_get_max_threads() returns 1 before the kernel is launched, performance is as expected. I don’t understand why this value affects the performance of my OpenACC block when it is not nested within an OpenMP parallel region.