OpenMP threads busy during kernel execution

Hi,

in most parts of my program I have replaced the OpenMP directives with corresponding OpenACC directives. However, some parts (mainly pre- and postprocessing) cannot easily be accelerated on the GPU and still rely on multithreading.
I typically use ifdefs where I need to replace OpenMP directives with OpenACC directives.
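For illustration, the switching looks something like this (a minimal C-style sketch; the routine and the USE_OPENACC flag name are placeholders, not the actual code):

    #include <stddef.h>

    /* Placeholder routine: USE_OPENACC is an assumed build flag,
       not necessarily the name used in the real code. */
    void scale(double *a, int n, double s)
    {
    #ifdef USE_OPENACC
        #pragma acc parallel loop copy(a[0:n])
    #else
        #pragma omp parallel for
    #endif
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }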
What’s strange is that while my most time-consuming kernel runs on the GPU, even though it is launched from a section of the code that is not an OpenMP parallel region, I still see all threads busy in top and a significant negative impact on performance. OMP_WAIT_POLICY is set to passive.
If I make sure that omp_get_max_threads() returns 1 before the kernel is launched, performance is as expected. I don’t understand why this value should affect the performance of my OpenACC block if it is not nested within an OpenMP parallel region.
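To make that concrete, the workaround currently looks roughly like this (a simplified sketch; the routine and variable names are placeholders):

    #include <omp.h>

    /* Simplified sketch of the workaround described above. */
    void scale_on_gpu(double *a, int n, double s)
    {
        int saved = omp_get_max_threads();
        omp_set_num_threads(1);          /* with this, performance is as expected */

        #pragma acc parallel loop copy(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] *= s;

        omp_set_num_threads(saved);      /* restore for the multithreaded CPU parts */
    }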

Thanks,
LS

Hi LS,

While the threads are still active (they’re spinning on a semaphore), they have called “sched_yield” (or “_sleep” on Windows), so they will yield the processor as soon as any other process needs it. There is a slight delay before the threads call sched_yield, but you can set the environment variable MP_SPIN=1 to lower the number of times they check the semaphore before calling sched_yield (the default is 1000000).

“I don’t understand why this value should affect the performance of my OpenACC block if it is not nested within an OpenMP parallel region.”

While I’m not sure why this would occur, my first thought is that it might be a binding issue. If the CPU thread ends up on a different NUMA node than the one the GPU is attached to, that could cause slowdowns.

How are you binding CPU threads to cores?
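As a quick sanity check, you could print which core the launching CPU thread is on just before the OpenACC region and compare it against the GPU’s NUMA node (e.g. from “nvidia-smi topo -m”). A minimal glibc-specific sketch (the function name is just an example):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Report the core the calling thread is currently running on.
       sched_getcpu() is glibc-specific; call this right before the
       OpenACC kernel launch to see where the host thread landed. */
    void report_cpu(const char *where)
    {
        printf("%s: running on CPU %d\n", where, sched_getcpu());
    }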

-Mat