Nested OpenMP overhead

Hi all,

I have a research code that employs a hybrid combination of MPI + OpenMP + CUDA Fortran. There is a certain part of the code that uses nested OpenMP. Initially, there are 2 threads in this part of the code. One of these threads then spawns 6 threads. When this inner region running the 6 threads terminates, there is an idle time of about 4 to 6 milliseconds before the code can proceed further. This is according to the profiling information provided by the nvprof profiler.

Is this idle time the normal overhead associated with nested OpenMP? Are there any environment variables that can help reduce or eliminate this idle time?

I have been setting the following environment variables:


The number of threads is set while the code runs, using the num_threads clause. I am using PGI compiler version 19.9.
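For reference, the thread structure described above can be sketched roughly as follows. This is only a minimal illustration in C, not the actual CUDA Fortran code: the helper name `nested_sum`, the summation workload, and the use of `omp_get_wtime` to time the inner region are all assumptions made for the example. It times the inner region as a whole (fork, work, and join), so it gives an upper bound on the fork/join cost rather than isolating the idle gap the profiler reports.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch still compiles when OpenMP is not enabled. */
#include <time.h>
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
static int omp_get_thread_num(void) { return 0; }
static void omp_set_max_active_levels(int n) { (void)n; }
#endif

/* Hypothetical helper (not from the original code).
 * Opens an outer region with `outer` threads; outer thread 0 then opens an
 * inner region with `inner` threads that sums the integers 1..n.
 * Returns the sum and stores the wall time of the inner region
 * (fork + work + join) in *elapsed. */
double nested_sum(int outer, int inner, long n, double *elapsed)
{
    double total = 0.0;
    double t = 0.0;

    /* Allow two levels of active parallelism (outer + inner). */
    omp_set_max_active_levels(2);

    #pragma omp parallel num_threads(outer) shared(total, t)
    {
        /* Only one outer thread spawns the inner team, as in the post. */
        if (omp_get_thread_num() == 0) {
            double t0 = omp_get_wtime();
            double sum = 0.0;

            #pragma omp parallel for num_threads(inner) reduction(+:sum)
            for (long i = 1; i <= n; ++i)
                sum += (double)i;

            /* The inner team has joined by the time we get here. */
            t = omp_get_wtime() - t0;
            total = sum;
        }
    }

    *elapsed = t;
    return total;
}
```

Calling `nested_sum(2, 6, 1000, &elapsed)` with a small `n` keeps the work negligible, so `elapsed` is dominated by creating and joining the inner team. OpenMP has to be enabled at compile time (for example `-mp` with PGI or `-fopenmp` with GCC); otherwise the pragmas are ignored and the sketch runs serially.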

That seems normal. I think nvvp can tell you for certain, since its timeline includes the driver and runtime API calls. After the six threads terminate, you can check what the driver/runtime is doing during those 4 to 6 milliseconds.

Thanks, I tried to run nvprof with OpenMP profiling enabled but it gave an error about “incompatible CUDA driver version”. This was with CUDA version 10.1.168 and PGI compiler version 19.9. I will see if there is anything else I can try.

I was previously using CUDA version 9.2.148 to profile my code without any issues, but that older CUDA version does not provide OpenMP profiling support.