NVC doesn't support nested parallel regions on CPU

I recently had our system admins install nvhpc/21.5 on our cluster and was giving it a test run. I was having trouble getting it to run some OpenMP code we have written here, so I got to digging. It turns out that section 2.20 of the HPC Compilers User’s Guide states: “Additionally, nested parallel regions on CPU are not supported”. To verify that I didn’t have bad code, and that this was in fact true, I tested it using this example from https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.Examples.pdf :

#include <stdio.h>
#include <omp.h>
int main (void) {
    omp_set_nested(1);
    omp_set_max_active_levels(8);
    omp_set_dynamic(0);
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        omp_set_num_threads(3);
        #pragma omp parallel
        {
            omp_set_num_threads(4);
            #pragma omp single
            {
                /*
                * The following should print:
                * Inner: max_act_lev=8, num_thds=3, max_thds=4
                * Inner: max_act_lev=8, num_thds=3, max_thds=4
                */
                printf ("Inner: max_act_lev=%d, num_thds=%d, max_thds=%d\n",
                omp_get_max_active_levels(), omp_get_num_threads(),
                omp_get_max_threads());
            }
        }
        #pragma omp barrier
        #pragma omp single
        {
            /*
            * The following should print:
            * Outer: max_act_lev=8, num_thds=2, max_thds=3
            */
            printf ("Outer: max_act_lev=%d, num_thds=%d, max_thds=%d\n",
            omp_get_max_active_levels(), omp_get_num_threads(),
            omp_get_max_threads());
        }
    }
    return 0;
}

Is there a reason that nvc doesn’t support nested parallel regions on the CPU?

It comes down to prioritization of feature implementation in our new NVOMP runtime.

It’s rare to find an application that utilizes nested parallel regions, so this became a low-priority item. Things like adding GPU offload took precedence. It’s a known missing feature, so we should add it at some point, but I don’t know when.

Do you have a compelling case that I can use to justify to engineering to put it higher in the list of items to implement?

-Mat

I can’t include the exact code, but hopefully this pseudocode will illustrate what we are trying to do. Basically, what we have right now are two concurrent CPU regions, followed by a barrier, that set up the GPU region.

#pragma omp parallel
{
    #pragma omp single nowait
    {
        // Set up memory structure for GPU Monte Carlo run.
        // This has a parallel for loop in it where each variable's
        // dispersion is handled separately, in parallel.
    }

    #pragma omp single
    {
        // Run smaller, 'nominal' simulation on CPU not GPU because
        // size of simulation is too small to leverage GPU.
        // This has a parallel for loop within it to parallelize
        // across the CPU.
    }

    // There is an implicit barrier at the end of the single region
    // but make it explicit.
    #pragma omp barrier

    #pragma omp single
    {
        // Launch Monte Carlo on the GPU here.
        // This region is dependent on the previous two single regions' completion,
        // but those two regions have no dependencies between them.
    }
} // End parallel region

What model are you using for the GPU offload? OpenMP Target, OpenACC, CUDA?

What is your intent in using the outer omp parallel region? Are you using it to support multiple GPUs (or multiple concurrent CPU threads on the same GPU)? Are you using it to simulate asynchronous behavior? Both? Something else?

With OpenACC, this is fairly trivial to accomplish using an outer loop (one iteration per device) and the “async” clause. Our implementation of ‘async’ utilizes CUDA streams, which allow for setting up dependency graphs and concurrent host and device execution.

With OpenMP, unfortunately the ‘depend’ clause uses CPU tasks and doesn’t take advantage of CUDA streams, so this may be a bit more difficult.

Note that the nominal simulation may be better run on the GPU if you have data dependencies, since the cost of data movement may outweigh the low compute cost on the GPU. I don’t know your code, so I can’t be sure, but it’s something to consider.

-Mat