Combining OpenMP and OpenACC

I would like to offload part of a big loop to the GPU using OpenACC (relying on managed memory and page migration on a Pascal GPU) and run the rest on the CPU using OpenMP. My idea was to use nested parallelism; however, the nested parallel for loop is not being run by multiple threads.
I could potentially start the acc loop with async, but then I wouldn't be able to measure its runtime. What is the nicest (or simply a working) way to do this?

#pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num()==0)
        {
        //compute_rhs_GPU
            double ratio=find_bandwidth_ratio(9);
           
            gt1 = omp_get_wtime();
            #pragma acc parallel loop present(f[:imax + 2][:jmax + 2], g[:imax + 2][:jmax + 2], \
                                              rhs[:imax + 2][:jmax + 2], flag[:imax + 2][:jmax + 2])
            for (int i=(int) (imax*ratio)+1;i<=imax;i++) {
                for (int j=1;j<=jmax;j++) {
                    if (flag[i][j] & C_F) {
                        /* only for fluid and non-surface cells */
                        rhs[i][j] = (
                           (f[i][j]-f[i-1][j])/delx +
                           (g[i][j]-g[i][j-1])/dely) / del_t;
                    }
                }
            }
            gt2 = omp_get_wtime();
           
        } else {
        //compute_rhs_CPU  
            double ratio=find_bandwidth_ratio(8);
            ct1 = omp_get_wtime();
            #pragma omp parallel for
            for (int i=1;i<=(int) (imax*ratio);i++) {
                for (int j=1;j<=jmax;j++) {
                    if (flag[i][j] & C_F) {
                        /* only for fluid and non-surface cells */
                        rhs[i][j] = (
                           (f[i][j]-f[i-1][j])/delx +
                           (g[i][j]-g[i][j-1])/dely) / del_t;
                    }
                }
            }
            ct2 = omp_get_wtime();
        }
    }

My idea was to use nested parallelism, however, the nested parallel for loop is not being run by multiple threads.

OpenMP nested parallelism is disabled by default but can be enabled by setting the environment variable “OMP_NESTED=true”. See http://www.pgroup.com/resources/docs/17.10/x86/pgi-user-guide/index.htm#openmp-env-vars for details.

Overall, I don’t think the strategy is bad, though what I typically do is use MPI instead of OpenMP and compile the code with “-ta=multicore,tesla:cc60,managed” to create a single binary that can target either a Pascal GPU or a multicore CPU. Then, at runtime, each rank calls the routine “acc_set_device_type” with either “acc_device_nvidia” or “acc_device_host” to select which target it should use.
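The build described above might look like the following (the source file name is a placeholder; flags as in the PGI 17.x toolchain):

```shell
# one binary whose OpenACC regions can run on either a Pascal GPU or the multicore CPU
pgcc -fast -ta=multicore,tesla:cc60,managed -o simulation simulation.c
```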

One thing that I’ve been wanting to try but haven’t had time for is to do the same thing within an OpenMP program. Something like the following:

    double ratio = find_bandwidth_ratio(9);   /* compute the split before the threads need it */

    #pragma omp parallel num_threads(2)
    {
        int start, end, tid;
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            start = (int)(imax * ratio) + 1;
            end   = imax;
            acc_set_device_type(acc_device_nvidia);
        } else {
            start = 1;
            end   = (int)(imax * ratio);
            acc_set_device_type(acc_device_host);
        }

        gt1[tid] = omp_get_wtime();
        #pragma acc parallel loop present(f[:imax + 2][:jmax + 2], g[:imax + 2][:jmax + 2], \
                                          rhs[:imax + 2][:jmax + 2], flag[:imax + 2][:jmax + 2])
        for (int i = start; i <= end; i++) {
            #pragma acc loop vector
            for (int j = 1; j <= jmax; j++) {
                if (flag[i][j] & C_F) {
                    /* only for fluid and non-surface cells */
                    rhs[i][j] = (
                       (f[i][j] - f[i-1][j]) / delx +
                       (g[i][j] - g[i][j-1]) / dely) / del_t;
                }
            }
        }
        gt2[tid] = omp_get_wtime();
    }

I’m not 100% sure it will work as expected, but it would be interesting to try.

Note that I added a “loop vector” pragma to the “j” loop. Alternatively, you might try collapsing the loops. Either should give you better GPU performance, since you’ll be exposing more parallelism and getting better data access.

-Mat

Hi Mat,

Targeting both the CPU and the GPU with OpenACC is a very good idea - we tried doing this, but the compiler gives a warning that multicore execution was disabled:

PGC-W-0155-OpenACC multicore code disabled inside OpenMP parallel construct (simulation.c: 62)

Is there a way to convince the compiler otherwise?

The behavior makes sense. With OpenMP you determine what each thread will do and how many threads to run; with OpenACC you describe what work needs to be done and what resources to use. You can see how using them together would be confusing.

Nested parallelism in OpenMP is supported on the CPUs, but not on the GPU.

I think OpenACC works by running everything on the GPU or everything on the host - single-core or multicore. It does not divide the work between the two.

Another way to look at this is to run a two-thread OpenMP program on the CPU. On one thread, a portion of the work is done on multicore with OpenACC; on the other thread, the rest of the work is done on the GPU with OpenACC. You would create the multicore-only/GPU-only versions of the program and determine the division of work. If no GPU is available, each thread would run on the host only.

dave

Hi Dave,

Thanks for these - the last bit of your comment is what we are trying to do now, because it is by far the cleanest: you only need to write the computational loop once. But then we run into the problem that OpenACC targeting multicore isn’t allowed inside an OpenMP region - is there any way around that? I am perfectly happy with the host target using all of the CPU cores and the GPU target using all of the GPU. OpenMP is only really needed to facilitate concurrent execution on both - I am not sure how else to do this. I could perhaps use async - but how do I then measure the time for both?

Thanks,
Istvan