Combining OpenMP and OpenACC

I would like to offload part of a big loop to the GPU using OpenACC (relying on managed memory and page migration on a Pascal GPU) and run the rest on the CPU using OpenMP. My idea was to use nested parallelism; however, the nested parallel for loop is not being run by multiple threads.
I could potentially start the acc loop with async, but then I wouldn't be able to measure its runtime. What is the nicest (or simply a working) way to do this?

#pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num()==0)
        {
        //compute_rhs_GPU
            double ratio=find_bandwidth_ratio(9);
           
            gt1 = omp_get_wtime();
            #pragma acc parallel loop present(f[:imax + 2][:jmax + 2], g[:imax + 2][:jmax + 2], \
                                              rhs[:imax + 2][:jmax + 2], flag[:imax + 2][:jmax + 2])
            for (int i=(int) (imax*ratio)+1;i<=imax;i++) {
                for (int j=1;j<=jmax;j++) {
                    if (flag[i][j] & C_F) {
                        /* only for fluid and non-surface cells */
                        rhs[i][j] = (
                           (f[i][j]-f[i-1][j])/delx +
                           (g[i][j]-g[i][j-1])/dely) / del_t;
                    }
                }
            }
            gt2 = omp_get_wtime();
           
        } else {
        //compute_rhs_CPU  
            double ratio=find_bandwidth_ratio(8);
            ct1 = omp_get_wtime();
            #pragma omp parallel for
            for (int i=1;i<=(int) (imax*ratio);i++) {
                for (int j=1;j<=jmax;j++) {
                    if (flag[i][j] & C_F) {
                        /* only for fluid and non-surface cells */
                        rhs[i][j] = (
                           (f[i][j]-f[i-1][j])/delx +
                           (g[i][j]-g[i][j-1])/dely) / del_t;
                    }
                }
            }
            ct2 = omp_get_wtime();
        }
    }

My idea was to use nested parallelism, however, the nested parallel for loop is not being run by multiple threads.

OpenMP nested parallelism is disabled by default but can be enabled by setting the environment variable “OMP_NESTED=true”. See http://www.pgroup.com/resources/docs/17.10/x86/pgi-user-guide/index.htm#openmp-env-vars for details.

Overall, I don’t think the strategy is bad, though what I typically do is use MPI instead of OpenMP and compile the code with “-ta=multicore,tesla:cc60,managed” to create a single binary that can target either a Pascal GPU or a multicore CPU. Then, at runtime, each rank calls the routine “acc_set_device_type” with either “acc_device_nvidia” or “acc_device_host” to select which target it should use.
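The build described above might look like the following (the source file name is a placeholder; flags as in the PGI 17.x toolchain):

```shell
# one binary whose OpenACC regions can run on either a Pascal GPU or the multicore CPU
pgcc -fast -ta=multicore,tesla:cc60,managed -o simulation simulation.c
```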

One thing that I’ve been wanting to try but haven’t had time for is to do the same thing within an OpenMP program. Something like the following:

    double ratio = find_bandwidth_ratio(9);   /* compute the split before the threads need it */

    #pragma omp parallel num_threads(2)
    {
        int start, end, tid;
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            start = (int)(imax * ratio) + 1;
            end   = imax;
            acc_set_device_type(acc_device_nvidia);
        } else {
            start = 1;
            end   = (int)(imax * ratio);
            acc_set_device_type(acc_device_host);
        }

        gt1[tid] = omp_get_wtime();
        #pragma acc parallel loop present(f[:imax + 2][:jmax + 2], g[:imax + 2][:jmax + 2], \
                                          rhs[:imax + 2][:jmax + 2], flag[:imax + 2][:jmax + 2])
        for (int i = start; i <= end; i++) {
            #pragma acc loop vector
            for (int j = 1; j <= jmax; j++) {
                if (flag[i][j] & C_F) {
                    /* only for fluid and non-surface cells */
                    rhs[i][j] = (
                       (f[i][j] - f[i-1][j]) / delx +
                       (g[i][j] - g[i][j-1]) / dely) / del_t;
                }
            }
        }
        gt2[tid] = omp_get_wtime();
    }

I’m not 100% sure it will work as expected, but it would be interesting to try.

Note that I added a “loop vector” pragma to the “j” loop. Alternatively, you might try collapsing the loops. Either should give you better GPU performance, since you’ll be exposing more parallelism and getting better data access.

-Mat

Hi Mat,

Targeting both the CPU and the GPU with OpenACC is a very good idea - we tried doing this, but the compiler gives a warning that multicore execution was disabled:

PGC-W-0155-OpenACC multicore code disabled inside OpenMP parallel construct (simulation.c: 62)

Is there a way to convince the compiler otherwise?

The behavior makes sense. With OpenMP you determine what each thread will do and how many threads to run; with OpenACC you describe what work needs to be done and what resources to use. You can see how using them together would be confusing.

Nested parallelism in OpenMP is supported on the CPUs, but not on the GPU.

I think OpenACC works by running everything on the GPU or everything on the host - single-core or multicore. It does not divide the work between the two.

Another way to look at this is to run a two-thread OpenMP program on the CPU. On one thread, a portion of the work is done on multicore with OpenACC; on the other thread, the rest of the work is done on the GPU with OpenACC. You would create the multicore-only/GPU-only versions of the program and determine the division of work. If no GPU is available, each thread would run on the host only.

dave

Hi Dave,

Thanks for these - the last bit of your comment is what we are trying to do now, because it is by far the cleanest: you only need to write the computational loop once. But then we run into the problem that OpenACC targeting multicore isn’t allowed inside an OpenMP region - is there any way around that? I am perfectly happy with the host target using all of the CPU cores and the GPU target using all of the GPU. OpenMP is only really needed to facilitate concurrent execution on both - I am not sure how else to do this. I could perhaps use async - but how do I then measure the time for both?

Thanks,
Istvan