OpenMP + OpenACC model

Hello,

I have a program that uses OpenACC to offload work to the GPU.

In my code, I created an OpenACC device function that is called inside a parallel loop:

#pragma acc parallel loop deviceptr(A, B, C)
{
    func(A, B, C);
}

The data have already been copied to the device, and the three device pointers A, B, and C were obtained with the acc_deviceptr() runtime routine.
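
Roughly like this (a sketch; h_A and N here are placeholder names for one of the host arrays and its size):

#include <openacc.h>

float *h_A = (float*) malloc(sizeof(float) * N);
// ... initialize h_A on the host ...
acc_copyin(h_A, sizeof(float) * N);       // allocate device memory and copy
float *A = (float*) acc_deviceptr(h_A);   // look up the matching device pointer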

For some reason, I would like to use OpenMP for host multithreading, with each thread launching its own OpenACC kernel:

#pragma omp parallel num_threads()
{
    size_t tid = omp_get_thread_num();

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma acc parallel loop deviceptr(A, B, C)
        {
            func(A, B, C);
        }
    }
}

My question is: does PGI 17 support this kind of usage? If not, is there another solution, such as the following?

#pragma acc parallel loop private(i)         // run on CPU
for (int i = 0; i < n; ++i)
{
    #pragma acc loop deviceptr(A, B, C)      // run on GPU
    {
        func(A, B, C);
    }
}

All of the code is written with OpenACC directives, but the outer parallel region targets CPU multithreading while the inner loop targets a GPU kernel.

Sincerely,
Tao

Hi Tao,

Yes, this should work fine with the PGI 2017 compilers.

Are there dependencies on A, B, or C between the OpenMP threads? i.e., are the OpenMP threads working on different A, B, and C arrays?

If not, then I'd recommend adding an async clause to the OpenACC parallel construct so the kernels run concurrently on the GPU.

Also, since there’s no loop around “func”, remove the “loop” clause and be sure to make “func” a “routine gang”.

Something like:

#pragma acc routine gang 
void func(float *, float *, float *);
...

#pragma omp parallel num_threads()
{
    size_t tid = omp_get_thread_num();

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma acc parallel deviceptr(A, B, C) async(tid)
        {
            func(A, B, C);
        }
    }
    #pragma acc wait
}
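
For the routine itself, the definition would be along these lines (a sketch; I'm assuming "func" contains a loop over the arrays, and I've added an element count parameter "len" just for illustration):

#pragma acc routine gang
void func(float *a, float *b, float *c, int len)
{
    // gang/vector parallel loop inside the device routine
    #pragma acc loop gang vector
    for (int j = 0; j < len; ++j) {
        c[j] = a[j] + b[j];   // placeholder computation
    }
}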

If you want to use multiple GPUs, be sure to set the device number for each OpenMP thread and make sure the device data resides on the correct device.
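
For example, using the OpenACC runtime API (a sketch with one OpenMP thread per GPU; acc_device_nvidia is the device type for NVIDIA GPUs):

#include <openacc.h>
...
int ngpus = acc_get_num_devices(acc_device_nvidia);

#pragma omp parallel num_threads(ngpus)
{
    int tid = omp_get_thread_num();
    // bind this OpenMP thread to its own GPU; any enter data or
    // update directives must also be issued on this device
    acc_set_device_num(tid, acc_device_nvidia);
    ...
}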

Finally, is there a reason why you're using device pointers for A, B, and C instead of using data regions? It's fine to do so, but it does make your program less portable.

-Mat

Hi Mat,

Thanks for your reply.

Are there dependencies on A, B, or C between the OpenMP threads? i.e., are the OpenMP threads working on different A, B, and C arrays?

No, there are no dependencies on these arrays between the OpenMP threads.
I'll try adding an async clause to the loop.

As for your question:

Finally, is there a reason why you’re using device pointers for A, B, and C instead of using data regions?

In my code, A, B, and C are very large arrays, and the kernel function is called over many iterations, so using a data construct to transfer the data between host and device on every call would be very inefficient.

Maybe I can use a data region with present(host pointers), as follows:

#pragma omp parallel num_threads()
{
    size_t tid = omp_get_thread_num();

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma acc data present(h_A, h_B, h_C)
        {
            #pragma acc parallel loop
            {
                func(h_A, h_B, h_C);
            }
        }
    }
}

But I don't see what performance difference there would be between using device pointers and data regions.

-Tao

In my code, A, B, and C are very large arrays, and the kernel function is called over many iterations, so using a data construct to transfer the data between host and device on every call would be very inefficient.

Granted, I don’t know your program, but I doubt using data regions higher in your code would be less efficient than using device pointers. Have you investigated using unstructured data regions?


Try something like:

float *A = (float*) malloc(sizeof(float) * N);
...
#pragma acc enter data create(A[0:N], B[0:N], C[0:N])
...
// update the device data
#pragma acc update device(A[0:N])
...

#pragma omp parallel num_threads()
{
    size_t tid = omp_get_thread_num();

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        #pragma acc parallel present(A, B, C)
        {
            func(A, B, C);
        }
    }
}
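
And once the arrays are no longer needed on the device, the unstructured region is closed with a matching exit data:

...
#pragma acc exit data delete(A[0:N], B[0:N], C[0:N])
free(A);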

-Mat